| samtools-mpileup(1) | Bioinformatics tools | samtools-mpileup(1) |
samtools-mpileup - produces "pileup" textual format from an alignment
samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
Generate text pileup output for one or multiple BAM files. Each input file produces a separate group of pileup columns in the output.
Note that there are two orthogonal ways to specify locations in the input file; via -r region and -l file. The former uses (and requires) an index to do random access while the latter streams through the file contents filtering out the specified regions, requiring no index. The two may be used in conjunction. For example a BED file containing locations of genes in chromosome 20 could be specified using -r 20 -l chr20.bed, meaning that the index is used to find chromosome 20 and then it is filtered for the regions listed in the bed file.
Unmapped reads are not considered and are always discarded. By default secondary alignments, QC failures and duplicate reads will be omitted, along with low quality bases and some reads in high depth regions. See the --ff, -Q and -d options for changing this.
Pileup format consists of TAB-separated lines, with each line representing the pileup of reads at a single genomic position.
Several columns contain numeric quality values encoded as individual ASCII characters. Each character can range from “!” to “~” and is decoded by taking its ASCII value and subtracting 33; e.g., “A” encodes the numeric value 32.
The first three columns give the position and reference:
The remaining columns show the pileup data, and are repeated for each input BAM file specified:
For each read covering the position, this column contains:
| Forward | Reverse | Meaning |
| . dot | , comma | Base matches the reference base |
| ACGTN | acgtn | Base is a mismatch to the reference base |
| > | < | Reference skip (due to CIGAR “N”) |
| * | */# | Deletion of the reference base (CIGAR “D”) |
Deleted bases are shown as “*” on both strands unless --reverse-del is used, in which case they are shown as “#” on the reverse strand.
Any output column that would be empty, such as a tag which is not present or the filtered sequence depth is zero, is reported as "*". This ensures a consistent number of columns across all reported positions.
The exact formula is complex and likely tuned to specific instruments and specific alignment tools, so this option is disabled by default (indicated as having a zero value). Variables in the formulae and their meaning are defined below.
| Variable | Meaning / formula |
| M | The number of matching CIGAR bases (operation "M", "X" or "="). |
| X | The number of substitutions with quality >= 13. |
| SubQ | The summed quality of substitution bases included in X, capped at a maximum of quality 33 per mismatching base. |
| ClipQ | The summed quality of soft-clipped or hard-clipped bases. This has no minimum or maximum quality threshold per base. For hard-clipped bases the per-base quality is taken as 13. |
| T | SubQ - 10 * log10(M^X / X!) + ClipQ/5 |
| Cap | MAX(0, INT * sqrt((INT - T) / INT)) |
Some notes on the impact of this.
The intent of this option is to work around aligners that compute a mapping quality using a local alignment without having any regard to the degree of clipping required or consideration of potential contamination or large scale insertions with respect to the reference. A record may align uniquely and have no close second match, but having a high number of mismatches may still imply that the reference is not the correct site.
However we do not recommend use of this parameter unless you fully understand the impact of it and have determined that it is appropriate for your sequencing technology.
Note that up to release 1.8, samtools would enforce a minimum value for this option. This no longer happens and the limit is set exactly as specified.
Supplying a reference file will enable base alignment quality calculation for all reads aligned to a reference in the file. See BAQ below.
Note base-quality 0 is used as a filtering mechanism for overlap removal which marks bases as having quality zero and lets the base quality filter remove them. Hence using --min-BQ 0 will make the overlapping bases reappear, albeit with quality zero.
When enabled, where the ends of a read-pair overlap the overlapping region will have one base selected and the duplicate base nullified by setting its phred score to zero. It will then be discarded by the --min-BQ option unless this is zero.
The quality values of the retained base within an overlap will be the summation of the two bases if they agree, or 0.8 times the higher of the two bases if they disagree, with the base nucleotide also being the higher confident call.
Output Options:
This option only applies to rows that have at least one read in the pileup, and only to columns for two-letter tags. Columns for empty rows will always be printed as *.
Quality values are from 0 to 255 inclusive, representing a linear scale of probability 0.0 to 1.0 in 1/256ths increments. If quality values are absent (no Ml tag) these are omitted, giving an example string of "C[+m+h]".
Note the base modifications may be identified on the reverse strand, either due to the native ability for this detection by the sequencing instrument or by the sequence subsequently being reverse complemented. This can lead to modification codes, such as "m" meaning 5mC, being shown for their complementary bases, such as "G[-m50]".
When --output-mods is selected base modifications can appear on any base in the sequence output, including during insertions. This may make parsing the string more complex, so also see the --no-output-ins-mods and --no-output-ins options to simplify this process.
Specifying this option twice also removes the "+length" portion, changing the example above to "CCCGCC".
The purpose of this change is to simplify parsing using basic regular expressions, which traditionally cannot perform counting operations. It is particularly beneficial when used in conjunction with --output-mods as the syntax of the inserted sequence is adjusted to also report possible base modifications, but see also --no-output-ins-mods as an alternative.
Specifying this option twice also removes the "-length" portion, changing the example above to "CCCGCC".
The purpose of this change is to simplify parsing using basic regular expressions, which traditionally cannot perform counting operations. See also --no-output-ins.
BAQ (Base Alignment Quality)
BAQ is the Phred-scaled probability of a read base being misaligned. It greatly helps to reduce false SNPs caused by misalignments. BAQ is calculated using the probabilistic realignment method described in the paper “Improving SNP discovery by base alignment quality”, Heng Li, Bioinformatics, Volume 27, Issue 8 <https://doi.org/10.1093/bioinformatics/btr076>
BAQ is applied to modify quality values before the -Q filtering happens and before the choice of which sequence to retain when removing overlaps.
BAQ is turned on when a reference file is supplied using the -f option. To disable it, use the -B option.
It is possible to store precalculated BAQ values in a SAM BQ:Z tag. Samtools mpileup will use the precalculated values if it finds them. The -E option can be used to make it ignore the contents of the BQ:Z tag and force it to recalculate the BAQ scores by making a new alignment.
Using range: With implicit index files in1.bam.<ext> and in2.sam.gz.<ext>,
samtools mpileup in1.bam in2.sam.gz -r chr10:100000-200000
With explicit index files,
samtools mpileup in1.bam in2.sam.gz idx/in1.csi idx/in2.csi -X -r chr10:100000-200000
With fofn being a file of input file names, and implicit index files present with inputs,
samtools mpileup -b fofn -r chr10:100000-200000
Using flags: To get reads with flags READ2 or REVERSE and not having any of SECONDARY,QCFAIL,DUP,
samtools mpileup --rf READ2,REVERSE in.sam
or
samtools mpileup --rf 144 in.sam
To get reads with flag SECONDARY,
samtools mpileup --rf SECONDARY --ff QCFAIL,DUP in.sam
Using all possible alignmentes: To show all possible alignments, either of below two equivalent commands may be used,
samtools mpileup --count-orphans --no-BAQ --max-depth 0 --fasta-ref ref_file.fa \ --min-BQ 0 --excl-flags 0 --disable-overlap-removal in.sam samtools mpileup -A -B -d 0 -f ref_file.fa -Q 0 --ff 0 -x in.sam
Written by Heng Li from the Sanger Institute.
samtools(1), samtools-depth(1), samtools-sort(1), bcftools(1)
Samtools website: <http://www.htslib.org/>
| 12 September 2024 | samtools-1.21 |