samtools - Utilities for the Sequence Alignment/Map (SAM)
format
samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
samtools index aln.sorted.bam
samtools idxstats aln.sorted.bam
samtools flagstat aln.sorted.bam
samtools stats aln.sorted.bam
samtools bedcov aln.sorted.bam
samtools depth aln.sorted.bam
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
samtools merge out.bam in1.bam in2.bam in3.bam
samtools faidx ref.fasta
samtools fqidx ref.fastq
samtools tview aln.sorted.bam ref.fasta
samtools split merged.bam
samtools quickcheck in1.bam in2.cram
samtools dict -a GRCh38 -s "Homo sapiens" ref.fasta
samtools fixmate in.namesorted.sam out.bam
samtools mpileup -C50 -f ref.fasta -r chr3:1,000-2,000 in1.bam
in2.bam
samtools flags PAIRED,UNMAP,MUNMAP
samtools fastq input.bam > output.fastq
samtools fasta input.bam > output.fasta
samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o
output.bam input.bam
samtools collate -o aln.name_collated.bam aln.sorted.bam
samtools depad input.bam
samtools markdup in.algnsorted.bam out.bam
Samtools is a set of utilities that manipulate alignments in the
BAM format. It imports from and exports to the SAM (Sequence Alignment/Map)
format, does sorting, merging and indexing, and allows one to retrieve reads
in any regions swiftly.
Samtools is designed to work on a stream. It regards an input file
`-' as the standard input (stdin) and an output file `-' as the standard
output (stdout). Several commands can thus be combined with Unix pipes.
Samtools always output warning and error messages to the standard error
output (stderr).
Samtools is also able to open a BAM (not SAM) file on a remote FTP
or HTTP server if the BAM file name starts with `ftp://' or `http://'.
Samtools checks the current working directory for the index file and will
download the index upon absence. Samtools does not retrieve the entire
alignment file unless it is asked to do so.
- view
- samtools view [options] in.sam|in.bam|in.cram
[region...]
With no options or regions specified, prints all alignments in
the specified input alignment file (in SAM, BAM, or CRAM format) to
standard output in SAM format (with no header).
You may specify one or more space-separated region
specifications after the input filename to restrict output to only those
alignments which overlap the specified region(s). Use of region
specifications requires a coordinate-sorted and indexed input file (in
BAM or CRAM format).
The -b, -C, -1, -u, -h,
-H, and -c options change the output format from the
default of headerless SAM, and the -o and -U options set
the output file name(s).
The -t and -T options provide additional
reference data. One of these two options is required when SAM input does
not contain @SQ headers, and the -T option is required whenever
writing CRAM output.
The -L, -M, -r, -R, -s,
-q, -l, -m, -f, -F, and -G
options filter the alignments that will be included in the output to
only those alignments that match certain criteria.
The -x and -B options modify the data which is
contained in each alignment.
Finally, the -@ option can be used to allocate
additional threads to be used for compression, and the -? option
requests a long help message.
- REGIONS:
Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]]
and all position coordinates are 1-based.
Important note: when multiple regions are given, some alignments
may be output multiple times if they overlap more than one of the specified
regions.
Examples of region specifications:
- chr1
- Output all alignments mapped to the reference sequence named `chr1' (i.e.
@SQ SN:chr1).
- chr2:1000000
- The region on chr2 beginning at base position 1,000,000 and ending at the
end of the chromosome.
- chr3:1000-2000
- The 1001bp region on chr3 beginning at base position 1,000 and ending at
base position 2,000 (including both end positions).
- '*'
- Output the unmapped reads at the end of the file. (This does not include
any unmapped reads placed on a reference sequence alongside their mapped
mates.)
- .
- Output all alignments. (Mostly unnecessary as not specifying a region at
all has the same effect.)
OPTIONS:
- -b
- Output in the BAM format.
- -C
- Output in the CRAM format (requires -T).
- -1
- Enable fast BAM compression (implies -b).
- -u
- Output uncompressed BAM. This option saves time spent on
compression/decompression and is thus preferred when the output is piped
to another samtools command.
- -h
- Include the header in the output.
- -H
- Output the header only.
- -c
- Instead of printing the alignments, only count them and print the total
number. All filter options, such as -f, -F, and -q,
are taken into account.
- -?
- Output long help and exit immediately.
- -o FILE
- Output to FILE [stdout].
- -U FILE
- Write alignments that are not selected by the various filter
options to FILE. When this option is used, all alignments (or all
alignments intersecting the regions specified) are written to
either the output file or this file, but never both.
- -t FILE
- A tab-delimited FILE. Each line must contain the reference name in
the first column and the length of the reference in the second column,
with one line for each distinct reference. Any additional fields beyond
the second column are ignored. This file also defines the order of the
reference sequences in sorting. If you run: `samtools faidx
<ref.fa>', the resulting index file <ref.fa>.fai can be
used as this FILE.
- -T FILE
- A FASTA format reference FILE, optionally compressed by
bgzip and ideally indexed by samtools faidx. If an
index is not present, one will be generated for you.
- -L FILE
- Only output alignments overlapping the input BED FILE [null].
- -M
- Use the multi-region iterator on the union of the BED file and
command-line region arguments. This avoids re-reading the same regions of
files so can sometimes be much faster. Note this also removes duplicate
sequences. Without this a sequence that overlaps multiple regions
specified on the command line will be reported multiple times.
- -r STR
- Output alignments in read group STR [null]. Note that records with
no RG tag will also be output when using this option. This
behaviour may change in a future release.
- -R FILE
- Output alignments in read groups listed in FILE [null]. Note that
records with no RG tag will also be output when using this option.
This behaviour may change in a future release.
- -q INT
- Skip alignments with MAPQ smaller than INT [0].
- -l STR
- Only output alignments in library STR [null].
- -m INT
- Only output alignments with number of CIGAR bases consuming query sequence
≥ INT [0]
- -f INT
- Only output alignments with all bits set in INT present in the FLAG
field. INT can be specified in hex by beginning with `0x' (i.e.
/^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/)
[0].
- -F INT
- Do not output alignments with any bits set in INT present in the
FLAG field. INT can be specified in hex by beginning with `0x'
(i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/)
[0].
- -G INT
- Do not output alignments with all bits set in INT present in the
FLAG field. This is the opposite of -f such that -f12 -G12
is the same as no filtering at all. INT can be specified in hex by
beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with
`0' (i.e. /^0[0-7]+/) [0].
- -x STR
- Read tag to exclude from output (repeatable) [null]
- -B
- Collapse the backward CIGAR operation.
- -s FLOAT
- Output only a proportion of the input alignments. This subsampling acts in
the same way on all of the alignment records in the same template or read
pair, so it never keeps a read but not its mate.
- The integer and fractional parts of the -s
INT.FRAC option are used separately: the part
after the decimal point sets the fraction of templates/pairs to be kept,
while the integer part is used as a seed that influences which
subset of reads is kept.
- When subsampling data that has previously been subsampled, be sure to use
a different seed value from those used previously; otherwise more reads
will be retained than expected.
- -@ INT
- Number of BAM compression threads to use in addition to main thread
[0].
- -S
- Ignored for compatibility with previous samtools versions. Previously this
option was required if input was in SAM format, but now the correct format
is automatically detected by examining the first few characters of
input.
- sort
- samtools sort [-l level] [-m maxMem]
[-o out.bam] [-O format] [-n]
[-t tag] [-T tmpprefix] [-@
threads] [in.sam|in.bam|in.cram]
Sort alignments by leftmost coordinates, or by read name when
-n is used. An appropriate @HD-SO sort order header tag
will be added or an existing one updated if necessary.
The sorted output is written to standard output by default, or
to the specified file (out.bam) when -o is used. This
command will also create temporary files
tmpprefix.%d.bam as needed when the entire
alignment data cannot fit into memory (as controlled via the -m
option).
Options:
- -l INT
- Set the desired compression level for the final output file, ranging from
0 (uncompressed) or 1 (fastest but minimal compression) to 9 (best
compression but slowest to write), similarly to gzip(1)'s
compression level setting.
- If -l is not used, the default compression level will apply.
- -m INT
- Approximately the maximum required memory per thread, specified either in
bytes or with a K, M, or G suffix. [768 MiB]
- To prevent sort from creating a huge number of temporary files, it
enforces a minimum value of 1M for this setting.
- -n
- Sort by read names (i.e., the QNAME field) rather than by
chromosomal coordinates.
- -t TAG
- Sort first by the value in the alignment tag TAG, then by position or name
(if also using -n). -o FILE Write the final sorted
output to FILE, rather than to standard output.
- -O FORMAT
- Write the final output as sam, bam, or cram.
By default, samtools tries to select a format based on the
-o filename extension; if output is to standard output or no
format can be deduced, bam is selected.
- -T PREFIX
- Write temporary files to PREFIX.nnnn.bam, or
if the specified PREFIX is an existing directory, to
PREFIX/samtools.mmm.mmm.tmp.nnnn.bam,
where mmm is unique to this invocation of the sort
command.
- By default, any temporary files are written alongside the output file, as
out.bam.tmp.nnnn.bam, or if output is to
standard output, in the current directory as
samtools.mmm.mmm.tmp.nnnn.bam.
- -@ INT
- Set number of sorting and compression threads. By default, operation is
single-threaded.
Ordering Rules
The following rules are used for ordering records.
If option -t is in use, records are first sorted by the
value of the given alignment tag, and then by position or name (if using
-n). For example, “-t RG” will make read group the
primary sort key. The rules for ordering by tag are:
- Records that do not have the tag are sorted before ones that do.
- If the types of the tags are different, they will be sorted so that single
character tags (type A) come before array tags (type B), then string tags
(types H and Z), then numeric tags (types f and i).
- Numeric tags (types f and i) are compared by value. Note that comparisons
of floating-point values are subject to issues of rounding and
precision.
- String tags (types H and Z) are compared based on the binary contents of
the tag using the C strcmp(3) function.
- Character tags (type A) are compared by binary character value.
- No attempt is made to compare tags of other types — notably type B
array values will not be compared.
When the -n option is present, records are sorted by name.
Names are compared so as to give a “natural” ordering —
i.e. sections consisting of digits are compared numerically while all other
sections are compared based on their binary representation. This means
“a1” will come before “b1” and
“a9” will come before “a10”. Records with the
same name will be ordered according to the values of the READ1 and READ2
flags (see flags).
When the -n option is not present, reads are sorted
by reference (according to the order of the @SQ header records), then by
position in the reference, and then by the REVERSE flag.
Note
Historically samtools sort also accepted a less flexible
way of specifying the final and temporary output filenames:
- samtools sort [-f] [-o] in.bam out.prefix
This has now been removed. The previous out.prefix argument
(and -f option, if any) should be changed to an appropriate
combination of -T PREFIX and -o FILE. The
previous -o option should be removed, as output defaults to standard
output.
- index
- samtools index [-bc] [-m INT]
aln.bam|aln.cram [out.index]
Index a coordinate-sorted BAM or CRAM file for fast random
access. (Note that this does not work with SAM files even if they are
bgzip compressed — to index such files, use tabix(1)
instead.)
This index is needed when region arguments are used to
limit samtools view and similar commands to particular regions of
interest.
If an output filename is given, the index file will be written
to out.index. Otherwise, for a CRAM file aln.cram, index
file aln.cram.crai will be created; for a BAM file
aln.bam, either aln.bam.bai or
aln.bam.csi will be created, depending on the index format
selected.
Options:
- -b
- Create a BAI index. This is currently the default when no format options
are used.
- -c
- Create a CSI index. By default, the minimum interval size for the index is
2^14, which is the same as the fixed value used by the BAI format.
- -m INT
- Create a CSI index, with a minimum interval size of 2^INT.
- idxstats
- samtools idxstats in.sam|in.bam|in.cram
Retrieve and print stats in the index file corresponding to
the input file. Before calling idxstats, the input BAM file should be
indexed by samtools index.
If run on a SAM or CRAM file or an unindexed BAM file, this
command will still produce the same summary statistics, but does so by
reading through the entire file. This is far slower than using the BAM
indices.
The output is TAB-delimited with each line consisting of
reference sequence name, sequence length, # mapped reads and # unmapped
reads. It is written to stdout.
- flagstat
- samtools flagstat in.sam|in.bam|in.cram
Does a full pass through the input file to calculate and print
statistics to stdout.
Provides counts for each of 13 categories based primarily on
bit flags in the FLAG field. Each category in the output is broken down
into QC pass and QC fail, which is presented as "#PASS +
#FAIL" followed by a description of the category.
The first row of output gives the total number of reads that
are QC pass and fail (according to flag bit 0x200). For example:
122 + 28 in total (QC-passed reads + QC-failed reads)
Which would indicate that there are a total of 150 reads in
the input file, 122 of which are marked as QC pass and 28 of which are
marked as "not passing quality controls"
Following this, additional categories are given for reads
which are:
And finally, two rows are given that additionally filter
on the reference name (RNAME), mate reference name (MRNM), and mapping quality
(MAPQ) fields:
- stats
- samtools stats [options] in.sam|in.bam|in.cram
[region...]
samtools stats collects statistics from BAM files and outputs
in a text format. The output can be visualized graphically using
plot-bamstats.
Options:
- -c, --coverage
MIN,MAX,STEP
- Set coverage distribution to the specified range (MIN, MAX, STEP all given
as integers) [1,1000,1]
- -d, --remove-dups
- Exclude from statistics reads marked as duplicates
- -f, --required-flag
STR|INT
- Required flag, 0 for unset. See also `samtools flags` [0]
- -F, --filtering-flag
STR|INT
- Filtering flag, 0 for unset. See also `samtools flags` [0]
- --GC-depth
FLOAT
- the size of GC-depth bins (decreasing bin size increases memory
requirement) [2e4]
- -h, --help
- This help message
- -i, --insert-size
INT
- Maximum insert size [8000]
- -I, --id
STR
- Include only listed read group or sample name []
- -l, --read-length
INT
- Include in the statistics only reads with the given read length []
- -m, --most-inserts
FLOAT
- Report only the main part of inserts [0.99]
- -P, --split-prefix
STR
- A path or string prefix to prepend to filenames output when creating
categorised statistics files with -S/--split. [input
filename]
- -q, --trim-quality
INT
- The BWA trimming parameter [0]
- -r, --ref-seq
FILE
- Reference sequence (required for GC-depth and mismatches-per-cycle
calculation). []
- -S, --split
TAG
- In addition to the complete statistics, also output categorised statistics
based on the tagged field TAG (e.g., use --split RG to split
into read groups).
Categorised statistics are written to files named
<prefix>_<value>.bamstat, where prefix
is as given by --split-prefix (or the input filename by default)
and value has been encountered as the specified tagged field's
value in one or more alignment records.
- -t, --target-regions
FILE
- Do stats in these regions only. Tab-delimited file chr,from,to, 1-based,
inclusive. []
- -x, --sparse
- Suppress outputting IS rows where there are no insertions.
- bedcov
- samtools bedcov [options] region.bed
in1.sam|in1.bam|in1.cram[...]
Reports the total read base count (i.e. the sum of per base
read depths) for each genomic region specified in the supplied BED file.
The regions are output as they appear in the BED file and are 0-based.
Counts for each alignment file supplied are reported in separate
columns.
Options:
- -Q INT
- Only count reads with mapping quality greater than INT
- -j
- Do not include deletions (D) and ref skips (N) in bedcov computation.
- depth
- samtools depth [options]
[in1.sam|in1.bam|in1.cram
[in2.sam|in2.bam|in2.cram] [...]]
Computes the depth at each position or region.
Options:
- -a
- Output all positions (including those with zero depth)
- -a -a, -aa
- Output absolutely all positions, including unused reference sequences.
Note that when used in conjunction with a BED file the -a option may
sometimes operate as if -aa was specified if the reference sequence has
coverage outside of the region specified in the BED file.
- -b FILE
- Compute depth at list of positions or regions in specified BED
FILE. []
- -f FILE
- Use the BAM files specified in the FILE (a file of filenames, one
file per line) []
- -l INT
- Ignore reads shorter than INT
- -m, -d
INT
- Truncate reported depth at a maximum of INT reads. [8000]. If 0,
depth is set to the maximum integer value, effectively removing any depth
limit.
- -q INT
- Only count reads with base quality greater than INT
- -Q INT
- Only count reads with mapping quality greater than INT
- -r CHR:FROM-TO
- Only report depth in specified region.
- merge
- samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b <list>]
<out.bam> <in1.bam> [<in2.bam> <in3.bam> ...
<inN.bam>]
Merge multiple sorted alignment files, producing a single
sorted output file that contains all the input records and maintains the
existing sort order.
If -h is specified the @SQ headers of input files will
be merged into the specified header, otherwise they will be merged into
a composite header created from the input headers. If in the process of
merging @SQ lines for coordinate sorted input files, a conflict arises
as to the order (for example input1.bam has @SQ for a,b,c and input2.bam
has b,a,c) then the resulting output file will need to be re-sorted back
into coordinate order.
Unless the -c or -p flags are specified then
when merging @RG and @PG records into the output header then any IDs
found to be duplicates of existing IDs in the output header will have a
suffix appended to them to differentiate them from similar header
records from other files and the read records will be updated to reflect
this.
The ordering of the records in the input files must match the
usage of the -n and -t command-line options. If they do
not, the output order will be undefined. See sort for information
about record ordering.
OPTIONS:
- -1
- Use zlib compression level 1 to compress the output.
- -b FILE
- List of input BAM files, one file per line.
- -f
- Force to overwrite the output file if present.
- -h FILE
- Use the lines of FILE as `@' headers to be copied to
out.bam, replacing any header lines that would otherwise be copied
from in1.bam. (FILE is actually in SAM format, though any
alignment records it may contain are ignored.)
- -n
- The input alignments are sorted by read names rather than by chromosomal
coordinates
- -t TAG
- The input alignments have been sorted by the value of TAG, then by either
position or name (if -n is given).
- -R STR
- Merge files in the specified region indicated by STR [null]
- -r
- Attach an RG tag to each alignment. The tag value is inferred from file
names.
- -u
- Uncompressed BAM output
- -c
- When several input files contain @RG headers with the same ID, emit only
one of them (namely, the header line from the first file we find that ID
in) to the merged output file. Combining these similar headers is usually
the right thing to do when the files being merged originated from the same
file.
Without -c, all @RG headers appear in the output file,
with random suffixes added to their IDs where necessary to differentiate
them.
- -p
- Similarly, for each @PG ID in the set of files to merge, use the @PG line
of the first file we find that ID in rather than adding a suffix to
differentiate similar IDs.
- faidx
- samtools faidx <ref.fasta> [region1 [...]]
Index reference sequence in the FASTA format or extract
subsequence from indexed reference sequence. If no region is specified,
faidx will index the file and create <ref.fasta>.fai
on the disk. If regions are specified, the subsequences will be
retrieved and printed to stdout in the FASTA format.
The input file can be compressed in the BGZF
format.
The sequences in the input file should all have different
names. If they do not, indexing will emit a warning about duplicate
sequences and retrieval will only produce subsequences from the first
sequence with the duplicated name.
FASTQ files can be read and indexed by this command. Without
using --fastq any extracted subsequence will be in FASTA
format.
Options
- -o, --output
FILE
- Write FASTA to file rather than to stdout.
- -n, --length
INT
- Length of FASTA sequence line. [60]
- -c, --continue
- Continue working if a non-existant region is requested.
- -r, --region-file
FILE
- Read regions from a file. Format is chr:from-to, one per line.
- -f, --fastq
- Read FASTQ files and output extracted sequences in FASTQ format. Same as
using samtools fqidx.
- -i,
--reverse-complement
- Output the sequence as the reverse complement. When this option is used,
“/rc” will be appended to the sequence names. To turn this
off or change the string appended, use the --mark-strand
option.
- --mark-strand
TYPE
- Append strand indicator to sequence name. TYPE can be one of:
- rc
- Append '/rc' when writing the reverse complement. This is the
default.
- no
- Do not append anything.
- sign
- Append '(+)' for forward strand or '(-)' for reverse complement. This
matches the output of “bedtools getfasta -s”.
- custom,<pos>,<neg>
- Append string <pos> to names when writing the forward strand and
<neg> when writing the reverse strand. Spaces are preserved, so it
is possible to move the indicator into the comment part of the description
line by including a leading space in the strings <pos> and
<neg>.
- -h, --help
- Print help message and exit.
- fqidx
- samtools fqidx <ref.fastq> [region1 [...]]
Index reference sequence in the FASTQ format or extract
subsequence from indexed reference sequence. If no region is specified,
fqidx will index the file and create <ref.fastq>.fai
on the disk. If regions are specified, the subsequences will be
retrieved and printed to stdout in the FASTQ format.
The input file can be compressed in the BGZF
format.
The sequences in the input file should all have different
names. If they do not, indexing will emit a warning about duplicate
sequences and retrieval will only produce subsequences from the first
sequence with the duplicated name.
samtools fqidx should only be used on fastq files with
a small number of entries. Trying to use it on a file containing
millions of short sequencing reads will produce an index that is almost
as big as the original file, and searches using the index will be very
slow and use a lot of memory.
Options
- -o, --output
FILE
- Write FASTQ to file rather than to stdout.
- -n, --length
INT
- Length of FASTQ sequence line. [60]
- -c, --continue
- Continue working if a non-existant region is requested.
- -r, --region-file
FILE
- Read regions from a file. Format is chr:from-to, one per line.
- -i,
--reverse-complement
- Output the sequence as the reverse complement. When this option is used,
“/rc” will be appended to the sequence names. To turn this
off or change the string appended, use the --mark-strand
option.
- --mark-strand
TYPE
- Append strand indicator to sequence name. TYPE can be one of:
- rc
- Append '/rc' when writing the reverse complement. This is the
default.
- no
- Do not append anything.
- sign
- Append '(+)' for forward strand or '(-)' for reverse complement. This
matches the output of “bedtools getfasta -s”.
- custom,<pos>,<neg>
- Append string <pos> to names when writing the forward strand and
<neg> when writing the reverse strand. Spaces are preserved, so it
is possible to move the indicator into the comment part of the description
line by including a leading space in the strings <pos> and
<neg>.
- -h, --help
- Print help message and exit.
- tview
- samtools tview [-p chr:pos] [-s STR]
[-d display] <in.sorted.bam> [ref.fasta]
Text alignment viewer (based on the ncurses library). In the
viewer, press `?' for help and press `g' to check the alignment start
from a region in the format like `chr10:10,000,000' or `=10,000,000'
when viewing the same reference sequence.
Options:
- -d display
- Output as (H)tml or (C)urses or (T)ext
- -p chr:pos
- Go directly to this position
- -s STR
- Display only alignments from this sample or read group
- split
- samtools split [options]
merged.sam|merged.bam|merged.cram
Splits a file by read group.
Options:
- -u FILE1
- Put reads with no RG tag or an unrecognised RG tag into FILE1
- -u FILE1:FILE2
- As above, but assigns an RG tag as given in the header of
FILE2
- -f STRING
- Output filename format string (see below) ["%*_%#.%."]
- -v
- Verbose output
Format string expansions:
%% |
% |
%* |
basename |
%# |
@RG index |
%! |
@RG ID |
%. |
output format filename extension |
- quickcheck
- samtools quickcheck [options]
in.sam|in.bam|in.cram [ ... ]
Quickly check that input files appear to be intact. Checks
that beginning of the file contains a valid header (all formats)
containing at least one target sequence and then seeks to the end of the
file and checks that an end-of-file (EOF) is present and intact (BAM
only).
Data in the middle of the file is not read since that would be
much more time consuming, so please note that this command will not
detect internal corruption, but is useful for testing that files are not
truncated before performing more intensive tasks on them.
This command will exit with a non-zero exit code if any input
files don't have a valid header or are missing an EOF block. Otherwise
it will exit successfully (with a zero exit code).
Options:
- -v
- Verbose output: will additionally print the names of all input files that
don't pass the check to stdout. Multiple -v options will cause additional
messages regarding check results to be printed to stderr.
- -q
- Quiet mode: disables warning messages on stderr about files that fail. If
both -q and -v options are used then the appropriate level of -v takes
precedence.
- dict
- samtools dict <ref.fasta|ref.fasta.gz>
Create a sequence dictionary file from a fasta file.
OPTIONS:
- fixmate
- samtools fixmate [-rpcm] [-O format]
in.nameSrt.bam out.bam
Fill in mate coordinates, ISIZE and mate related flags from a
name-sorted alignment.
OPTIONS:
- -r
- Remove secondary and unmapped reads.
- -p
- Disable FR proper pair check.
- -c
- Add template cigar ct tag.
- -m
- Add ms (mate score) tags. These are used by markdup to select the
best reads to keep.
- -O FORMAT
- Write the final output as sam, bam, or cram.
By default, samtools tries to select a format based on the
output filename extension; if output is to standard output or no format
can be deduced, bam is selected.
- mpileup
- samtools mpileup [-EB] [-C capQcoef] [-r
reg] [-f in.fa] [-l list] [-Q
minBaseQ] [-q minMapQ] in.bam [in2.bam
[...]]
Generate pileup for one or multiple BAM files. Alignment
records are grouped by sample (SM) identifiers in @RG header lines. If
sample identifiers are absent, each input file is regarded as one
sample.
Samtools mpileup can still produce VCF and BCF output, but
this feature is deprecated and will be removed in a future release.
Please use bcftools mpileup for this instead. (Documentation on
the deprecated options has been removed from this manual page, but older
versions are available online at
<http://www.htslib.org/doc/>.)
In the pileup format (without
-u or -g), each line represents a genomic
position, consisting of chromosome name, 1-based coordinate, reference
base, the number of reads covering the site, read bases, base qualities
and alignment mapping qualities. Information on match, mismatch, indel,
strand, mapping quality and start and end of a read are all encoded at
the read base column. At this column, a dot stands for a match to the
reference base on the forward strand, a comma for a match on the reverse
strand, a '>' or '<' for a reference skip, `ACGTN' for a mismatch
on the forward strand and `acgtn' for a mismatch on the reverse strand.
A pattern `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion
between this reference position and the next reference position. The
length of the insertion is given by the integer in the pattern, followed
by the inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
represents a deletion from the reference. The deleted bases will be
presented as `*' in the following lines. Also at the read base column, a
symbol `^' marks the start of a read. The ASCII of the character
following `^' minus 33 gives the mapping quality. A symbol `$' marks the
end of a read segment.
Note that there are two orthogonal ways to specify locations
in the input file; via -r region and -l
file. The former uses (and requires) an index to do random access
while the latter streams through the file contents filtering out the
specified regions, requiring no index. The two may be used in
conjunction. For example a BED file containing locations of genes in
chromosome 20 could be specified using -r 20 -l chr20.bed,
meaning that the index is used to find chromosome 20 and then it is
filtered for the regions listed in the bed file.
Input Options:
- -6, --illumina1.3+
- Assume the quality is in the Illumina 1.3+ encoding.
- -A,
--count-orphans
- Do not skip anomalous read pairs in variant calling.
- -b, --bam-list FILE
- List of input BAM files, one file per line [null]
- -B, --no-BAQ
- Disable base alignment quality (BAQ) computation. See BAQ
below.
- -C, --adjust-MQ INT
- Coefficient for downgrading mapping quality for reads containing excessive
mismatches. Given a read with a phred-scaled probability q of being
generated from the mapped position, the new mapping quality is about
sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if
enabled, the recommended value for BWA is 50. [0]
- -d, --max-depth INT
- At a position, read maximally INT reads per input file. Setting
this limit reduces the amount of memory and time needed to process regions
with very high coverage. Passing zero for this option sets it to the
highest possible value, effectively removing the depth limit. [8000]
Note that up to release 1.8, samtools would enforce a minimum
value for this option. This no longer happens and the limit is set
exactly as specified.
- -E, --redo-BAQ
- Recalculate BAQ on the fly, ignore existing BQ tags. See BAQ
below.
- -f, --fasta-ref FILE
- The faidx-indexed reference file in the FASTA format. The file can
be optionally compressed by bgzip. [null]
Supplying a reference file will enable base alignment quality
calculation for all reads aligned to a reference in the file. See
BAQ below.
- -G, --exclude-RG FILE
- Exclude reads from readgroups listed in FILE (one @RG-ID per line)
- -l, --positions FILE
- BED or position list file containing a list of regions or sites where
pileup or BCF should be generated. Position list files contain two columns
(chromosome and position) and start counting from 1. BED files contain at
least 3 columns (chromosome, start and end position) and are 0-based
half-open.
While it is possible to mix both position-list and BED coordinates in the
same file, this is strongly ill advised due to the differing coordinate
systems. [null]
- -q, -min-MQ INT
- Minimum mapping quality for an alignment to be used [0]
- -Q, --min-BQ INT
- Minimum base quality for a base to be considered [13]
- -r, --region STR
- Only generate pileup in region. Requires the BAM files to be indexed. If
used in conjunction with -l then considers the intersection of the two
requests. STR [all sites]
- -R, --ignore-RG
- Ignore RG tags. Treat all reads in one BAM as one sample.
- --rf, --incl-flags STR|INT
- Required flags: skip reads with mask bits unset [null]
- --ff, --excl-flags STR|INT
- Filter flags: skip reads with mask bits set
[UNMAP,SECONDARY,QCFAIL,DUP]
- -x, --ignore-overlaps
- Disable read-pair overlap detection.
Output Options:
- -o, --output
FILE
- Write pileup output to FILE, rather than the default of standard
output.
(The same short option is used for both the deprecated
--open-prob option and --output . If -o's argument
contains any non-digit characters other than a leading + or - sign, it
is interpreted as --output. Usually the filename extension will
take care of this, but to write to an entirely numeric filename use
-o ./123 or --output 123.)
- -O, --output-BP
- Output base positions on reads.
- -s,
--output-MQ
- Output mapping quality.
- --output-QNAME
- Output an extra column containing comma-separated read names.
- -a
- Output all positions, including those with zero depth.
- -a -a, -aa
- Output absolutely all positions, including unused reference sequences.
Note that when used in conjunction with a BED file the -a option may
sometimes operate as if -aa was specified if the reference sequence has
coverage outside of the region specified in the BED file.
BAQ (Base Alignment Quality)
BAQ is the Phred-scaled probability of a read base being
misaligned. It greatly helps to reduce false SNPs caused by misalignments.
BAQ is calculated using the probabilistic realignment method described in
the paper “Improving SNP discovery by base alignment quality”,
Heng Li, Bioinformatics, Volume 27, Issue 8
<https://doi.org/10.1093/bioinformatics/btr076>
BAQ is turned on when a reference file is supplied using the
-f option. To disable it, use the -B option.
It is possible to store pre-calculated BAQ values in a SAM BQ:Z
tag. Samtools mpileup will use the precalculated values if it finds them.
The -E option can be used to make it ignore the contents of the BQ:Z
tag and force it to recalculate the BAQ scores by making a new
alignment.
- flags
- samtools flags INT|STR[,...]
Convert between textual and numeric flag representation.
FLAGS:
0x1 |
PAIRED |
paired-end (or multiple-segment) sequencing technology |
0x2 |
PROPER_PAIR |
each segment properly aligned according to the aligner |
0x4 |
UNMAP |
segment unmapped |
0x8 |
MUNMAP |
next segment in the template unmapped |
0x10 |
REVERSE |
SEQ is reverse complemented |
0x20 |
MREVERSE |
SEQ of the next segment in the template is reverse complemented |
0x40 |
READ1 |
the first segment in the template |
0x80 |
READ2 |
the last segment in the template |
0x100 |
SECONDARY |
secondary alignment |
0x200 |
QCFAIL |
not passing quality controls |
0x400 |
DUP |
PCR or optical duplicate |
0x800 |
SUPPLEMENTARY |
supplementary alignment |
- fastq/a
- samtools fastq [options] in.bam
samtools fasta [options] in.bam
Converts a BAM or CRAM into either FASTQ or FASTA format
depending on the command invoked. The files will be automatically
compressed if the file names have a .gz or .bgzf extension.
The input to this program must be collated by name. Use
samtools collate or samtools sort -n to ensure this.
For each different QNAME, the input records are categorised
according to the state of the READ1 and READ2 flag bits. The three
categories used are:
1 : Only READ1 is set.
2 : Only READ2 is set.
0 : Either both READ1 and READ2 are set; or neither is
set.
The exact meaning of these categories depends on the
sequencing technology used. It is expected that ordinary single and
paired-end sequencing reads will be in categories 1 and 2 (in the case
of paired-end reads, one read of the pair will be in category 1, the
other in category 2). Category 0 is essentially a
“catch-all” for reads that do not fit into a simple
paired-end sequencing model.
For each category only one sequence will be written for a
given QNAME. If more than one record is available for a given QNAME and
category, the first in input file order that has quality values will be
used. If none of the candidate records has quality values, then the
first in input file order will be used instead.
Sequences will be written to standard output unless one of the
-1,-2, or -0 options is used, in which case
sequences for that category will be written to the specified file.
If a singleton file is specified using the -s option
then only paired sequences will be output for categories 1 and 2; paired
meaning that for a given QNAME there are sequences for both category 1
and 2. If there is a sequence for only one of categories 1 or 2
then it will be diverted into the specified singletons file. This can be
used to prepare fastq files for programs that cannot handle a mixture of
paired and singleton reads.
The -s option only affects category 1 and 2 records.
The output for category 0 will be the same irrespective of the use of
this option.
OPTIONS:
- -n
- By default, either '/1' or '/2' is added to the end of read names where
the corresponding READ1 or READ2 FLAG bit is set. Using -n causes
read names to be left as they are.
- -N
- Always add either '/1' or '/2' to the end of read names even when put into
different files.
- -O
- Use quality values from OQ tags in preference to standard quality string
if available.
- -s FILE
- Write singleton reads to FILE.
- -t
- Copy RG, BC and QT tags to the FASTQ header line, if they exist.
- -T TAGLIST
- Specify a comma-separated list of tags to copy to the FASTQ header line,
if they exist.
- -1 FILE
- Write reads with the READ1 FLAG set (and READ2 not set) to FILE instead of
outputting them. If the -s option is used, only paired reads will
be written to this file.
- -2 FILE
- Write reads with the READ2 FLAG set (and READ1 not set) to FILE instead of
outputting them. If the -s option is used, only paired reads will
be written to this file.
- -0 FILE
- Write reads where the READ1 and READ2 FLAG bits set are either both set or
both unset to FILE instead of outputting them.
- -f INT
- Only output alignments with all bits set in INT present in the FLAG
field. INT can be specified in hex by beginning with `0x' (i.e.
/^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/)
[0].
- -F INT
- Do not output alignments with any bits set in INT present in the
FLAG field. INT can be specified in hex by beginning with `0x'
(i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/)
[0].
- -G INT
- Only EXCLUDE reads with all of the bits set in INT present in the
FLAG field. INT can be specified in hex by beginning with `0x'
(i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/)
[0].
- -i
- add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
- -c [0..9]
- set compression level when writing gz or bgzf fastq files.
- --i1 FILE
- write first index reads to FILE
- --i2 FILE
- write second index reads to FILE
- --barcode-tag
TAG
- aux tag to find index reads in [default: BC]
- --quality-tag
TAG
- aux tag to find index quality in [default: QT]
- --index-format
STR
- string to describe how to parse the barcode and quality tags. For example:
- i14i8
- the first 14 characters are index 1, the next 8 characters are index
2
- n8i14
- ignore the first 8 characters, and use the next 14 characters for index 1
If the tag contains a separator, then the numeric part can be
replaced with '*' to mean 'read until the separator or end of tag', for
example:
- n*i*
- ignore the left part of the tag until the separator, then use the second
part
EXAMPLES
Output paired reads to separate files, discarding singletons,
supplementary and secondary reads. The resulting files can be used with, for
example, the bwa aligner.
samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n -F 0x900 in.bam
Output paired and singleton reads in a single file, discarding
supplementary and secondary reads. To get all of the reads in a single file,
it is necessary to redirect the output of samtools fastq. The output file is
suitable for use with bwa mem -p which understands interleaved files
containing a mixture of paired and singleton reads.
samtools fastq -0 /dev/null -F 0x900 in.bam > all_reads.fq
Output paired reads in a single file, discarding supplementary and
secondary reads. Save any singletons in a separate file. Append /1 and /2 to
read names. This format is suitable for use by NextGenMap when using
its -p and -q options. With this aligner, paired reads must be
mapped separately to the singletons.
samtools fastq -0 /dev/null -s single.fq -N -F 0x900 in.bam > paired.fq
BUGS
- o
- The way of specifying output files is far to complicated and easy to get
wrong.
- o
- The default value for the -F option should really be 0x900 so that
secondary and supplementary reads are automatically excluded. The existing
default of 0 is retained for reasons of compatibility.
- collate
- samtools collate [options]
in.sam|in.bam|in.cram [<prefix>]
Shuffles and groups reads together by their names. A faster
alternative to a full query name sort, collate ensures that reads
of the same name are grouped together in contiguous groups, but doesn't
make any guarantees about the order of read names between groups.
The output from this command should be suitable for any
operation that requires all reads from the same template to be grouped
together.
If present, <prefix> is used to name the temporary files
that collate uses when sorting the data. If neither the '-O' nor '-o'
options are used, <prefix> must be present and collate will use it
to make an output file name by appending a suffix depending on the
format written (.bam by default).
If either the -O or -o option is used, <prefix> is
optional. If <prefix> is absent, collate will write the temporary
files to a system-dependent location (/tmp on UNIX).
Using -f for fast mode will output only primary
alignments that have either the READ1 or READ2 flags set (but not
both). Any other alignment records will be filtered out. The collation
will only work correctly if there are no more than two reads for any
given QNAME after filtering.
Fast mode keeps a buffer of alignments in memory so that it
can write out most pairs as soon as they are found instead of storing
them in temporary files. This allows collate to avoid some work and so
finish more quickly compared to the standard mode. The number of
alignments held can be changed using -r, storing more alignments uses
more memory but increases the number of pairs that can be written
early.
While collate normally randomises the ordering of read pairs,
fast mode does not. Position-dependent biases that would normally be
broken up can remain in the fast collate output. It is therefore not a
good idea to use fast mode when preparing data for programs that expect
randomly ordered paired reads. For example using fast collate instead of
the standard mode may lead to significantly different results from
aligners that estimate library insert sizes on batches of reads.
Options:
- -O
- Output to stdout. This option cannot be used with '-o'.
- -o FILE
- Write output to FILE. This option cannot be used with '-O'.
- -u
- Write uncompressed BAM output
- -l INT
- Compression level. [1]
- -n INT
- Number of temporary files to use. [64]
- -f
- Fast mode (primary alignments only).
- -r INT
- Number of reads to store in memory (for use with -f). [10000]
- samtools reheader [-iP] in.header.sam in.bam
Replace the header in in.bam with the header in
in.header.sam. This command is much faster than replacing the
header with a BAM→SAM→BAM conversion.
By default this command outputs the BAM or CRAM file to
standard output (stdout), but for CRAM format files it has the option to
perform an in-place edit, both reading and writing to the same file. No
validity checking is performed on the header, nor that it is suitable to
use with the sequence data itself.
OPTIONS:
- -P, --no-PG
- Do not generate an @PG header line.
- -i, --in-place
- Perform the header edit in-place, if possible. This only works on CRAM
files and only if there is sufficient room to store the new header. The
amount of space available will differ for each CRAM file.
- cat
- samtools cat [-b list] [-h header.sam] [-o out.bam] <in1.bam>
<in2.bam> [ ... ]
Concatenate BAMs or CRAMs. Although this works on either BAM
or CRAM, all input files must be the same format as each other. The
sequence dictionary of each input file must be identical, although this
command does not check this. This command uses a similar trick to
reheader which enables fast BAM concatenation.
OPTIONS:
- -b FOFN
- Read the list of input BAM or CRAM files from FOFN. These are
concatenated prior to any files specified on the command line. Multiple
-b FOFN options may be specified to concatenate multiple
lists of BAM/CRAM files.
- -h FILE
- Uses the SAM header from FILE. By default the header is taken from
the first file to be concatenated.
- -o FILE
- Write the concatenated output to FILE. By default this is sent to
stdout.
- rmdup
- samtools rmdup [-sS] <input.srt.bam> <out.bam>
This command is obsolete. Use markdup instead.
Remove potential PCR duplicates: if multiple read pairs have
identical external coordinates, only retain the pair with highest
mapping quality. In the paired-end mode, this command ONLY works
with FR orientation and requires ISIZE is correctly set. It does not
work for unpaired reads (e.g. two ends mapped to different chromosomes
or orphan reads).
OPTIONS:
- -s
- Remove duplicates for single-end reads. By default, the command works for
paired-end reads only.
- -S
- Treat paired-end reads and single-end reads.
- addreplacerg
- samtools addreplacerg [-r rg line | -R rg ID] [-m mode] [-l level] [-o
out.bam] <input.bam>
Adds or replaces read group tags in a file.
OPTIONS:
- -r STRING
- Allows you to specify a read group line to append to the header and
applies it to the reads specified by the -m option. If repeated it
automatically adds in tabs between invocations.
- -R STRING
- Allows you to specify the read group ID of an existing @RG line and
applies it to the reads specified.
- -m MODE
- If you choose orphan_only then existing RG tags are not overwritten, if
you choose overwrite_all, existing RG tags are overwritten. The default is
overwrite_all.
- -o STRING
- Write the final output to STRING. The default is to write to stdout.
By default, samtools tries to select a format based on the
output filename extension; if output is to standard output or no format
can be deduced, bam is selected.
- calmd
- samtools calmd [-Eeubr] [-C capQcoef] <aln.bam> <ref.fasta>
Generate the MD tag. If the MD tag is already present, this
command will give a warning if the MD tag generated is different from
the existing tag. Output SAM by default.
Calmd can also read and write CRAM files although in most
cases it is pointless as CRAM recalculates MD and NM tags on the fly.
The one exception to this case is where both input and output CRAM files
have been / are being created with the no_ref option.
OPTIONS:
- -A
- When used jointly with -r this option overwrites the original base
quality.
- -e
- Convert a the read base to = if it is identical to the aligned reference
base. Indel caller does not support the = bases at the moment.
- -u
- Output uncompressed BAM
- -b
- Output compressed BAM
- -C INT
- Coefficient to cap mapping quality of poorly mapped reads. See the
pileup command for details. [0]
- -r
- Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).
- -E
- Extended BAQ calculation. This option trades specificity for sensitivity,
though the effect is minor.
- targetcut
- samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2]
[-f ref] <in.bam>
This command identifies target regions by examining the
continuity of read depth, computes haploid consensus sequences of
targets and outputs a SAM with each sequence corresponding to a target.
When option -f is in use, BAQ will be applied. This command is
only designed for cutting fosmid clones from fosmid pool
sequencing [Ref. Kitzman et al. (2010)].
- phase
- samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ]
<in.bam>
Call and phase heterozygous SNPs.
OPTIONS:
- -A
- Drop reads with ambiguous phase.
- -b STR
- Prefix of BAM output. When this option is in use, phase-0 reads will be
saved in file STR.0.bam and phase-1 reads in STR.1.bam.
Phase unknown reads will be randomly allocated to one of the two files.
Chimeric reads with switch errors will be saved in
STR.chimeric.bam. [null]
- -F
- Do not attempt to fix chimeric reads.
- -k INT
- Maximum length for local phasing. [13]
- -q INT
- Minimum Phred-scaled LOD to call a heterozygote. [40]
- -Q INT
- Minimum base quality to be used in het calling. [13]
- depad
- samtools depad [-SsCu1] [-T ref.fa] [-o output] <in.bam>
Converts a BAM aligned against a padded reference to a BAM
aligned against the depadded reference. The padded reference may contain
verbatim "*" bases in it, but "*" bases are also
counted in the reference numbering. This means that a sequence base-call
aligned against a reference "*" is considered to be a cigar
match ("M" or "X") operator (if the base-call is
"A", "C", "G" or "T"). After
depadding the reference "*" bases are deleted and such aligned
sequence base-calls become insertions. Similarly transformations apply
for deletions and padding cigar operations.
OPTIONS:
- -S
- Ignored for compatibility with previous samtools versions. Previously this
option was required if input was in SAM format, but now the correct format
is automatically detected by examining the first few characters of
input.
- -s
- Output in SAM format. The default is BAM.
- -C
- Output in CRAM format. The default is BAM.
- -u
- Do not compress the output. Applies to either BAM or CRAM output
format.
- -1
- Enable fastest compression level. Only works for BAM or CRAM output.
- -T FILE
- Provides the padded reference file. Note that without this the @SQ line
lengths will be incorrect, so for most use cases this option will be
considered as mandatory.
- -o FILE
- Specifies the output filename. By default output is sent to stdout.
- markdup
- samtools markdup [-l length] [-r] [-s]
[-T] [-S] in.algsort.bam out.bam
Mark duplicate alignments from a coordinate sorted file that
has been run through fixmate with the -m option. This program relies on
the MC and ms tags that fixmate provides.
- -l INT
- Expected maximum read length of INT bases. [300]
- -r
- Remove duplicate reads.
- -s
- Print some basic stats.
- -T PREFIX
- Write temporary files to
PREFIX.samtools.nnnn.mmmm.tmp
- -S
- Mark supplementary reads of duplicates as duplicates.
EXAMPLE
# The first sort can be omitted if the file is already name ordered
samtools sort -n -o namesort.bam example.bam
# Add ms and MC tags for markdup to use later
samtools fixmate -m namesort.bam fixmate.bam
# Markdup needs position order
samtools sort -o positionsort.bam fixmate.bam
# Finally mark duplicates
samtools markdup positionsort.bam markdup.bam
- help, --help
- Display a brief usage message listing the samtools commands available. If
the name of a command is also given, e.g.,
samtools help view, the detailed usage message for
that particular command is displayed.
- --version
- Display the version numbers and copyright information for samtools and the
important libraries used by samtools.
- --version-only
- Display the full samtools version number in a machine-readable
format.
Several long-options are shared between multiple samtools
subcommands: --input-fmt, --input-fmt-option,
--output-fmt, --output-fmt-option, and --reference. The
input format is typically auto-detected so specifying the format is usually
unnecessary and the option is included for completeness. Note that not all
subcommands have all options. Consult the subcommand help for more
details.
Format strings recognised are "sam", "bam" and
"cram". They may be followed by a comma separated list of options
as key or key=value. See below for examples.
The fmt-option arguments accept either a single
option or option=value. Note that some options only
work on some file formats and only on read or write streams. If value is
unspecified for a boolean option, the value is assumed to be 1. The valid
options are as follows.
- level=INT
- Output only. Specifies the compression level from 1 to 9, or 0 for
uncompressed.
- nthreads=INT
- Specifies the number of threads to use during encoding and/or decoding.
For BAM this will be encoding only. In CRAM the threads are dynamically
shared between encoder and decoder.
- reference=fasta_file
- Specifies a FASTA reference file for use in CRAM encoding or decoding. It
usually is not required for decoding except in the situation of the MD5
not being obtainable via the REF_PATH or REF_CACHE environment
variables.
- decode_md=0|1
- CRAM input only; defaults to 1 (on). CRAM does not typically store MD and
NM tags, preferring to generate them on the fly. This option controls this
behaviour. It can be particularly useful when combined with a file encoded
using store_md=1 and store_nm=1.
- store_md=0|1
- CRAM output only; defaults to 0 (off). CRAM normally only stores MD tags
when no reference is unknown and lets the decoder generate these values
on-the-fly (see decode_md).
- store_nm=0|1
- CRAM output only; defaults to 0 (off). CRAM normally only stores NM tags
when no reference is unknown and lets the decoder generate these values
on-the-fly (see decode_md).
- ignore_md5=0|1
- CRAM input only; defaults to 0 (off). When enabled, md5 checksum errors on
the reference sequence and block checksum errors within CRAM are ignored.
Use of this option is strongly discouraged.
- required_fields=bit-field
- CRAM input only; specifies which SAM columns need to be populated. By
default all fields are used. Limiting the decode to specific columns can
have significant performance gains. The bit-field is a numerical value
constructed from the following table.
0x1 |
SAM_QNAME |
0x2 |
SAM_FLAG |
0x4 |
SAM_RNAME |
0x8 |
SAM_POS |
0x10 |
SAM_MAPQ |
0x20 |
SAM_CIGAR |
0x40 |
SAM_RNEXT |
0x80 |
SAM_PNEXT |
0x100 |
SAM_TLEN |
0x200 |
SAM_SEQ |
0x400 |
SAM_QUAL |
0x800 |
SAM_AUX |
0x1000 |
SAM_RGAUX |
- name_prefix=string
- CRAM input only; defaults to output filename. Any sequences with
auto-generated read names will use string as the name prefix.
- multi_seq_per_slice=0|1
- CRAM output only; defaults to 0 (off). By default CRAM generates one
container per reference sequence, except in the case of many small
references (such as a fragmented assembly).
- version=major.minor
- CRAM output only. Specifies the CRAM version number. Acceptable values are
"2.1" and "3.0".
- seqs_per_slice=INT
- CRAM output only; defaults to 10000.
- slices_per_container=INT
- CRAM output only; defaults to 1. The effect of having multiple slices per
container is to share the compression header block between multiple
slices. This is unlikely to have any significant impact unless the number
of sequences per slice is reduced. (Together these two options control the
granularity of random access.)
- embed_ref=0|1
- CRAM output only; defaults to 0 (off). If 1, this will store portions of
the reference sequence in each slice, permitting decode without having
requiring an external copy of the reference sequence.
- no_ref=0|1
- CRAM output only; defaults to 0 (off). If 1, sequences will be stored
verbatim with no reference encoding. This can be useful if no reference is
available for the file.
- use_bzip2=0|1
- CRAM output only; defaults to 0 (off). Permits use of bzip2 in CRAM block
compression.
- use_lzma=0|1
- CRAM output only; defaults to 0 (off). Permits use of lzma in CRAM block
compression.
- lossy_names=0|1
- CRAM output only; defaults to 0 (off). If 1, templates with all members
within the same CRAM slice will have their read names removed. New names
will be automatically generated during decoding. Also see the
name_prefix option.
For example:
samtools view --input-fmt-option decode_md=0
--output-fmt cram,version=3.0 --output-fmt-option embed_ref
--output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
The CRAM format requires use of a reference sequence for both
reading and writing.
When reading a CRAM the @SQ headers are interrogated to
identify the reference sequence MD5sum (M5: tag) and the local
reference sequence filename (UR: tag). Note that http:// and
ftp:// based URLs in the UR: field are not used, but local fasta
filenames (with or without file://) can be used.
To create a CRAM the @SQ headers will also be read to
identify the reference sequences, but M5: and UR: tags may not be present.
In this case the -T and -t options of samtools view may be
used to specify the fasta or fasta.fai filenames respectively (provided the
.fasta.fai file is also backed up by a .fasta file).
The search order to obtain a reference is:
- 1.
- Use any local file specified by the command line options (eg -T).
- 2.
- Look for MD5 via REF_CACHE environment variable.
- 3.
- Look for MD5 in each element of the REF_PATH environment variable.
- 4.
- Look for a local file listed in the UR: header tag.
- HTS_PATH
- A colon-separated list of directories in which to search for HTSlib
plugins. If $HTS_PATH starts or ends with a colon or contains a double
colon (::), the built-in list of directories is searched at that
point in the search.
If no HTS_PATH variable is defined, the built-in list of
directories specified when HTSlib was built is used, which typically
includes /usr/local/libexec/htslib and similar directories.
- REF_PATH
- A colon separated (semi-colon on Windows) list of locations in which to
look for sequences identified by their MD5sums. This can be either a list
of directories or URLs. Note that if a URL is included then the colon in
http:// and ftp:// and the optional port number will be treated as part of
the URL and not a PATH field separator. For URLs, the text %s will
be replaced by the MD5sum being read.
If no REF_PATH has been specified it will default to
http://www.ebi.ac.uk/ena/cram/md5/%s and if REF_CACHE is also
unset, it will be set to $XDG_CACHE_HOME/hts-ref/%2s/%2s/%s. If
$XDG_CACHE_HOME is unset, $HOME/.cache (or a local system
temporary directory if no home directory is found) will be used
similarly.
- REF_CACHE
- This can be defined to a single directory housing a local cache of
references. Upon downloading a reference it will be stored in the location
pointed to by REF_CACHE. When reading a reference it will be looked for in
this directory before searching REF_PATH. To avoid many files being stored
in the same directory, a pathname may be constructed using %nums
and %s notation, consuming num characters of the MD5sum. For
example /local/ref_cache/%2s/%2s/%s will create 2 nested
subdirectories with the filenames in the deepest directory being the last
28 characters of the md5sum.
The REF_CACHE directory will be searched for before attempting
to load via the REF_PATH search list. If no REF_PATH is defined, both
REF_PATH and REF_CACHE will be automatically set (see above), but if
REF_PATH is defined and REF_CACHE not then no local cache is used.
To aid population of the REF_CACHE directory a script
misc/seq_cache_populate.pl is provided in the Samtools
distribution. This takes a fasta file or a directory of fasta files and
generates the MD5sum named files.
- o
- Import SAM to BAM when @SQ lines are present in the header:
samtools view -bS aln.sam > aln.bam
If @SQ lines are absent:
samtools faidx ref.fa
samtools view -bt ref.fa.fai aln.sam > aln.bam
where ref.fa.fai is generated automatically by the
faidx command.
- o
- Convert a BAM file to a CRAM file using a local reference sequence.
samtools view -C -T ref.fa aln.bam > aln.cram
- o
- Attach the RG tag while merging sorted alignments:
perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
samtools merge -rh rg.txt merged.bam ga.bam 454.bam
The value in a RG tag is determined by the file name
the read is coming from. In this example, in the merged.bam,
reads from ga.bam will be attached RG:Z:ga, while reads
from 454.bam will be attached RG:Z:454.
- o
- Convert a BAM file to a CRAM with NM and MD tags stored verbatim rather
than calculating on the fly during CRAM decode, so that mixed data sets
with MD/NM only on some records, or NM calculated using different
definitions of mismatch, can be decoded without change. The second command
demonstrates how to decode such a file. The request to not decode MD here
is turning off auto-generation of both MD and NM; it will still emit the
MD/NM tags on records that had these stored verbatim.
samtools view -C --output-fmt-option store_md=1 --output-fmt-option store_nm=1 -o aln.cram aln.bam
samtools view --input-fmt-option decode_md=0 -o aln.new.bam aln.cram
- o
- An alternative way of achieving the above is listing multiple options
after the --output-fmt or -O option. The commands below are
equivalent to the two above.
samtools view -O cram,store_md=1,store_nm=1 -o aln.cram aln.bam
samtools view --input-fmt cram,decode_md=0 -o aln.new.bam aln.cram
- o
- Call SNPs and short INDELs:
samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf > var.flt.vcf
The bcftools filter command marks low quality sites and
sites with the read depth exceeding a limit, which should be adjusted to
about twice the average read depth (bigger read depths usually indicate
problematic regions which are often enriched for artefacts). One may
consider to add -C50 to mpileup if mapping quality is
overestimated for reads containing excessive mismatches. Applying this
option usually helps BWA-short but may not other mappers.
Individuals are identified from the SM tags in the
@RG header lines. Individuals can be pooled in one alignment
file; one individual can also be separated into multiple files. The
-P option specifies that indel candidates should be collected
only from read groups with the @RG-PL tag set to ILLUMINA.
Collecting indel candidates from reads sequenced by an indel-prone
technology may affect the performance of indel calling.
- o
- Generate the consensus sequence for one diploid individual:
samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
- o
- Phase one individual:
samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out
The calmd command is used to reduce false heterozygotes
around INDELs.
- o
- Dump BAQ applied alignment for other SNP callers:
samtools calmd -bAr aln.bam > aln.baq.bam
It adds and corrects the NM and MD tags at the
same time. The calmd command also comes with the -C
option, the same as the one in pileup and mpileup. Apply
if it helps.
- o
- Unaligned words used in bam_import.c, bam_endian.h, bam.c and
bam_aux.c.
- o
- Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan
reads or ends mapped to different chromosomes). If this is a concern,
please use Picard's MarkDuplicates which correctly handles these cases,
although a little slower.
Heng Li from the Sanger Institute wrote the original C version of
samtools. Bob Handsaker from the Broad Institute implemented the BGZF
library. James Bonfield from the Sanger Institute developed the CRAM
implementation. John Marshall and Petr Danecek contribute to the source code
and various people from the 1000 Genomes Project have contributed to the SAM
format specification.
bcftools(1), sam(5), tabix(1)
Samtools website: <http://www.htslib.org/>
File format specification of SAM/BAM,CRAM,VCF/BCF:
<http://samtools.github.io/hts-specs>
Samtools latest source: <https://github.com/samtools/samtools>
HTSlib latest source: <https://github.com/samtools/htslib>
Bcftools website: <http://samtools.github.io/bcftools>