AdapterRemoval - Fast short-read adapter trimming and
processing
AdapterRemoval [options...] --file1
<filenames> [--file2 <filenames>]
AdapterRemoval removes residual adapter sequences from
single-end (SE) or paired-end (PE) FASTQ reads, optionally trimming Ns and
low-quality bases and/or collapsing overlapping paired-end mates into one
read. Low-quality reads are filtered based on the resulting length and the
number of ambiguous nucleotides ('N') present following trimming. These
operations may be combined with simultaneous demultiplexing using 5' barcode
sequences. Alternatively, AdapterRemoval may attempt to reconstruct
consensus adapter sequences from paired-end data, in order to allow the
identification of the adapter sequences originally used.
If you use this program, please cite the paper:
Schubert, Lindgreen, and Orlando (2016). AdapterRemoval
v2: rapid adapter trimming, identification, and read merging. BMC Research
Notes, 12;
9(1):88
http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2
For detailed documentation, please see
http://adapterremoval.readthedocs.io/en/v2.2.3/
- --help
- Display summary of command-line options.
- --file1 filename
[filenames...]
- Read FASTQ reads from one or more files, either uncompressed, bzip2
compressed, or gzip compressed. This contains either the single-end (SE)
reads or, if paired-end, the mate 1 reads. If running in paired-end mode,
both --file1 and --file2 must be set. See the primary
documentation for a list of supported formats.
- --identify-adapters
- Attempt to build a consensus adapter sequence from fully overlapping pairs
of paired-end reads. The minimum overlap is controlled by
--minalignmentlength. The result will be compared with the values
set using --adapter1 and --adapter2. No trimming is
performed in this mode. Default is off.
- --qualitybase
base
- The quality score encoding used in input reads - either '64' for
Phred+64 (Illumina 1.3+ and 1.5+) or '33' for Phred+33 (Illumina 1.8+). In
addition, the value 'solexa' may be used to specify reads with Solexa
encoded scores. Default is 33.
- --qualitybase-output
base
- The base of the quality score for reads written by AdapterRemoval - either
'64' for Phred+64 (i.e., Illumina 1.3+ and 1.5+) or '33' for Phred+33
(Illumina 1.8+). In addition, the value 'solexa' may be used to specify
reads with Solexa encoded scores. However, note that quality scores are
represented using Phred scores internally, and conversion to and from
Solexa scores therefore results in a loss of information. The default
corresponds to the value given for --qualitybase.
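The loss of information mentioned for Solexa scores can be illustrated with a short Python sketch. It uses the commonly cited Solexa-to-Phred conversion formula with integer rounding; the function names are hypothetical, and this is not taken from the AdapterRemoval source. It only shows that distinct Solexa scores can map to the same Phred score, so the conversion cannot be reversed exactly.

```python
import math


def decode_quality(char, base=33):
    """Decode one FASTQ quality character using an ASCII offset (33 or 64)."""
    return ord(char) - base


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the nearest integer Phred score.

    Assumption: the standard formula Q_phred = 10 * log10(10^(Q_solexa/10) + 1),
    rounded to an integer; AdapterRemoval's own rounding may differ.
    """
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


# Map the lowest Solexa scores to Phred scores; several of them coincide,
# which is why a round trip through Phred cannot recover the original values.
low_scores = {q: solexa_to_phred(q) for q in range(-5, 1)}
```

For high scores the two scales are practically identical, so the loss only affects the low end of the range.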
- --qualitymax
base
- Specifies the maximum Phred score expected in input files, and used when
writing output files. Possible values are 0 to 93 for Phred+33 encoded
files, and 0 to 62 for Phred+64 encoded files. Defaults to 41.
- --interleaved
- Enables --interleaved-input and --interleaved-output.
- --interleaved-input
- If set, input is expected to be an interleaved FASTQ file specified using
--file1, in which pairs of reads are written one after the other
(e.g. read1/1, read1/2, read2/1, read2/2, etc.).
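To make the interleaved layout concrete, the following Python sketch pairs up consecutive records from an interleaved FASTQ stream. It assumes simple four-line FASTQ records; the function names are hypothetical and the code is independent of how AdapterRemoval itself parses its input.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, qualities) from four-line FASTQ records."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        name, sequence, _, qualities = (line.rstrip("\n") for line in record)
        yield name, sequence, qualities


def iter_interleaved_pairs(lines):
    """Yield (mate1, mate2) tuples from interleaved input (read1/1, read1/2, ...)."""
    records = read_fastq(lines)
    for mate1 in records:
        mate2 = next(records, None)
        if mate2 is None:
            raise ValueError("odd number of records in interleaved input")
        yield mate1, mate2
```

An odd number of records is treated as an error here, since every mate 1 read is expected to be followed by its mate 2 read.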
- --interleaved-output
- Write paired-end reads to a single file, interleaving mate 1 and mate 2
reads. By default, this file is named basename.paired.truncated,
but this may be changed using the --output1 option.
- --combined-output
- Write all reads into the files specified by --output1 and
--output2. The sequences of reads discarded due to quality filters
or read merging are replaced with a single 'N' with Phred score 0. This
option can be combined with --interleaved-output to write PE reads
to a single output file specified with --output1.
- --basename
filename
- Prefix used for naming output files, unless these names have been
overridden using the corresponding command-line option (see below).
- --settings
file
- Output file containing information on the parameters used in the run as
well as overall statistics on the reads after trimming. Default filename
is 'basename.settings'.
- --output1
file
- Output file containing trimmed mate1 reads. Default filename is
'basename.pair1.truncated' for paired-end reads, 'basename.truncated' for
single-end reads, and 'basename.paired.truncated' for interleaved
paired-end reads.
- --output2
file
- Output file containing trimmed mate 2 reads when
--interleaved-output is not enabled. Default filename is
'basename.pair2.truncated' in paired-end mode.
- --singleton
file
- Output file containing paired reads for which the mate has been
discarded. Default filename is 'basename.singleton.truncated'.
- --outputcollapsed
file
- If --collapse is set, contains overlapping mate-pairs which have been
merged into a single read (PE mode) or reads for which the adapter was
identified by a minimum overlap, indicating that the entire template
molecule is present (SE mode). This does not include reads which have
subsequently been trimmed due to low-quality or ambiguous nucleotides.
Default filename is 'basename.collapsed'.
- --outputcollapsedtruncated
file
- Collapsed reads (see --outputcollapsed) which were trimmed due to the
presence of low-quality or ambiguous nucleotides. Default filename is
'basename.collapsed.truncated'.
- --discarded
file
- Contains reads discarded due to the --minlength, --maxlength or --maxns
options. Default filename is 'basename.discarded'.
- --gzip
- If set, all FASTQ files written by AdapterRemoval will be gzip compressed
using the compression level specified using --gzip-level. The
extension ".gz" is added to files for which no filename was
given on the command-line. Defaults to off.
- --gzip-level level
- Determines the compression level used when gzip'ing FASTQ files. Must be a
value in the range 0 to 9, with 0 disabling compression and 9 being the
best compression. Defaults to 6.
- --bzip2
- If set, all FASTQ files written by AdapterRemoval will be bzip2 compressed
using the compression level specified using --bzip2-level. The
extension ".bz2" is added to files for which no filename was
given on the command-line. Defaults to off.
- --bzip2-level level
- Determines the compression level used when bzip2'ing FASTQ files. Must be
a value in the range 1 to 9, with 9 being the best compression. Defaults
to 9.
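The default filenames described for the output options above can be summarized in a small Python sketch. The function and its parameters are hypothetical; it assumes that only FASTQ files receive the ".gz" or ".bz2" extension (the settings file is left as-is) and that the collapsed files are only relevant when --collapse is used.

```python
def default_output_names(basename, paired=False, interleaved=False,
                         collapse=False, compression=None):
    """Return default output filenames; compression is None, "gz" or "bz2"."""
    fastq = {"discarded": "discarded"}
    if paired and interleaved:
        fastq["output1"] = "paired.truncated"
    elif paired:
        fastq["output1"] = "pair1.truncated"
        fastq["output2"] = "pair2.truncated"
    else:
        fastq["output1"] = "truncated"
    if paired:
        fastq["singleton"] = "singleton.truncated"
    if collapse:
        fastq["outputcollapsed"] = "collapsed"
        fastq["outputcollapsedtruncated"] = "collapsed.truncated"

    # Assumption: the compression extension is appended to FASTQ files only.
    suffix = "." + compression if compression else ""
    names = {key: f"{basename}.{ext}{suffix}" for key, ext in fastq.items()}
    names["settings"] = f"{basename}.settings"
    return names
```

For example, `default_output_names("sample", paired=True, compression="gz")` maps "output1" to "sample.pair1.truncated.gz". Names given explicitly on the command line are used unchanged.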
- --adapter1
adapter
- Adapter sequence expected to be found in mate 1 reads, specified in read
direction. For a detailed description of how to provide the appropriate
adapter sequences, see the "Adapters" section of the online
documentation. Default is
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG.
- --adapter2
adapter
- Adapter sequence expected to be found in mate 2 reads, specified in read
direction. For a detailed description of how to provide the appropriate
adapter sequences, see the "Adapters" section of the online
documentation. Default is
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT.
- --adapter-list
filename
- Read one or more adapter sequences from a table. The first two columns
(separated by whitespace) of each line in the file are expected to
correspond to values passed to --adapter1 and --adapter2. In single-end
mode, only column one is required. Lines starting with '#' are ignored.
When multiple rows are found in the table, AdapterRemoval will try each
adapter (pair), and select the best aligning adapters for each FASTQ read
processed.
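The table format accepted by --adapter-list can be sketched as follows in Python. The parser below is hypothetical and only mirrors the rules stated above (whitespace-separated columns, '#' comment lines, one column sufficient in single-end mode); skipping blank lines is an additional assumption.

```python
def parse_adapter_list(lines, paired=False):
    """Return a list of (adapter1, adapter2) tuples; adapter2 may be None."""
    adapters = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines are skipped along with '#' comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        required = 2 if paired else 1
        if len(fields) < required:
            raise ValueError(f"expected at least {required} column(s): {line!r}")
        adapter2 = fields[1] if len(fields) > 1 else None
        adapters.append((fields[0], adapter2))
    return adapters
```

Each returned tuple corresponds to one candidate adapter (pair) to be tried against every read.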
- --minadapteroverlap
length
- In single-end mode, reads are only trimmed if the overlap between the read
and the adapter is at least length bases long, not counting ambiguous
nucleotides (N). This is independent of --minalignmentlength when using
--collapse, allowing a conservative selection of putative complete
inserts in single-end mode, while ensuring that all possible adapter
contamination is trimmed. The default is 0.
- --mm mismatchrate
- The maximum fraction of mismatches allowed in the aligned region. If the
value is less than 1, then the value is used directly. If the value is
greater than 1, the rate is set to 1 / value. The default setting is 3
when trimming adapters, corresponding to a maximum mismatch rate of 1/3,
and 10 when using --identify-adapters.
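The interpretation of the --mm value can be written out as a short Python sketch. The function name is hypothetical; it simply restates the rule given above.

```python
def effective_mismatch_rate(value):
    """Translate a --mm value into a mismatch rate between 0 and 1."""
    if value > 1:
        # Values above 1 are read as "one mismatch per <value> bases".
        return 1.0 / value
    return value
```

With this rule, the default of 3 yields a rate of 1/3, the default of 10 used with --identify-adapters yields 0.1, and a value such as 0.25 is used unchanged.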
- --shift n
- To allow for missing bases in the 5' end of the read, the program can let
the alignment slip --shift bases in the 5' end. This corresponds to
starting the alignment at most --shift nucleotides into read2 (for
paired-end) or the adapter (for single-end). The default is 2.
- --trim5p n
[n]
- Trim the 5' end of reads by a fixed amount after removing adapters, but
before carrying out quality based trimming. Specify one value to trim mate 1
and mate 2 reads by the same amount, or two values separated by a space to
trim each mate by a different amount. Off by default.
- --trimns
- Trim consecutive Ns from the 5' and 3' termini. If quality trimming is
also enabled (--trimqualities), then stretches of mixed low-quality
bases and/or Ns are trimmed.
- --maxns n
- Discard reads containing more than --maxns ambiguous bases ('N')
after trimming. Default is 1000.
- --trimqualities
- Trim consecutive stretches of low quality bases (threshold set by
--minquality) from the 5' and 3' termini. If trimming of Ns is also
enabled (--trimns), then stretches of mixed low-quality bases and
Ns are trimmed.
- --trimwindows
window_size
- Trim low quality bases using a sliding window based approach inspired by
sickle with the given window size. See the "Window based
quality trimming" section of the manual page for a description of
this algorithm.
- --minquality
minimum
- Set the threshold for trimming low quality bases using
--trimqualities and --trimwindows. Default is 2.
- --preserve5p
- If set, bases at the 5' termini will not be trimmed by --trimns,
--trimqualities, and --trimwindows. Collapsed reads will not
be quality trimmed when this option is enabled.
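The interaction between --trimns, --trimqualities, --minquality and --preserve5p can be sketched in Python as below. This is a simplified illustration, not the AdapterRemoval implementation: the function name is hypothetical, and it assumes that a base counts as low quality when its Phred score is less than or equal to --minquality.

```python
def trim_termini(seq, quals, minquality=2, trim_ns=True,
                 trim_qualities=True, preserve5p=False):
    """Trim stretches of Ns and/or low-quality bases from the read termini."""

    def is_bad(i):
        # Assumption: "low quality" means a score <= minquality.
        is_n = trim_ns and seq[i] in "Nn"
        is_low = trim_qualities and quals[i] <= minquality
        return is_n or is_low

    start, end = 0, len(seq)
    # Trim the 3' terminus first, then the 5' terminus unless it is preserved.
    while end > start and is_bad(end - 1):
        end -= 1
    if not preserve5p:
        while start < end and is_bad(start):
            start += 1
    return seq[start:end], quals[start:end]
```

When both kinds of trimming are enabled, a mixed run such as a low-quality base followed by an N is removed as one stretch, which matches the behavior described for the two options.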
- --minlength
length
- Reads shorter than this length are discarded following trimming. Defaults
to 15.
- --maxlength
length
- Reads longer than this length are discarded following trimming. Defaults
to 4294967295.
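Taken together, --minlength, --maxlength and --maxns act as a simple filter on the trimmed read. A minimal Python sketch, using a hypothetical function name and the defaults listed above:

```python
def passes_filters(seq, minlength=15, maxlength=4294967295, maxns=1000):
    """Return True if a trimmed read would be kept by the length and N filters."""
    if len(seq) < minlength or len(seq) > maxlength:
        return False
    # Reads with more than maxns ambiguous bases are discarded.
    return seq.upper().count("N") <= maxns
```

The default maximum length of 4294967295 equals 2^32 - 1, so in practice no read is discarded for being too long unless --maxlength is set.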
- --collapse
- In paired-end mode, merge overlapping mates into a single read and recalculate
the quality scores. In single-end mode, attempt to identify templates for
which the entire sequence is available. In both cases, complete
"collapsed" reads are written with a 'M_' name prefix, and
"collapsed" reads which are trimmed due to quality settings are
written with a 'MT_' name prefix. The overlap needs to be at least
--minalignmentlength nucleotides, with a maximum number of
mismatches determined by --mm.
- --minalignmentlength
length
- The minimum overlap between mate 1 and mate 2 before the reads are
collapsed into one, when collapsing paired-end reads, or when attempting
to identify complete template sequences in single-end mode. Default is
11.
- --seed seed
- When collapsing reads at positions where the two reads differ, and the
qualities of the bases are identical, AdapterRemoval will select a random
base. This option specifies the seed used for the random number generator
used by AdapterRemoval. This value is also written to the settings file.
Note that setting the seed is not reliable in multithreaded mode, since
the order of operations is non-deterministic.
- --collapse-deterministic
- Enable deterministic mode; currently this only affects --collapse:
mismatching overlapping bases with equal quality are set to 'N' with quality
0, instead of being randomly sampled. Setting this option also sets --collapse.
- --collapse-conservatively
- Alternative merging algorithm inspired by FASTQ-join: For matching
overlapping bases, the highest quality score is used. For mismatching
overlapping bases, the highest quality base is used and the quality is set
to the absolute difference in Phred-score between the two bases. For
mismatching bases with identical quality scores, the base is set to 'N'
and the quality score to 0 (Phred-encoded). Setting this option also sets
--collapse.
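The per-position rules of --collapse-conservatively can be expressed directly in Python. The sketch below handles a single pair of overlapping bases; the function name is hypothetical and the code follows only the description above, not the AdapterRemoval source.

```python
def merge_conservatively(base1, qual1, base2, qual2):
    """Merge one overlapping position; returns (base, phred_quality)."""
    if base1 == base2:
        # Matching bases: keep the highest quality score.
        return base1, max(qual1, qual2)
    if qual1 == qual2:
        # Mismatch with identical qualities: no base can be preferred.
        return "N", 0
    # Mismatch: keep the higher-quality base, scored by the difference.
    if qual1 > qual2:
        return base1, qual1 - qual2
    return base2, qual2 - qual1
```

For instance, merging 'A' at quality 30 with 'C' at quality 10 gives 'A' at quality 20, while 'A' and 'C' both at quality 30 give 'N' at quality 0.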
- --barcode-list
filename
- Perform demultiplexing using a table of one or two fixed-length barcodes for
SE or PE reads. The table is expected to contain 2 or 3 columns, the first
of which represents the name of a given sample, and the second and third of
which represent the mate 1 and (optionally) the mate 2 barcode sequences.
For a detailed description, see the "Demultiplexing" section of
the online documentation.
- --barcode-mm-r1
n
- Maximum number of mismatches allowed for the mate 1 barcode; if not set,
this value is equal to the --barcode-mm value; cannot be higher
than the --barcode-mm value.
- --barcode-mm-r2
n
- Maximum number of mismatches allowed for the mate 2 barcode; if not set,
this value is equal to the --barcode-mm value; cannot be higher
than the --barcode-mm value.
- --demultiplex-only
- Only carry out demultiplexing using the list of barcodes supplied with
--barcode-list. No other processing is done.
As of v2.2.2, AdapterRemoval implements a sliding window based
approach to quality based base-trimming inspired by sickle. If
window_size is greater than or equal to 1, that number is used as the
window size for all reads. If window_size is a number greater than or
equal to 0 and less than 1, then that number is multiplied by the length of
individual reads to determine the window size. If the window length is zero
or is greater than the current read length, then the read length is used
instead.
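The rules for deriving the window size from window_size can be sketched in Python as follows. The function name is hypothetical, and truncating a fractional result to an integer is an assumption, since the rounding behavior is not specified here.

```python
def resolve_window_size(window_size, read_length):
    """Return the window length in bases for a read of the given length."""
    if window_size >= 1:
        size = int(window_size)
    else:
        # Fractions are relative to the read length (assumed to truncate).
        size = int(window_size * read_length)
    # A zero-length or oversized window falls back to the full read length.
    if size == 0 or size > read_length:
        size = read_length
    return size
```

For a 100 bp read, a window_size of 5 gives a 5 bp window and 0.1 gives a 10 bp window; for a 5 bp read, 0.1 would give a zero-length window, so the read length of 5 is used instead.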
Reads are trimmed as follows for a given window size:
- 1.
- The new 5' is determined by locating the first window where both the
average quality and the quality of the first base in the window are greater
than --minquality.
- 2.
- The new 3' is located by sliding the first window right, until the average
quality becomes less than or equal to --minquality. The new 3' is
placed at the last base in that window where the quality is greater than
or equal to --minquality.
- 3.
- If no 5' position could be determined, the read is discarded.
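The three steps above can be sketched in Python as follows. This is a literal reading of the description rather than the AdapterRemoval implementation, and it rests on several assumptions: the window size has already been resolved to a whole number of bases, averages are computed as floating point values, sliding stops at the last full window when the end of the read is reached, and if the final window contains no base with sufficient quality, the search for the new 3' end continues backwards towards the new 5' end. The function name is hypothetical.

```python
def trim_windowed(quals, window, minquality=2):
    """Return the retained (start, end) half-open interval, or None if discarded."""
    n = len(quals)
    if n == 0:
        return None
    if window <= 0 or window > n:
        window = n

    def average(offset):
        return sum(quals[offset:offset + window]) / window

    last_offset = n - window

    # Step 1: first window whose average and first base both exceed minquality.
    start = None
    for offset in range(last_offset + 1):
        if average(offset) > minquality and quals[offset] > minquality:
            start = offset
            break

    # Step 3: without a 5' position the read is discarded.
    if start is None:
        return None

    # Step 2: slide right until the average drops to minquality or below
    # (assumption: stop at the last full window if that never happens).
    offset = start
    while offset < last_offset and average(offset) > minquality:
        offset += 1

    # Place the 3' end at the last base of that window with quality >=
    # minquality, scanning further back if the window holds no such base.
    for position in range(offset + window - 1, start - 1, -1):
        if quals[position] >= minquality:
            return start, position + 1
    return None  # Unreachable: quals[start] always exceeds minquality.
```

As a usage example, `trim_windowed([0, 0, 30, 30, 30, 30, 0, 0, 0, 0], window=3)` keeps the interval covering the four high-quality bases, while a read consisting only of scores at or below --minquality is discarded.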
AdapterRemoval exits with status 0 if the program ran
successfully, and with a non-zero exit code if any errors were encountered.
Do not use the output from AdapterRemoval if the program returned a non-zero
exit code!
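When AdapterRemoval is called from a script, the exit status should be checked before any output files are used. A minimal Python sketch using the standard subprocess module; the wrapper name is hypothetical:

```python
import subprocess


def run_checked(command):
    """Run a command given as a list of arguments.

    Raises subprocess.CalledProcessError if the exit status is non-zero,
    so that downstream steps never consume output from a failed run.
    """
    completed = subprocess.run(command, check=True)
    return completed.returncode
```

A call might look like `run_checked(["AdapterRemoval", "--file1", "reads_1.fq", "--file2", "reads_2.fq", "--basename", "sample"])`, where the input filenames and the basename are placeholders.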
Please report any bugs using the AdapterRemoval issue-tracker:
https://github.com/MikkelSchubert/adapterremoval/issues
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or at your
option any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see
<http://www.gnu.org/licenses/>.
Mikkel Schubert; Stinus Lindgreen
2017, Mikkel Schubert; Stinus Lindgreen