AdapterRemoval - Fast short-read adapter trimming and
processing
AdapterRemoval [options…] --file1
<filenames> [--file2 <filenames>]
AdapterRemoval removes residual adapter sequences from
single-end (SE) or paired-end (PE) FASTQ reads, optionally trimming Ns and
low-quality bases and/or collapsing overlapping paired-end mates into one
read. Low-quality reads are filtered based on the resulting length and the
number of ambiguous nucleotides (‘N’) present following
trimming. These operations may be combined with simultaneous demultiplexing
using 5’ barcode sequences. Alternatively, AdapterRemoval may
attempt to reconstruct consensus adapter sequences from paired-end data,
in order to allow the identification of the adapter sequences originally
used.
If you use this program, please cite the paper:
Schubert, Lindgreen, and Orlando (2016). AdapterRemoval
v2: rapid adapter trimming, identification, and read merging. BMC Research
Notes, 12;
9(1):88
http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2
For detailed documentation, please see
http://adapterremoval.readthedocs.io/en/v2.2.3/
- --help
- Display summary of command-line options.
- --file1 filename
[filenames...]
- Read FASTQ reads from one or more files, either uncompressed, bzip2
compressed, or gzip compressed. This contains either the single-end (SE)
reads or, if paired-end, the mate 1 reads. If running in paired-end mode,
both --file1 and --file2 must be set. See the primary
documentation for a list of supported formats.
- --identify-adapters
- Attempt to build a consensus adapter sequence from fully overlapping pairs
of paired-end reads. The minimum overlap is controlled by
--minalignmentlength. The result will be compared with the values
set using --adapter1 and --adapter2. No trimming is
performed in this mode. Default is off.
- --qualitybase
base
- The Phred quality score encoding used in input reads - either
‘64’ for Phred+64 (Illumina 1.3+ and 1.5+) or
‘33’ for Phred+33 (Illumina 1.8+). In addition, the value
‘solexa’ may be used to specify reads with Solexa encoded
scores. Default is 33.
- --qualitybase-output
base
- The base of the quality score for reads written by AdapterRemoval - either
‘64’ for Phred+64 (i.e., Illumina 1.3+ and 1.5+) or
‘33’ for Phred+33 (Illumina 1.8+). In addition, the value
‘solexa’ may be used to specify reads with Solexa encoded
scores. However, note that quality scores are represented using Phred
scores internally, and conversion to and from Solexa scores therefore
results in a loss of information. The default corresponds to the value
given for --qualitybase.
- --qualitymax
base
- Specifies the maximum Phred score expected in input files, and used when
writing output files. Possible values are 0 to 93 for Phred+33 encoded
files, and 0 to 62 for Phred+64 encoded files. Defaults to 41.
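The relationship between the quality base and the ranges quoted above can be illustrated with a minimal Python sketch. The function names are hypothetical, Solexa-encoded scores are not handled, and rejecting scores above the maximum is an assumption of this sketch rather than documented behavior. The limits of 93 and 62 follow from the highest printable ASCII character ('~', code 126).

```python
def decode_qualities(quality_string, qualitybase=33, qualitymax=41):
    """Convert an ASCII-encoded FASTQ quality string to Phred scores."""
    if qualitybase not in (33, 64):
        raise ValueError("this sketch only handles Phred+33 and Phred+64")
    scores = [ord(char) - qualitybase for char in quality_string]
    for score in scores:
        # Assumption: scores outside 0..qualitymax are treated as errors.
        if score < 0 or score > qualitymax:
            raise ValueError(f"Phred score {score} outside 0..{qualitymax}")
    return scores


def max_representable_score(qualitybase):
    """Highest score that fits in printable ASCII (up to '~', code 126)."""
    return 126 - qualitybase
```

For example, the Phred+33 string "II5!" and the Phred+64 string "hhT@" decode to the same scores.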
- --interleaved
- Enables --interleaved-input and --interleaved-output.
- --interleaved-input
- If set, input is expected to be an interleaved FASTQ file specified using
--file1, in which pairs of reads are written one after the other
(e.g. read1/1, read1/2, read2/1, read2/2, etc.).
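The interleaved layout described above can be read back into mate pairs as follows. This is a sketch only: it assumes plain four-line FASTQ records, and the function names are hypothetical and not part of AdapterRemoval.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines."""
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if not record:
            return
        if (len(record) != 4 or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("malformed FASTQ record")
        yield record[0][1:], record[1], record[3]


def read_interleaved_pairs(lines):
    """Yield (mate1, mate2) tuples from interleaved FASTQ lines."""
    records = read_fastq(lines)
    for mate1 in records:
        # Mate 2 is expected to follow mate 1 immediately.
        mate2 = next(records, None)
        if mate2 is None:
            raise ValueError("odd number of records in interleaved input")
        yield mate1, mate2
```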
- --interleaved-output
- Write paired-end reads to a single file, interleaving mate 1 and mate 2
reads. By default, this file is named basename.paired.truncated,
but this may be changed using the --output1 option.
- --combined-output
- Write all reads into the files specified by --output1 and
--output2. The sequences of reads discarded due to quality filters
or read merging are replaced with a single ‘N’ with Phred
score 0. This option can be combined with --interleaved-output to
write PE reads to a single output file specified with
--output1.
- --basename
filename
- Prefix used for naming output files, unless these names have been
overridden using the corresponding command-line option (see below).
- --settings
file
- Output file containing information on the parameters used in the run as
well as overall statistics on the reads after trimming. Default filename
is ‘basename.settings’.
- --output1
file
- Output file containing trimmed mate1 reads. Default filename is
‘basename.pair1.truncated’ for paired-end reads,
‘basename.truncated’ for single-end reads, and
‘basename.paired.truncated’ for interleaved paired-end
reads.
- --output2
file
- Output file containing trimmed mate 2 reads when
--interleaved-output is not enabled. Default filename is
‘basename.pair2.truncated’ in paired-end mode.
- --singleton
file
- Output file containing paired reads for which the mate has been
discarded. Default filename is
‘basename.singleton.truncated’.
- --outputcollapsed
file
- If --collapse is set, contains overlapping mate-pairs which have
been merged into a single read (PE mode) or reads for which the adapter
was identified by a minimum overlap, indicating that the entire template
molecule is present. This does not include reads which have subsequently
been trimmed due to low-quality or ambiguous nucleotides. Default filename
is ‘basename.collapsed’.
- --outputcollapsedtruncated
file
- Collapsed reads (see --outputcollapsed) which were trimmed due to the
presence of low-quality or ambiguous nucleotides. Default filename is
‘basename.collapsed.truncated’.
- --discarded
file
- Contains reads discarded due to the --minlength, --maxlength
or --maxns options. Default filename is
‘basename.discarded’.
- --gzip
- If set, all FASTQ files written by AdapterRemoval will be gzip compressed
using the compression level specified using --gzip-level. The
extension “.gz” is added to files for which no filename was
given on the command-line. Defaults to off.
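The default filenames listed for the output options above can be summarized in a short Python sketch. The function and its parameters are hypothetical. It assumes that the ".gz" extension applies to FASTQ outputs only (not to the settings file) and that the collapsed files are only relevant when --collapse is set.

```python
def default_output_names(basename, paired_end, interleaved_output=False,
                         collapse=False, gzip=False):
    """Return a dict mapping option names to their default filenames."""
    suffix = ".gz" if gzip else ""
    # The settings file is not a FASTQ file; assumed to get no extension.
    names = {"settings": f"{basename}.settings"}
    if not paired_end:
        names["output1"] = f"{basename}.truncated{suffix}"
    elif interleaved_output:
        names["output1"] = f"{basename}.paired.truncated{suffix}"
    else:
        names["output1"] = f"{basename}.pair1.truncated{suffix}"
        names["output2"] = f"{basename}.pair2.truncated{suffix}"
    if paired_end:
        names["singleton"] = f"{basename}.singleton.truncated{suffix}"
    if collapse:
        names["outputcollapsed"] = f"{basename}.collapsed{suffix}"
        names["outputcollapsedtruncated"] = (
            f"{basename}.collapsed.truncated{suffix}")
    names["discarded"] = f"{basename}.discarded{suffix}"
    return names
```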
- --gzip-level level
- Determines the compression level used when gzip’ing FASTQ files.
Must be a value in the range 0 to 9, with 0 disabling compression and 9
being the best compression. Defaults to 6.
- --bzip2
- If set, all FASTQ files written by AdapterRemoval will be bzip2 compressed
using the compression level specified using --bzip2-level. The
extension “.bz2” is added to files for which no filename was
given on the command-line. Defaults to off.
- --bzip2-level level
- Determines the compression level used when bzip2’ing FASTQ files.
Must be a value in the range 1 to 9, with 9 being the best compression.
Defaults to 9.
- --adapter1
adapter
- Adapter sequence expected to be found in mate 1 reads, specified in read
direction. For a detailed description of how to provide the appropriate
adapter sequences, see the “Adapters” section of the online
documentation. Default is
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG.
- --adapter2
adapter
- Adapter sequence expected to be found in mate 2 reads, specified in read
direction. For a detailed description of how to provide the appropriate
adapter sequences, see the “Adapters” section of the online
documentation. Default is
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT.
- --adapter-list
filename
- Read one or more adapter sequences from a table. The first two columns
(separated by whitespace) of each line in the file are expected to
correspond to values passed to --adapter1 and --adapter2. In
single-end mode, only column one is required. Lines starting with
‘#’ are ignored. When multiple rows are found in the table,
AdapterRemoval will try each adapter (pair), and select the best aligning
adapters for each FASTQ read processed.
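The table format described above can be parsed with a few lines of Python. This is a sketch, not AdapterRemoval's own parser: the function name is hypothetical, and skipping blank lines is an assumption.

```python
def parse_adapter_list(lines, paired_end=True):
    """Return a list of (adapter1, adapter2) tuples from a table.

    In single-end mode only the first column is used and adapter2 is None.
    """
    required = 2 if paired_end else 1
    adapters = []
    for line in lines:
        if line.startswith("#"):
            continue  # comment lines are ignored
        fields = line.split()  # columns are separated by whitespace
        if not fields:
            continue  # assumption: blank lines are skipped
        if len(fields) < required:
            raise ValueError(f"expected {required} column(s): {line!r}")
        adapter2 = fields[1] if paired_end else None
        adapters.append((fields[0], adapter2))
    return adapters
```

Any columns beyond the first two are ignored, matching the statement that only the first two columns are read.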
- --minadapteroverlap
length
- In single-end mode, reads are only trimmed if the overlap between read and
the adapter is at least length bases long, not counting ambiguous nucleotides
(N); this is independent of the --minalignmentlength when using
--collapse, allowing a conservative selection of putative complete
inserts in single-end mode, while ensuring that all possible adapter
contamination is trimmed. The default is 0.
- --mm mismatchrate
- The maximum fraction of mismatches allowed in the aligned region. If the
value is less than 1, then the value is used directly. If
mismatchrate is greater than 1, the rate is set to 1 /
mismatchrate. The default setting is 3 when trimming adapters,
corresponding to a maximum mismatch rate of 1/3, and 10 when using
--identify-adapters.
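The conversion from the --mm argument to an effective mismatch rate can be written out as a small Python sketch. The function name is hypothetical, and rejecting negative values is an assumption.

```python
def effective_mismatch_rate(mm):
    """Translate the value given to --mm into a mismatch rate."""
    if mm < 0:
        raise ValueError("mismatch rate must not be negative")
    if mm > 1:
        # Values above 1 are interpreted as the denominator of the rate.
        return 1.0 / mm
    return float(mm)
```

With the defaults above, a value of 3 gives a rate of 1/3 and a value of 10 gives a rate of 0.1; a value of exactly 1 gives a rate of 1 under either rule.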
- --shift n
- To allow for missing bases in the 5’ end of the read, the program
can let the alignment slip --shift bases in the 5’ end. This
corresponds to starting the alignment at most --shift nucleotides
into read2 (for paired-end) or the adapter (for single-end). The default
is 2.
- --trim5p n
[n]
- Trim the 5’ end of reads by a fixed amount after removing adapters, but
before carrying out quality based trimming. Specify one value to trim mate
1 and mate 2 reads by the same amount, or two values separated by a space to
trim each mate by a different amount. Off by default.
- --trimns
- Trim consecutive Ns from the 5’ and 3’ termini. If quality
trimming is also enabled (--trimqualities), then stretches of mixed
low-quality bases and/or Ns are trimmed.
- --maxns n
- Discard reads containing more than --maxns ambiguous bases
(‘N’) after trimming. Default is 1000.
- --trimqualities
- Trim consecutive stretches of low quality bases (threshold set by
--minquality) from the 5’ and 3’ termini. If trimming
of Ns is also enabled (--trimns), then stretches of mixed
low-quality bases and Ns are trimmed.
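The interaction between --trimns and --trimqualities can be sketched in Python as follows. The function name is hypothetical, and treating a base as low quality when its score is less than or equal to --minquality is an assumption of this sketch.

```python
def trim_termini(sequence, qualities, trimns=False, trimqualities=False,
                 minquality=2):
    """Trim Ns and/or low-quality bases from both ends of a read."""
    def is_trimmable(index):
        if trimns and sequence[index] == "N":
            return True
        # Assumption: "low quality" means score <= minquality.
        return trimqualities and qualities[index] <= minquality

    start, end = 0, len(sequence)
    while start < end and is_trimmable(start):
        start += 1
    while end > start and is_trimmable(end - 1):
        end -= 1
    return sequence[start:end], qualities[start:end]
```

With both options enabled, a terminal stretch that mixes Ns and low-quality bases is removed in one pass; with only one option enabled, trimming stops at the first base that option does not cover.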
- --trimwindows
window_size
- Trim low quality bases using a sliding window based approach inspired by
sickle with the given window size. See the “Window based
quality trimming” section of the manual page for a description of
this algorithm.
- --minquality
minimum
- Set the threshold for trimming low quality bases using
--trimqualities and --trimwindows. Default is 2.
- --minlength
length
- Reads shorter than this length are discarded following trimming. Defaults
to 15.
- --maxlength
length
- Reads longer than this length are discarded following trimming. Defaults
to 4294967295.
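The three filters applied after trimming (--minlength, --maxlength and --maxns) amount to the following check. This is a sketch using the defaults stated above; the function name is hypothetical.

```python
def passes_filters(sequence, minlength=15, maxlength=4294967295, maxns=1000):
    """Return True if a trimmed read would be kept by the length/N filters."""
    if len(sequence) < minlength:
        return False  # shorter than --minlength
    if len(sequence) > maxlength:
        return False  # longer than --maxlength
    # Reads with more than --maxns ambiguous bases are discarded.
    return sequence.count("N") <= maxns
```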
- --collapse
- In paired-end mode, merge overlapping mates into a single read and recalculate
the quality scores. In single-end mode, attempt to identify templates for
which the entire sequence is available. In both cases, complete
“collapsed” reads are written with a ‘M_’ name
prefix, and “collapsed” reads which are trimmed due to
quality settings are written with a ‘MT_’ name prefix. The
overlap needs to be at least --minalignmentlength nucleotides, with
a maximum number of mismatches determined by --mm.
- --minalignmentlength
length
- The minimum overlap between mate 1 and mate 2 before the reads are
collapsed into one, when collapsing paired-end reads, or when attempting
to identify complete template sequences in single-end mode. Default is
11.
- --seed seed
- When collapsing reads at positions where the two reads differ, and the
qualities of the bases are identical, AdapterRemoval will select a random
base. This option specifies the seed used for the random number generator
used by AdapterRemoval. This value is also written to the settings file.
Note that setting the seed is not reliable in multithreaded mode, since
the order of operations is non-deterministic.
- --deterministic
- Enable deterministic mode; currently only affects --collapse:
differing overlapping bases with equal quality are set to N with quality 0,
instead of being randomly sampled.
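The difference between the seeded and the deterministic behavior can be illustrated with a Python sketch of the tie case only. The function name is hypothetical and this is not AdapterRemoval's merging code. How quality scores are recalculated for a randomly chosen base is not described here, so the sketch returns None for that quality.

```python
import random


def resolve_equal_quality_mismatch(base1, base2, rng=None,
                                   deterministic=False):
    """Pick a base where two mates disagree with identical qualities.

    Returns (base, quality); quality is None when it is left unspecified.
    """
    if base1 == base2:
        raise ValueError("bases agree; nothing to resolve")
    if deterministic:
        return "N", 0  # --deterministic: N with quality 0
    rng = rng or random.Random()
    # Default behavior: sample one of the two bases at random.
    return rng.choice((base1, base2)), None
```

Passing a random.Random instance created from a fixed seed makes the choice repeatable in a single-threaded run, which mirrors the role of --seed.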
- --barcode-list
filename
- Perform demultiplexing using a table of one or two fixed-length barcodes for
SE or PE reads. The table is expected to contain 2 or 3 columns, the first
of which represents the name of a given sample, and the second and third of
which represent the mate 1 and (optionally) the mate 2 barcode sequence.
For a detailed description, see the “Demultiplexing” section
of the online documentation.
- --barcode-mm-r1
n
- Maximum number of mismatches allowed for the mate 1 barcode; if not set,
this value is equal to the --barcode-mm value; cannot be higher
than the --barcode-mm value.
- --barcode-mm-r2
n
- Maximum number of mismatches allowed for the mate 2 barcode; if not set,
this value is equal to the --barcode-mm value; cannot be higher
than the --barcode-mm value.
- --demultiplex-only
- Only carry out demultiplexing using the list of barcodes supplied with
--barcode-list. No other processing is done.
As of v2.2.2, AdapterRemoval implements a sliding window based
approach to quality based base-trimming inspired by sickle. If
window_size is greater than or equal to 1, that number is used as the
window size for all reads. If window_size is a number greater than or
equal to 0 and less than 1, then that number is multiplied by the length of
individual reads to determine the window size. If the window length is zero
or is greater than the current read length, then the read length is used
instead.
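The rules for turning window_size into a window length can be sketched in Python as follows. The function name is hypothetical, and truncating fractional results to an integer is an assumption, since the rounding rule is not stated.

```python
def window_length(window_size, read_length):
    """Resolve the window_size argument to a window length for one read."""
    if window_size < 0:
        raise ValueError("window_size must not be negative")
    if window_size >= 1:
        length = int(window_size)
    else:
        # Fractions are scaled by the read length (assumed to truncate).
        length = int(window_size * read_length)
    if length == 0 or length > read_length:
        length = read_length
    return length
```

For a 100 bp read, a window_size of 10 and a window_size of 0.1 would therefore both be expected to give a 10 bp window.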
Reads are trimmed as follows for a given window size:
- 1.
- The new 5’ end is determined by locating the first window where both
the average quality and the quality of the first base in the window are
greater than --minquality.
- 2.
- The new 3’ is located by sliding the first window right, until the
average quality becomes less than or equal to --minquality. The new
3’ is placed at the last base in that window where the quality is
greater than or equal to --minquality.
- 3.
- If no 5’ position could be determined, the read is discarded.
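The three steps above can be sketched in Python. This follows the description only and is not AdapterRemoval's implementation. The function name is hypothetical, the window is given as an already resolved number of bases, and three assumptions are made: only windows lying fully inside the read are examined, the read is kept to its 3’ end if the average never drops, and the read is cut at the start of the failing window if no base in that window reaches --minquality.

```python
def trim_windowed(qualities, window, minquality=2):
    """Return (start, end) of the retained region, or None if discarded."""
    n = len(qualities)
    if n == 0:
        return None
    if window <= 0 or window > n:
        window = n

    # Step 1: first window whose average and first base exceed minquality.
    start = None
    for i in range(n - window + 1):
        current = qualities[i:i + window]
        if sum(current) / window > minquality and current[0] > minquality:
            start = i
            break
    if start is None:
        return None  # Step 3: no 5' position found, read is discarded

    # Step 2: slide right until the average is <= minquality.
    end = n  # assumption: keep to the 3' end if the average never drops
    for i in range(start + 1, n - window + 1):
        if sum(qualities[i:i + window]) / window <= minquality:
            end = i  # assumption: fallback if no base below qualifies
            for j in range(i + window - 1, i - 1, -1):
                if qualities[j] >= minquality:
                    end = j + 1  # last base in window with q >= minquality
                    break
            break
    return start, end
```

The returned pair is a half-open interval, so the trimmed read would be sequence[start:end].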
AdapterRemoval exits with status 0 if the program ran
successfully, and with a non-zero exit code if any errors were encountered.
Do not use the output from AdapterRemoval if the program returned a non-zero
exit code!
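When AdapterRemoval is called from a script, the exit status should be checked before the output files are used. The following Python sketch shows one way to do this. The helper names are hypothetical, and only the --file1, --file2 and --basename options described above are passed.

```python
import subprocess
import sys


def build_command(file1, file2=None, basename=None):
    """Assemble an AdapterRemoval command line as a list of arguments."""
    command = ["AdapterRemoval", "--file1", file1]
    if file2 is not None:
        command += ["--file2", file2]
    if basename is not None:
        command += ["--basename", basename]
    return command


def run_checked(command):
    """Run a command and refuse to continue on a non-zero exit code."""
    result = subprocess.run(command)
    if result.returncode != 0:
        print(f"command failed with status {result.returncode}",
              file=sys.stderr)
        raise RuntimeError("non-zero exit code; do not use the output")
```

A typical call would be run_checked(build_command("reads_1.fq", "reads_2.fq", "sample")), where the filenames are placeholders.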
Please report any bugs using the AdapterRemoval issue-tracker:
https://github.com/MikkelSchubert/adapterremoval/issues
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or at your
option any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see
<http://www.gnu.org/licenses/>.
Mikkel Schubert; Stinus Lindgreen
2017, Mikkel Schubert; Stinus Lindgreen