flash - Fast Length Adjustment of SHort reads
flash [OPTIONS] MATES_1.FASTQ
MATES_2.FASTQ
flash [OPTIONS] --interleaved-input
(MATES.FASTQ | -)
flash [OPTIONS] --tab-delimited-input (MATES.TAB |
-)
FLASH (Fast Length Adjustment of SHort reads) is an accurate and
fast tool to merge paired-end reads that were generated from DNA fragments
whose lengths are shorter than twice the length of reads. Merged read pairs
result in unpaired longer reads, which are generally more desired in genome
assembly and genome analysis processes.
Briefly, the FLASH algorithm considers all possible overlaps at or
above a minimum length between the reads in a pair and chooses the overlap
that results in the lowest mismatch density (proportion of mismatched bases
in the overlapped region). Ties between multiple overlaps are broken by
considering quality scores at mismatch sites. When building the merged
sequence, FLASH computes a consensus sequence in the overlapped region. More
details can be found in the original publication
(http://bioinformatics.oxfordjournals.org/content/27/21/2957.full).
- - FLASH cannot merge paired-end reads that do not overlap.
- - FLASH is not designed for data that has a significant amount of indel
errors (such as Sanger sequencing data). It is best suited for Illumina
data.
The most common input to FLASH is two FASTQ files containing read
1 and read 2 of each mate pair, respectively, in the same order.
Alternatively, you may provide one FASTQ file, which may be
standard input, containing paired-end reads in either interleaved FASTQ (see
the --interleaved-input option) or tab-delimited (see the
--tab-delimited-input option) format. In all cases, gzip compressed
input is autodetected. Also, in all cases, the PHRED offset is, by default,
assumed to be 33; use the --phred-offset option to change it.
The default output of FLASH consists of the following files:
- -
out.extendedFrags.fastq
- The merged reads.
- -
out.notCombined_1.fastq
- Read 1 of mate pairs that were not merged.
- -
out.notCombined_2.fastq
- Read 2 of mate pairs that were not merged.
- - out.hist
- Numeric histogram of merged read lengths.
- -
out.histogram
- Visual histogram of merged read lengths.
FLASH also logs informational messages to standard output. These
can also be redirected to a file, as in the following example:
- $ flash reads_1.fq reads_2.fq 2>&1 | tee
flash.log
In addition, FLASH supports several features affecting the
output:
- - Writing the merged reads directly to standard output
(--to-stdout)
- - Writing gzip compressed output files (-z) or using an external
compression program (--compress-prog)
- - Writing the uncombined read pairs in interleaved FASTQ format
- (--interleaved-output)
- - Writing all output reads to a single file in tab-delimited format
- (--tab-delimited-output)
- -m,
--min-overlap=NUM
- The minimum required overlap length between two reads to provide a
confident overlap. Default: 10bp.
- -M,
--max-overlap=NUM
- Maximum overlap length expected in approximately 90% of read pairs. It is
by default set to 65bp, which works well for 100bp reads generated from a
180bp library, assuming a normal distribution of fragment lengths.
Overlaps longer than the maximum overlap parameter are still considered as
good overlaps, but the mismatch density (explained below) is calculated
over the first max_overlap bases in the overlapped region rather than the
entire overlap. Default: 65bp, or calculated from the specified read
length, fragment length, and fragment length standard deviation.
- -x,
--max-mismatch-density=NUM
- Maximum allowed ratio between the number of mismatched base pairs and the
overlap length. Two reads will not be combined with a given overlap if
that overlap results in a mismatched base density higher than this value.
Note: Any occurence of an 'N' in either read is ignored and not counted
towards the mismatches or overlap length. Our experimental results suggest
that higher values of the maximum mismatch density yield larger numbers of
correctly merged read pairs but at the expense of higher numbers of
incorrectly merged read pairs. Default: 0.25.
- -O,
--allow-outies
- Also try combining read pairs in the "outie" orientation,
e.g.
- Read 1:
<-----------
- Read 2: ------------>
- as opposed to only the "innie" orientation, e.g.
- Read 1:
- <------------
- Read 2: ----------->
- FLASH uses the same
parameters when trying each
- orientation. If a read pair can be combined in both "innie" and
"outie" orientations, the better-fitting one will be chosen
using the same scoring algorithm that FLASH normally uses.
- This option also causes extra
.innie and .outie
- histogram files to be produced.
- -p,
--phred-offset=OFFSET
- The smallest ASCII value of the characters used to represent quality
values of bases in FASTQ files. It should be set to either 33, which
corresponds to the later Illumina platforms and Sanger platforms, or 64,
which corresponds to the earlier Illumina platforms. Default: 33.
-r, --read-len=LEN
-f, --fragment-len=LEN
- -s,
--fragment-len-stddev=LEN
- Average read length, fragment length, and fragment standard deviation.
These are convenience parameters only, as they are only used for
calculating the maximum overlap (--max-overlap) parameter. The
maximum overlap is calculated as the overlap of average-length reads from
an average-size fragment plus 2.5 times the fragment length standard
deviation. The default values are -r 100, -f 180, and
-s 18, so this works out to a maximum overlap of 65 bp. If
--max-overlap is specified, then the specified value overrides the
calculated value.
- If you do not know the standard
deviation of the
- fragment library, you can probably assume that the standard deviation is
10% of the average fragment length.
- --cap-mismatch-quals
- Cap quality scores assigned at mismatch locations to 2. This was the
default behavior in FLASH v1.2.7 and earlier. Later versions will instead
calculate such scores as max(|q1 - q2|, 2); that is, the absolute value of
the difference in quality scores, but at least 2. Essentially, the new
behavior prevents a low quality base call that is likely a sequencing
error from significantly bringing down the quality of a high quality,
likely correct base call.
- --interleaved-input
- Instead of requiring files MATES_1.FASTQ and MATES_2.FASTQ, allow a single
file MATES.FASTQ that has the paired-end reads interleaved. Specify
"-" to read from standard input.
- --interleaved-output
- Write the uncombined pairs in interleaved FASTQ format.
- -I,
--interleaved
- Equivalent to specifying both --interleaved-input and
--interleaved-output.
- -Ti,
--tab-delimited-input
- Assume the input is in tab-delimited format rather than FASTQ, in the
format described below in '--tab-delimited-output'. In this mode you
should provide a single input file, each line of which must contain either
a read pair (5 fields) or a single read (3 fields). FLASH will try to
combine the read pairs. Single reads will be written to the output file
as-is if also using --tab-delimited-output; otherwise they will be
ignored. Note that you may specify "-" as the input file to read
the tab-delimited data from standard input.
- -To,
--tab-delimited-output
- Write output in tab-delimited format (not FASTQ). Each line will contain
either a combined pair in the format 'tag <tab> seq <tab>
qual' or an uncombined pair in the format 'tag <tab> seq_1
<tab> qual_1 <tab> seq_2 <tab> qual_2'.
- -o,
--output-prefix=PREFIX
- Prefix of output files. Default: "out".
- -d,
--output-directory=DIR
- Path to directory for output files. Default: current working
directory.
- -c,
--to-stdout
- Write the combined reads to standard output. In this mode, with FASTQ
output (the default) the uncombined reads are discarded. With
tab-delimited output, uncombined reads are included in the tab-delimited
data written to standard output. In both cases, histogram files are not
written, and informational messages are sent to standard error rather than
to standard output.
- -z,
--compress
- Compress the output files directly with zlib, using the gzip container
format. Similar to specifying --compress-prog=gzip and
--suffix=gz, but may be slightly faster.
- --compress-prog=PROG
- Pipe the output through the compression program PROG, which will be called
as `PROG -c -', plus any arguments specified by
--compress-prog-args. PROG must read uncompressed data from
standard input and write compressed data to standard output when invoked
as noted above. Examples: gzip, bzip2, xz, pigz.
- --compress-prog-args=ARGS
- A string of additional arguments that will be passed to the compression
program if one is specified with --compress-prog=PROG. (The
arguments '-c -' are still passed in addition to explicitly specified
arguments.)
- --suffix=SUFFIX,
--output-suffix=SUFFIX
- Use SUFFIX as the suffix of the output files after ".fastq". A
dot before the suffix is assumed, unless an empty suffix is provided.
Default: nothing; or 'gz' if -z is specified; or PROG if
--compress-prog=PROG is specified.
- -t,
--threads=NTHREADS
- Set the number of worker threads. This is in addition to the I/O threads.
Default: number of processors. Note: if you need FLASH's output to appear
deterministically or in the same order as the original reads, you must
specify -t 1 (--threads=1).
- -q, --quiet
- Do not print informational messages.
- -h, --help
- Display this help and exit.
- -v, --version
- Display version.
This manpage was written by Andreas Tille for the Debian
distribution and
can be used for any other usage of the program.