wtdbg2 - de novo sequence assembler for long noisy reads
wtdbg2 [options] -i <reads.fa> -o <prefix> [reads.fa ...]
WTDBG: De novo assembler for long noisy sequences
Author: Jue Ruan <ruanjue@gmail.com>
Version: 2.5 (20190621)
-i <string> Long reads sequences file (REQUIRED; can be multiple), []
-o <string> Prefix of output files (REQUIRED), []
- -t <int>
- Number of threads, 0 for all cores, [4]
- -f
- Force to overwrite output files
-x <string> Presets, comma delimited, []
- preset1/rsII/rs: -p 21 -S 4 -s 0.05 -L 5000
- preset2: -p 0 -k 15 -AS 2 -s 0.05 -L 5000
- preset3: -p 19 -AS 2 -s 0.05 -L 5000
- sequel/sq
- nanopore/ont:
- (genome size < 1G: preset2) -p 0 -k 15 -AS 2 -s 0.05 -L 5000
- (genome size >= 1G: preset3) -p 19 -AS 2 -s 0.05 -L 5000
- preset4/corrected/ccs: -p 21 -k 0 -AS 4 -K 0.05 -s 0.5
-g <number> Approximate genome size (k/m/g suffix allowed), [0]
- -X <float>
- Choose the best <float> depth from input reads (effective with
-g), [50.0]
- -L <int>
- Choose the longest subread and drop reads shorter than <int> (5000
recommended for PacBio), [0]. A negative integer also tidies read names,
e.g. -5000.
- -k <int>
- Kmer fsize, 0 <= k <= 23, [0]
- -p <int>
- Kmer psize, 0 <= p <= 23, [21]. k + p <= 25; the seed is
<k-mer>+<p-homopolymer-compressed>
- -K <float>
- Filter high frequency kmers, maybe repetitive, [1000.05], i.e. >= 1000 and
indexing >= (1 - 0.05) * total_kmers_count
- -S <float>
- Subsampling kmers, 1/(<-S>) kmers are indexed, [4.00]. -S is
very useful in saving memory and speeding up. Please note that subsampling
kmers gives a shorter matched length
- -l <float>
- Min length of alignment, [2048]
- -m <float>
- Min matched length by kmer matching, [200]
- -R
- Enable realignment mode
- -A
- Keep contained reads during alignment
- -s <float>
- Min similarity, calculated by kmer matched length / aligned length,
[0.05]
- -e <int>
- Min read depth of a valid edge, [3]
- -q
- Quiet
- -v
- Verbose (can be multiple)
- -V
- Print version information and then exit
- --help
- Show more options
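
The numeric constraints listed above can be checked before starting a long run. The Python sketch below is illustrative only and is not part of wtdbg2. The helper names (check_seed_sizes, parse_genome_size, select_reads_by_depth) are hypothetical. The k/m/g suffixes are assumed to be decimal multipliers, which the text above does not state. The read selection keeps the longest reads until a budget of genome size times depth is reached, which is one plausible reading of -g combined with -X, not a confirmed description of the program's behaviour.

```python
def check_seed_sizes(k: int, p: int) -> bool:
    """Return True if -k/-p satisfy 0 <= k <= 23, 0 <= p <= 23 and k + p <= 25."""
    return 0 <= k <= 23 and 0 <= p <= 23 and k + p <= 25


# Assumption: suffixes are decimal multipliers (the manpage only says
# "k/m/g suffix allowed").
_SUFFIX = {"k": 10**3, "m": 10**6, "g": 10**9}


def parse_genome_size(text: str) -> int:
    """Convert a string such as '100.4m' or '2.3g' into a number of base pairs."""
    text = text.strip().lower()
    if text and text[-1] in _SUFFIX:
        return round(float(text[:-1]) * _SUFFIX[text[-1]])
    return round(float(text))


def select_reads_by_depth(read_lengths, genome_size, depth=50.0):
    """Keep the longest reads until genome_size * depth bases are collected.

    Hypothetical model of -g with -X: reads are visited from longest to
    shortest and selection stops once the base budget has been reached.
    """
    budget = genome_size * depth
    kept, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        kept.append(length)
        total += length
    return kept
```

For example, check_seed_sizes(15, 0) accepts the preset2 seed (-p 0 -k 15), while check_seed_sizes(15, 15) rejects a seed whose k + p exceeds 25.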
** more options **
--cpu <int>
- See -t 0, default: all cores
--input <string> +
- See -i
--force
- See -f
--prefix <string>
- See -o
--preset <string>
- See -x
--kmer-fsize <int>
- See -k 0
--kmer-psize <int>
- See -p 21
--kmer-depth-max <float>
- See -K 1000.05
-E, --kmer-depth-min <int>
- Min kmer frequency, [2]
--kmer-subsampling <float>
- See -S 4.0
--kbm-parts <int>
- Split total reads into multiple parts, index one part by one to save
memory, [1]
--aln-kmer-sampling <int>
- Select no more than n seeds in a query bin, default: 256
--dp-max-gap <int>
- Max number of bin(256bp) in one gap, [4]
--dp-max-var <int>
- Max number of bin(256bp) in one deviation, [4]
--dp-penalty-gap <int>
- Penalty for BIN gap, [-7]
--dp-penalty-var <int>
- Penalty for BIN deviation, [-21]
--aln-min-length <int>
- See -l 2048
--aln-min-match <int>
- See -m 200. Here the number of matches counts the base pairs in the
matched kmer regions
--aln-min-similarity <float>
- See -s 0.05
--aln-max-var <float>
- Max length variation of two aligned fragments, default: 0.25
--aln-dovetail <int>
- Retain dovetail overlaps only, the max overhang size is
<--aln-dovetail>, the value should be a multiple of 256, -1 to
disable filtering, default: 256
--aln-strand <int>
- 1: forward, 2: reverse, 3: both. Please don't change the default value 3,
unless you know exactly what you are doing
--aln-maxhit <int>
- Max n hits for each read in build graph, default: 1000
--aln-bestn <int>
- Use best n hits for each read in build graph, 0: keep all, default: 500.
<prefix>.alignments always stores all alignments
-R, --realign
- Enable re-alignment, see --realn-kmer-psize=15,
--realn-kmer-subsampling=1,
--realn-min-length=2048,
--realn-min-match=200,
--realn-min-similarity=0.1,
--realn-max-var=0.25
--realn-kmer-psize <int>
- Set kmer-psize in realignment, (kmer-ksize always eq 0), default:15
--realn-kmer-subsampling <int>
- Set kmer-subsampling in realignment, default:1
--realn-min-length <int>
- Set aln-min-length in realignment, default: 2048
--realn-min-match <int>
- Set aln-min-match in realignment, default: 200
--realn-min-similarity <float>
- Set aln-min-similarity in realignment, default: 0.1
--realn-max-var <float>
- Set aln-max-var in realignment, default: 0.25
-A, --aln-noskip
- Even if a read was contained in a previous alignment, still align it
against other reads
--keep-multiple-alignment-parts
- By default, wtdbg will keep only the best alignment between two reads
after chaining. This option will disable it, and keep multiple
alignment parts
--verbose +
- See -v. -vvvv will display the most detailed
information
--quiet
- See -q
--limit-input <int>
- Limit the input sequences to at most <int> M bp. Usually for
testing
-L <int>, --tidy-reads <int>
- Default: 0. Pick longest subreads if possible. Filter reads shorter than
<--tidy-reads>. Please add --tidy-name or set
--tidy-reads to a negative value if you want to rename reads. Set to 0 bp
to disable tidy. Suggested value is 5000 for PacBio RSII reads
--tidy-name
- Rename reads into 'S%010d' format. The first read is named as
S0000000001
--rdname-filter <string>
- A file containing the names of reads, one per line, to be discarded during
loading. If you want to filter reads by yourself, please also set -X 0
--rdname-includeonly <string>
- The reverse of --rdname-filter: only the listed reads are loaded
-g <number>, --genome-size <number>
- Provide genome size, e.g. 100.4m, 2.3g. In this version, it is used with
-X/--rdcov-cutoff in selecting reads just after all reads have been loaded.
-X <float>, --rdcov-cutoff <float>
- Default: 50.0. Retain 50.0-fold genome coverage, combined with
-g and --rdcov-filter.
--rdcov-filter [0|1]
- Default: 0. Strategy 0: retain the longest reads. Strategy 1: retain
median-length reads.
--err-free-nodes
- Select nodes from error-free sequences only. E.g. you have contigs
assembled from NGS-WGS reads, and long noisy reads. You can type
'--err-free-seq your_ctg.fa --input your_long_reads.fa
--err-free-nodes' to perform an assembly that acts somewhat like
long-read scaffolding
--node-len <int>
- The default value is 1024, which is a multiple of KBM_BIN_SIZE (always
256 bp). It specifies the length of intervals (called nodes after
selection). kbm indexes sequences into BINs of 256 bp in size, so many
parameters should be multiples of 256 bp. These are: --node-len,
--node-ovl, --aln-min-length, --aln-dovetail. Other
parameters are counted in BINs: --dp-max-gap, --dp-max-var.
--node-matched-bins <int>
- Min matched bins in a node, default:1
--node-ovl <int>
- Default: 256. Max overlap size between two adjacent intervals in any read.
It is used in selecting best nodes representing reads in graph
--node-drop <float>
- Default: 0.25. Discard a node when more than this ratio of its intervals
conflict with previously generated nodes
-e <int>, --edge-min=<int>
- Default: 3. The minimal depth of a valid edge is set to 3. In other
words, valid edges must be supported by at least 3 reads. When the
sequencing depth is low, try --edge-min 2; when it is very high, try
--edge-min 4
--edge-max-span <int>
- Default: 1024 BINs. The program builds edges no longer than
1024 BINs
--drop-low-cov-edges
- Don't attempt to rescue low coverage edges
--node-min <int>
- Min depth of an interval to be selected as a valid node. By default, this
value is automatically the same as --edge-min.
--node-max <int>
- Nodes with too high a depth are regarded as repetitive and are masked.
Default: 200, i.e. more than 200 reads contain this node
--ttr-cutoff-depth <int>, 0
--ttr-cutoff-ratio <float>, 0.5
- Tiny Tandem Repeat (TTR). A node located inside a TTR brings noise into
the graph and should be masked. The pattern of such nodes is: depth >=
<--ttr-cutoff-depth>, and none of their edges have depth greater
than depth * <--ttr-cutoff-ratio 0.5>. Set --ttr-cutoff-depth
0 to disable TTR masking
--dump-kbm <string>
- Dump kbm index into a file to be loaded by `kbm` or `wtdbg`
--dump-seqs <string>
- Dump kbm index (only sequences, no k-mer index) into a file to be loaded
by `kbm` or `wtdbg`. Please note: normally load it with --load-kbm, not
with --load-seqs
--load-kbm <string>
- Instead of reading sequences and building the kbm index, which is
time-consuming, load the kbm index from an already dumped file. Please note
that once the kbm index has been mmapped by kbm -R <kbm-index> start,
wtdbg just attaches to the shared memory, which takes very little time.
See `kbm` -R <your_seqs.kbmidx> [start | stop]
--load-seqs <string>
- Similar to --load-kbm, but only use the sequences in kbmidx, and
rebuild the index in the process's RAM.
--load-alignments <string> +
- `wtdbg` outputs read alignments into <--prefix>.alignments; the program
can load them to quickly build the assembly graph. You can also offer
another source of alignments to `wtdbg`. With --load-alignments, it only
reads long sequences and skips building the kbm index. You can type
--load-alignments <file> more than once to load alignments
from many files
--load-clips <string>
- Combined with --load-nodes. Load reads clips. You can find it in
`wtdbg`'s <--prefix>.clps
--load-nodes <string>
- Load dumped nodes from a previous execution to quickly construct the
assembly graph; should be combined with --load-clips. You can find it in
`wtdbg`'s <--prefix>.1.nodes
--bubble-step <int>
- Max step to search a bubble, meaning the max step from the starting node
to the ending node. Default: 40
--tip-step <int>
- Max step to search a tip, 10
--ctg-min-length <int>
- Min length of contigs to be output, 5000
--ctg-min-nodes <int>
- Min num of nodes in a contig to be output, 3
--minimal-output
- Generate as few output files (<--prefix>.*) as possible
--bin-complexity-cutoff <int>
- Used in filtering BINs. If a BIN has fewer indexed valid kmers than
<--bin-complexity-cutoff 2>, mask it.
--no-local-graph-analysis
- Before building edges, for each node, local-graph-analysis reads all
related reads and their corresponding nodes, and builds a local graph to
judge whether to mask it. The analysis aims to find repetitive nodes
--no-read-length-sort
- By default, `wtdbg` sorts input sequences by length in descending order.
The order of reads affects the generation of nodes when selecting
important intervals
--keep-isolated-nodes
- During graph cleaning, `wtdbg` normally masks isolated (orphaned) nodes
--no-read-clip
- By default, `wtdbg` clips an input sequence by analyzing its overlaps to
remove high-error ends, rolling-circle repeats (see PacBio CCS), and
chimeras. When building edges, clipped regions won't contribute. However,
`wtdbg` will use them in the final linking of unitigs
--no-chainning-clip
- By default, performs alignment chaining in read clipping. ** If
'--aln-bestn 0 --no-read-clip', alignments will be parsed directly,
and less RAM is spent on recording alignments
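
Two rules stated in the options above lend themselves to a short worked example: the 256 bp BIN size that --node-len, --node-ovl, --aln-min-length and --aln-dovetail should be multiples of, and the TTR masking pattern controlled by --ttr-cutoff-depth and --ttr-cutoff-ratio. The Python sketch below restates them as written in this page. It is not wtdbg2 code, and the names BIN_SIZE, is_bin_multiple, bins_to_bp and is_ttr_node are hypothetical. Treating a node with no edges as matching the TTR pattern is an assumption of the sketch, since the text does not cover that case.

```python
BIN_SIZE = 256  # KBM_BIN_SIZE, stated in this page to always equal 256 bp


def is_bin_multiple(value_bp: int) -> bool:
    """True if a bp-valued option is a non-negative multiple of 256.

    Note: --aln-dovetail also accepts -1 to disable filtering; that special
    value is not handled here.
    """
    return value_bp >= 0 and value_bp % BIN_SIZE == 0


def bins_to_bp(n_bins: int) -> int:
    """Convert an option counted in BINs (e.g. --dp-max-gap) into base pairs."""
    return n_bins * BIN_SIZE


def is_ttr_node(node_depth, edge_depths, cutoff_depth=0, cutoff_ratio=0.5):
    """Apply the TTR pattern as described for --ttr-cutoff-depth/-ratio.

    A node matches when its depth is >= cutoff_depth and none of its edges
    has depth greater than node_depth * cutoff_ratio. A cutoff_depth of 0
    disables masking. Assumption: a node without edges matches the pattern.
    """
    if cutoff_depth == 0:
        return False
    if node_depth < cutoff_depth:
        return False
    return all(d <= node_depth * cutoff_ratio for d in edge_depths)
```

With these helpers, the default --node-len 1024 and --aln-min-length 2048 pass is_bin_multiple, and the default --dp-max-gap of 4 BINs corresponds to bins_to_bp(4), i.e. 1024 bp.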
This manpage was written by Andreas Tille for the Debian distribution and
can be used for any other usage of the program.