User Commands

NAME

wtdbg2 - de novo sequence assembler for long noisy reads

SYNOPSIS

wtdbg2 [options] -i <reads.fa> -o <prefix> [reads.fa ...]

DESCRIPTION

WTDBG: De novo assembler for long noisy sequences
Author: Jue Ruan <ruanjue@gmail.com>
Version: 2.5 (20190621)
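
As a rough illustration of the synopsis, the following Python sketch builds a wtdbg2 argument list from options documented on this page. The helper name `build_wtdbg2_cmd` and the file names are hypothetical; the command is only printed, never executed.

```python
import shlex


def build_wtdbg2_cmd(reads, prefix, preset=None, genome_size=None,
                     threads=4, force=False):
    """Return an argv list for wtdbg2 (hypothetical helper).

    Only flags listed on this page are used: -x, -g, -t, -f, -i, -o.
    """
    if not reads:
        raise ValueError("at least one reads file is required (-i)")
    cmd = ["wtdbg2"]
    if preset:
        cmd += ["-x", preset]
    if genome_size:
        cmd += ["-g", str(genome_size)]
    cmd += ["-t", str(threads)]
    if force:
        cmd.append("-f")
    # -i may be given several times, once per reads file.
    for path in reads:
        cmd += ["-i", path]
    cmd += ["-o", prefix]
    return cmd


if __name__ == "__main__":
    # File names and parameter values are placeholders.
    argv = build_wtdbg2_cmd(["reads.fa"], "asm", preset="rs",
                            genome_size="4.6m", threads=16)
    print(shlex.join(argv))
```

Passing the list to `subprocess.run` instead of printing it would launch the assembler, assuming wtdbg2 is installed and on the PATH.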

OPTIONS

-i <string>: Long reads sequences file (REQUIRED; can be multiple), []
-o <string>: Prefix of output files (REQUIRED), []
-t <int>: Number of threads, 0 for all cores, [4]
-f: Force to overwrite output files
-x <string>: Presets, comma delimited, []

    preset1/rsII/rs: -p 21 -S 4 -s 0.05 -L 5000
    preset2: -p 0 -k 15 -AS 2 -s 0.05 -L 5000
    preset3: -p 19 -AS 2 -s 0.05 -L 5000
    sequel/sq
    nanopore/ont:
        (genome size < 1G: preset2) -p 0 -k 15 -AS 2 -s 0.05 -L 5000
        (genome size >= 1G: preset3) -p 19 -AS 2 -s 0.05 -L 5000
    preset4/corrected/ccs: -p 21 -k 0 -AS 4 -K 0.05 -s 0.5

-g <number>: Approximate genome size (k/m/g suffix allowed), [0]
-X <float>: Choose the best <float> depth from input reads (effective with -g), [50.0]
-L <int>: Choose the longest subread and drop reads shorter than <int> (5000 recommended for PacBio), [0]. A negative integer also tidies read names, e.g. -5000.
-k <int>: Kmer fsize, 0 <= k <= 23, [0]
-p <int>: Kmer psize, 0 <= p <= 23, [21]. k + p <= 25; the seed is <k-mer>+<p-homopolymer-compressed>.
-K <float>: Filter high-frequency kmers, which may be repetitive, [1000.05]. With the default, kmers with frequency >= 1000 are filtered while indexing >= (1 - 0.05) * total_kmers_count.
-S <float>: Subsample kmers; 1/(<-S>) of the kmers are indexed, [4.00]. -S is very useful for saving memory and speeding up the run, but note that subsampled kmers yield a shorter matched length.
-l <float>: Min length of alignment, [2048]
-m <float>: Min matched length by kmer matching, [200]
-R: Enable realignment mode
-A: Keep contained reads during alignment
-s <float>: Min similarity, calculated by kmer matched length / aligned length, [0.05]
-e <int>: Min read depth of a valid edge, [3]
-q: Quiet
-v: Verbose (can be multiple)
-V: Print version information and then exit
--help: Show more options
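
The ranges given above for -k and -p can be checked before launching a run. The sketch below is a minimal illustration of those stated constraints; the function name `check_seed_sizes` is hypothetical and not part of wtdbg2.

```python
def check_seed_sizes(k=0, p=21):
    """Validate -k / -p against the ranges stated in the option list.

    Returns k + p, the combined size used for the seed. Defaults mirror
    the documented defaults (-k 0, -p 21).
    """
    if not 0 <= k <= 23:
        raise ValueError("k must satisfy 0 <= k <= 23")
    if not 0 <= p <= 23:
        raise ValueError("p must satisfy 0 <= p <= 23")
    if k + p > 25:
        raise ValueError("k + p must not exceed 25")
    return k + p


if __name__ == "__main__":
    print(check_seed_sizes())           # documented defaults
    print(check_seed_sizes(k=15, p=0))  # values used by preset2
```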

** more options **

--cpu <int>

: See -t 0, default: all cores

--input <string> +

: See -i

--force

: See -f

--prefix <string>

: See -o

--preset <string>

: See -x

--kmer-fsize <int>

: See -k 0

--kmer-psize <int>

: See -p 21

--kmer-depth-max <float>

: See -K 1000.05

-E, --kmer-depth-min <int>

: Min kmer frequency, [2]

--kmer-subsampling <float>

: See -S 4.0

--kbm-parts <int>

: Split total reads into multiple parts, index one part by one to save memory, [1]

--aln-kmer-sampling <int>

: Select no more than n seeds in a query bin, default: 256

--dp-max-gap <int>

: Max number of bins (256 bp) in one gap, [4]

--dp-max-var <int>

: Max number of bins (256 bp) in one deviation, [4]

--dp-penalty-gap <int>

: Penalty for BIN gap, [-7]

--dp-penalty-var <int>

: Penalty for BIN deviation, [-21]

--aln-min-length <int>

: See -l 2048

--aln-min-match <int>

: See -m 200. Here the number of matches counts the base pairs covered by the matched kmer regions

--aln-min-similarity <float>

: See -s 0.05
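
The three thresholds above (-l, -m, -s) can be read together as a filter on candidate alignments. The following sketch assumes the similarity is computed exactly as described for -s (kmer matched length divided by aligned length) and that each threshold is an inclusive lower bound; the function name and the inclusive comparison are assumptions, not taken from the wtdbg2 source.

```python
def passes_alignment_filters(aligned_len, matched_len,
                             min_length=2048, min_match=200,
                             min_similarity=0.05):
    """Apply --aln-min-length, --aln-min-match and --aln-min-similarity.

    Defaults mirror the documented defaults. Inclusive comparisons are
    an assumption made for this illustration.
    """
    if aligned_len <= 0:
        return False
    similarity = matched_len / aligned_len
    return (aligned_len >= min_length
            and matched_len >= min_match
            and similarity >= min_similarity)


if __name__ == "__main__":
    # 300 matched bp over 4096 aligned bp gives a similarity of about 0.073.
    print(passes_alignment_filters(4096, 300))
```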

--aln-max-var <float>

: Max length variation of two aligned fragments, default: 0.25

--aln-dovetail <int>

: Retain dovetail overlaps only. The max overhang size is <--aln-dovetail>; the value should be a multiple of 256, or -1 to disable filtering. Default: 256

--aln-strand <int>

: 1: forward, 2: reverse, 3: both. Please don't change the default value 3 unless you know exactly what you are doing

--aln-maxhit <int>

: Max n hits for each read when building the graph, default: 1000

--aln-bestn <int>

: Use the best n hits for each read when building the graph, 0: keep all, default: 500. <prefix>.alignments always stores all alignments

-R, --realign

: Enable re-alignment, see --realn-kmer-psize=15, --realn-kmer-subsampling=1, --realn-min-length=2048, --realn-min-match=200, --realn-min-similarity=0.1, --realn-max-var=0.25

--realn-kmer-psize <int>

: Set kmer-psize in realignment (kmer-ksize always eq 0), default: 15

--realn-kmer-subsampling <int>

: Set kmer-subsampling in realignment, default: 1

--realn-min-length <int>

: Set aln-min-length in realignment, default: 2048

--realn-min-match <int>

: Set aln-min-match in realignment, default: 200

--realn-min-similarity <float>

: Set aln-min-similarity in realignment, default: 0.1

--realn-max-var <float>

: Set aln-max-var in realignment, default: 0.25

-A, --aln-noskip

: Even if a read was contained in a previous alignment, still align it against other reads

--keep-multiple-alignment-parts

: By default, wtdbg keeps only the best alignment between two reads after chaining. This option disables that and keeps multiple alignment parts

--verbose +

: See -v. -vvvv will display the most detailed information

--quiet

: See -q

--limit-input <int>

: Limit the input sequences to at most <int> M bp. Usually for testing

-L <int>, --tidy-reads <int>

: Default: 0. Pick the longest subreads if possible and filter out reads shorter than <--tidy-reads>. Add --tidy-name or set --tidy-reads to a negative value to also rename reads. Set to 0 bp to disable tidying. The suggested value is 5000 for PacBio RSII reads

--tidy-name

: Rename reads into 'S%010d' format. The first read is named as S0000000001
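
The 'S%010d' pattern is a printf-style format: the letter S followed by the read index zero-padded to ten digits. A minimal Python sketch of the same formatting, with a hypothetical helper name, looks like this:

```python
def tidy_name(index):
    """Format a 1-based read index using the 'S%010d' pattern."""
    return "S%010d" % index


if __name__ == "__main__":
    print(tidy_name(1))   # the first read
    print(tidy_name(42))
```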

--rdname-filter <string>

: A file containing lines of read names to be discarded during loading. If you want to filter reads yourself, please also set -X 0

--rdname-includeonly <string>

: The reverse of --rdname-filter: only the reads named in the file are kept

-g <number>, --genome-size <number>

: Provide the genome size, e.g. 100.4m, 2.3g. In this version, it is used with -X/--rdcov-cutoff to select reads just after all of them have been read.

-X <float>, --rdcov-cutoff <float>

: Default: 50.0. Retain 50.0-fold genome coverage; combined with -g and --rdcov-filter.

--rdcov-filter [0|1]

: Default: 0. Strategy 0: retain the longest reads. Strategy 1: retain median-length reads.
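
To make the interplay of -g, -X and strategy 0 concrete, the sketch below keeps the longest reads until their total length reaches coverage times genome size. It is an illustration only: the function names are hypothetical, the k/m/g suffixes are assumed to be decimal (10^3, 10^6, 10^9), and the exact stopping rule in wtdbg2 may differ.

```python
def parse_genome_size(text):
    """Parse values such as '100.4m' or '2.3g' into base pairs.

    Assumption: k/m/g are decimal multipliers.
    """
    units = {"k": 10**3, "m": 10**6, "g": 10**9}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(round(float(text[:-1]) * units[text[-1]]))
    return int(float(text))


def select_longest_reads(read_lengths, genome_size, coverage=50.0):
    """Keep the longest reads until coverage * genome_size bp is reached.

    A genome size or coverage of 0 disables the selection, mirroring the
    notes that -X is effective only with -g and that -X 0 turns it off.
    """
    if genome_size <= 0 or coverage <= 0:
        return list(read_lengths)
    target = coverage * genome_size
    kept, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target:
            break
        kept.append(length)
        total += length
    return kept


if __name__ == "__main__":
    size = parse_genome_size("4.6m")
    print(size)
    print(select_longest_reads([100, 400, 300, 200], genome_size=100, coverage=6.0))
```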

--err-free-nodes

: Select nodes from error-free sequences only. E.g. if you have contigs assembled from NGS-WGS reads as well as long noisy reads, you can type '--err-free-seq your_ctg.fa --input your_long_reads.fa --err-free-nodes' to perform an assembly that acts somewhat like long-read scaffolding

--node-len <int>

: The default value is 1024, a multiple of KBM_BIN_SIZE (always 256 bp). It specifies the length of intervals (called nodes after selection). kbm indexes sequences into BINs of 256 bp, so many parameters should be multiples of 256 bp: --node-len, --node-ovl, --aln-min-length, --aln-dovetail. Other parameters are counted in BINs: --dp-max-gap, --dp-max-var.
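
Because several parameters are expected to be multiples of the 256 bp BIN size while others are counted in BINs, a small conversion helper can catch mismatched values early. The helper names below are hypothetical; the special value -1 accepted by --aln-dovetail is not handled here.

```python
KBM_BIN_SIZE = 256  # bp, as stated for KBM_BIN_SIZE above


def bp_to_bins(bp):
    """Convert a length in bp to BINs, rejecting non-multiples of 256."""
    if bp < 0 or bp % KBM_BIN_SIZE != 0:
        raise ValueError("%d is not a non-negative multiple of %d" % (bp, KBM_BIN_SIZE))
    return bp // KBM_BIN_SIZE


def bins_to_bp(bins):
    """Convert a BIN count (e.g. --dp-max-gap) to bp."""
    return bins * KBM_BIN_SIZE


if __name__ == "__main__":
    print(bp_to_bins(1024))  # default --node-len
    print(bins_to_bp(4))     # default --dp-max-gap
```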

--node-matched-bins <int>

: Min matched bins in a node, default: 1

--node-ovl <int>

: Default: 256. Max overlap size between two adjacent intervals in any read. It is used to select the best nodes to represent reads in the graph

--node-drop <float>

: Default: 0.25. Discard a node when more than this ratio of its intervals conflict with previously generated nodes

-e <int>, --edge-min=<int>

: Default: 3. The minimal depth of a valid edge; in other words, valid edges must be supported by at least 3 reads. When the sequencing depth is low, try --edge-min 2; when it is very high, try --edge-min 4
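
The effect of --edge-min can be pictured as a depth filter over candidate edges. The sketch below uses a hypothetical dictionary mapping node pairs to the number of supporting reads; it ignores the rescue of low-coverage edges mentioned under --drop-low-cov-edges.

```python
def valid_edges(edge_depths, edge_min=3):
    """Keep edges supported by at least edge_min reads."""
    return {edge: depth for edge, depth in edge_depths.items()
            if depth >= edge_min}


if __name__ == "__main__":
    # Node names and depths are made up for illustration.
    candidates = {("n1", "n2"): 3, ("n2", "n3"): 2, ("n3", "n4"): 7}
    print(valid_edges(candidates))
    print(valid_edges(candidates, edge_min=2))
```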

--edge-max-span <int>

: Default: 1024 BINs. The program builds edges no longer than 1024 BINs

--drop-low-cov-edges

: Don't attempt to rescue low coverage edges

--node-min <int>

: Min depth of an interval to be selected as a valid node. By default, this value is the same as --edge-min.

--node-max <int>

: Nodes with too high a depth are regarded as repetitive and masked. Default: 200, i.e. nodes contained in more than 200 reads are masked

--ttr-cutoff-depth <int>, 0

--ttr-cutoff-ratio <float>, 0.5

: Tiny Tandem Repeat (TTR). A node located inside a TTR brings noise into the graph and should be masked. The pattern of such nodes is: depth >= <--ttr-cutoff-depth>, and none of their edges has a depth greater than depth * <--ttr-cutoff-ratio 0.5>. Set --ttr-cutoff-depth 0 to disable TTR masking
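
The TTR pattern described above translates directly into a predicate on a node's depth and the depths of its edges. The function below is a sketch of that stated rule only; its name and the way depths are passed in are assumptions.

```python
def is_ttr_node(node_depth, edge_depths, cutoff_depth=0, cutoff_ratio=0.5):
    """Return True if a node matches the tiny-tandem-repeat pattern.

    Pattern: node depth >= cutoff_depth and no edge depth greater than
    node_depth * cutoff_ratio. A cutoff_depth of 0 disables masking.
    """
    if cutoff_depth <= 0:
        return False
    if node_depth < cutoff_depth:
        return False
    return all(depth <= node_depth * cutoff_ratio for depth in edge_depths)


if __name__ == "__main__":
    # Depth values are made up for illustration.
    print(is_ttr_node(100, [30, 50], cutoff_depth=50))
    print(is_ttr_node(100, [30, 60], cutoff_depth=50))
```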

--dump-kbm <string>

: Dump the kbm index into a file to be loaded by `kbm` or `wtdbg`

--dump-seqs <string>

: Dump the kbm index (sequences only, no k-mer index) into a file to be loaded by `kbm` or `wtdbg`. Please note: normally load it with --load-kbm, not with --load-seqs

--load-kbm <string>

: Instead of reading sequences and building the kbm index, which is time-consuming, load the kbm index from a previously dumped file. Note that once the kbm index has been mmapped by `kbm -R <kbm-index> start`, loading only attaches to the shared memory and takes very little time. See `kbm` -R <your_seqs.kbmidx> [start | stop]

--load-seqs <string>

: Similar to --load-kbm, but only uses the sequences in the kbmidx and rebuilds the index in the process's RAM.

--load-alignments <string> +

: `wtdbg` writes read alignments to <--prefix>.alignments; the program can load them to build the assembly graph quickly. You can also provide alignments from another source. With --load-alignments, the program only reads the long sequences and skips building the kbm index. You can give --load-alignments <file> more than once to load alignments from several files

--load-clips <string>

: Combined with --load-nodes. Load read clips. You can find them in `wtdbg`'s <--prefix>.clps

--load-nodes <string>

: Load dumped nodes from a previous execution to construct the assembly graph quickly; should be combined with --load-clips. You can find them in `wtdbg`'s <--prefix>.1.nodes

--bubble-step <int>

: Max step to search a bubble, meaning the max step from the starting node to the ending node. Default: 40

--tip-step <int>

: Max step to search a tip, default: 10

--ctg-min-length <int>

: Min length of contigs to be output, default: 5000

--ctg-min-nodes <int>

: Min number of nodes in a contig to be output, default: 3
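
Taken together, --ctg-min-length and --ctg-min-nodes act as an output filter on contigs. A minimal sketch, assuming both thresholds are inclusive lower bounds and using a hypothetical function name:

```python
def keep_contig(length_bp, n_nodes, min_length=5000, min_nodes=3):
    """Decide whether a contig would be written to the output.

    Defaults mirror the values shown for --ctg-min-length and
    --ctg-min-nodes. Inclusive comparisons are an assumption.
    """
    return length_bp >= min_length and n_nodes >= min_nodes


if __name__ == "__main__":
    print(keep_contig(12000, 10))
    print(keep_contig(4000, 10))
```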

--minimal-output

: Generate as few output files (<--prefix>.*) as possible

--bin-complexity-cutoff <int>

: Used in filtering BINs. If a BIN has fewer indexed valid kmers than <--bin-complexity-cutoff 2>, mask it.

--no-local-graph-analysis

: Before building edges, local-graph-analysis reads, for each node, all related reads and the corresponding nodes, and builds a local graph to judge whether to mask the node. The analysis aims to find repetitive nodes. This option disables it

--no-read-length-sort

: By default, `wtdbg` sorts input sequences in descending order of length. The order of reads affects how nodes are generated when selecting important intervals

--keep-isolated-nodes

: During graph cleaning, `wtdbg` normally masks isolated (orphaned) nodes. This option keeps them

--no-read-clip

: By default, `wtdbg` clips an input sequence by analyzing its overlaps to remove high-error ends, rolling-circle repeats (see PacBio CCS), and chimeras. Clipped regions do not contribute when building edges. However, `wtdbg` uses them in the final linking of unitigs

--no-chainning-clip

: By default, alignment chaining is performed during read clipping. ** With '--aln-bestn 0 --no-read-clip', alignments are parsed directly and less RAM is spent on recording alignments

AUTHOR

This manpage was written by Andreas Tille for the Debian distribution and
can be used for any other usage of the program.

April 2020

wtdbg2 2.5