BBNORM.SH(1) User Commands BBNORM.SH(1)

bbnorm.sh - Kmer-based error-correction and normalization tool

bbnorm.sh in=<input> out=<reads to keep> outt=<reads to toss> hist=<histogram output>

Normalizes read depth based on kmer counts. Can also error-correct, bin reads by kmer depth, and generate a kmer depth histogram. However, Tadpole has superior error-correction to BBNorm. Please read bbmap/docs/guides/BBNormGuide.txt for more information.
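As an illustration, a command line can be assembled from a script. The sketch below uses only the four key=value parameters shown in the synopsis; the file names are placeholders and the helper name build_bbnorm_command is hypothetical, not part of BBTools.

```python
import shlex


def build_bbnorm_command(inp, out, outt=None, hist=None):
    """Assemble a bbnorm.sh argument list from key=value parameters."""
    args = ["bbnorm.sh", f"in={inp}", f"out={out}"]
    if outt is not None:
        args.append(f"outt={outt}")   # reads excluded from the primary output
    if hist is not None:
        args.append(f"hist={hist}")   # kmer depth histogram
    return args


cmd = build_bbnorm_command("reads.fq", "kept.fq", outt="tossed.fq", hist="khist.txt")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues with unusual file names.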

Primary input. Use in2 for paired reads in a second file
Second input file for paired reads in two files
Additional files to use for input (generating hash table) but not for output
Break up FASTA reads longer than this. Can be useful when processing scaffolded genomes
Use at most this many reads when building the hashtable (-1 means all)
Process every nth kmer, and skip the rest
Process every nth read, and skip the rest
May be set to true or false to override autodetection of whether the input file is paired and interleaved.
ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.

File for normalized or corrected reads. Use out2 for paired reads in a second file
(outtoss) File for reads that were excluded from primary output
Only process this number of reads, then quit (-1 means all)
Use sampling on output as well as input (not used if sample rates are 1)
Set to true to keep all reads (e.g. if you just want error correction).
Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin in histograms.
Default is false, to prevent confusion about how there can be 0-count kmers. The reason is that based on the 'minq' and 'minprob' settings, some kmers may be excluded from the bloom filter.
This will specify a directory for temp files (only needed for multipass runs). If null, they will be written to the output directory.
Allows enabling/disabling of temporary directory; if disabled, temp files will be written to the output directory.
ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
Rename reads based on their kmer depth.

Kmer length (values under 32 are most efficient, but arbitrarily high values are supported)
Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^cbits. Automatically reduced to 16 in 2-pass.
Large values decrease accuracy for a fixed amount of memory, so use the lowest number you can that will still capture highest-depth kmers.
Number of times each kmer is hashed and stored. Higher is slower.
Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory.
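The interaction between cell width and hash count can be illustrated with a toy counting Bloom filter. This is a minimal sketch, not BBNorm's implementation: the class name, the SHA-256 based hashing, and the single shared cell array are all assumptions made for illustration. Cells in this sketch saturate at 2^bits - 1, the largest value a counter of that width can store.

```python
import hashlib


class CountingFilter:
    """Toy counting Bloom filter with saturating cells (illustrative only)."""

    def __init__(self, cells, bits, hashes):
        self.cells = [0] * cells
        self.max_value = (1 << bits) - 1  # largest value a cell of this width holds
        self.hashes = hashes

    def _slots(self, kmer):
        # One slot per hash, derived by salting the kmer with the hash index.
        # A set is returned so a kmer never increments the same cell twice.
        slots = set()
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            slots.add(int.from_bytes(digest[:8], "big") % len(self.cells))
        return slots

    def add(self, kmer):
        for slot in self._slots(kmer):
            if self.cells[slot] < self.max_value:
                self.cells[slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a cell, so the minimum is the best estimate.
        return min(self.cells[slot] for slot in self._slots(kmer))


f = CountingFilter(cells=1024, bits=2, hashes=3)
for _ in range(5):
    f.add("ACGT")
print(f.count("ACGT"))  # capped by the 2-bit cells
```

With more hashes, a false overcount requires collisions in every slot, which is less likely while the table is sparse. Once the table fills up, each extra hash adds more occupied cells and collisions become more frequent, which matches the accuracy tradeoff described above.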
True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable. The prefilter is more memory-efficient because it uses 2-bit cells.
Number of hashes for prefilter.
(pbits) Bits per cell in prefilter.
Fraction of memory to allocate to prefilter.
More passes can sometimes increase accuracy by iteratively removing low-depth kmers
Ignore kmers containing bases with quality below this
Ignore kmers with overall probability of correctness below this
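The manual does not spell out how the overall probability is computed. A minimal sketch, assuming it is the product of the per-base correctness probabilities implied by the Phred scores, and assuming the function name and the default ASCII offset of 33:

```python
def kmer_correct_probability(quals, offset=33):
    """Probability that every base in a kmer is correct, from its quality string."""
    prob = 1.0
    for ch in quals:
        q = ord(ch) - offset            # Phred score of this base
        prob *= 1.0 - 10 ** (-q / 10)   # per-base probability of being correct
    return prob


# Three bases at Q30 ('?' with offset 33): each is 99.9% likely to be correct.
print(kmer_correct_probability("???"))
```

Under this assumption a single low-quality base pulls the whole kmer's probability down, so a probability cutoff and a per-base quality cutoff filter overlapping but not identical sets of kmers.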
(t) Spawn exactly X hashing threads (default is number of logical processors). Total active threads may exceed X due to I/O threads.
(removeduplicatekmers) When true, a kmer's count will only be incremented once per read pair, even if that kmer occurs more than once.

(fs) Do a slower, high-precision bloom filter lookup of kmers that appear to have an abnormally high depth due to collisions.
(tgt) Target normalization depth. NOTE: All depth parameters control kmer depth, not read depth.
For kmer depth Dk, read depth Dr, read length R, and kmer size K: Dr=Dk*(R/(R-K+1))
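The conversion above can be written out directly. The function names below are illustrative; the formula is the one given in the text.

```python
def read_depth_from_kmer_depth(dk, read_len, k):
    """Dr = Dk * (R / (R - K + 1))"""
    if read_len < k:
        raise ValueError("read is shorter than the kmer length")
    return dk * read_len / (read_len - k + 1)


def kmer_depth_from_read_depth(dr, read_len, k):
    """Inverse of the formula above: Dk = Dr * ((R - K + 1) / R)"""
    if read_len < k:
        raise ValueError("read is shorter than the kmer length")
    return dr * (read_len - k + 1) / read_len


# A kmer depth of 100 with 150 bp reads and K=31 corresponds to read depth 125.
print(read_depth_from_kmer_depth(100, 150, 31))
```

The inverse is useful when a desired read depth is known and the matching kmer depth has to be supplied as the target.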
(max) Reads will not be downsampled when below this depth, even if they are above the target depth.
(min) Kmers with depth below this number will not be included when calculating the depth of a read.
(mgkpr) Reads must have at least this many kmers over min depth to be retained. Aka 'mingoodkmersperread'.
(dp) Read depth is by default inferred from the 54th percentile of kmer depth, but this may be changed to any number 1-100.
(uld) For pairs, use the depth of the lower read as the depth proxy.
(dr) Generate random numbers deterministically to ensure identical output between multiple runs. May decrease speed with a huge number of threads.
(p) 1 pass is the basic mode. 2 passes (default) allow greater accuracy, error detection, and better control of output depth.
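The normalization parameters above can be pictured with a small sketch. Everything here is an assumption made for illustration: the function names, the ascending sort with truncating index for the percentile, and the keep probability of target/depth, which is a common downsampling rule rather than a formula documented for BBNorm.

```python
def read_depth_proxy(kmer_depths, percentile=54, mindepth=0):
    """Pick a read's depth from a percentile of its kmer depths."""
    # Kmers below mindepth are left out of the calculation.
    good = sorted(d for d in kmer_depths if d >= mindepth)
    if not good:
        return 0
    idx = min(len(good) - 1, int(len(good) * percentile / 100))
    return good[idx]


def keep_probability(depth, target, maxdepth=None):
    """Chance of retaining a read with the given depth proxy."""
    if depth <= target:
        return 1.0
    if maxdepth is not None and depth < maxdepth:
        return 1.0  # below maxdepth: not downsampled even though above target
    return target / depth


depth = read_depth_proxy([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(depth, keep_probability(400, target=100))
```

A read would then be kept when a random number in [0, 1) falls below the keep probability, so regions at four times the target depth retain roughly a quarter of their reads.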

(highdepthpercentile) Position in sorted kmer depth array used as proxy of a read's high kmer depth.
(lowdepthpercentile) Position in sorted kmer depth array used as proxy of a read's low kmer depth.
(tbr) Throw away reads detected as containing errors.
(rbb) Only toss bad pairs if both reads are bad.
(edr) Reads with a ratio of at least this much between their high and low depth kmers will be classified as error reads.
(ht) Threshold for high kmer. A high kmer at or above this depth is considered non-error.
(lt) Threshold for low kmer. Kmers at this and below are always considered errors.

Set to true to correct errors. NOTE: Tadpole is now preferred for ecc as it does a better job.
Correct up to this many errors per read. If more are detected, the read will remain unchanged.
(ecr) Adjacent kmers with a depth ratio of at least this much will be classified as an error.
(echt) Threshold for high kmer. A kmer at this or above may be considered non-error.
(eclt) Threshold for low kmer. Kmers at this and below are considered errors.
Do not correct bases with quality above this value.
(aggressiveErrorCorrection) Sets more aggressive values of ecr=100, ecclimit=7, echt=16, eclt=3.
(conservativeErrorCorrection) Sets more conservative values of ecr=180, ecclimit=2, echt=30, eclt=1, sl=4, pl=4.
(markErrorsOnly) Marks errors by reducing quality value of suspected errors; does not correct anything.
(markUncorrectableErrors) Marks errors only on uncorrectable reads; requires 'ecc=t'.
(ecco) Error correct by read overlap.

(lbd) Cutoff for low depth bin.
(hbd) Cutoff for high depth bin.
Pairs in which both reads have a median below lbd go into this file.
Pairs in which both reads have a median above hbd go into this file.
All other pairs go into this file.
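The binning rule described above can be sketched as a three-way decision. The function name and the "low", "mid" and "high" labels are illustrative; the comparisons follow the text, with strict "below" and "above" assumed.

```python
def depth_bin(median1, median2, lbd, hbd):
    """Assign a read pair to a depth bin from the median kmer depth of each read."""
    if median1 < lbd and median2 < lbd:
        return "low"    # both reads below the low-depth cutoff
    if median1 > hbd and median2 > hbd:
        return "high"   # both reads above the high-depth cutoff
    return "mid"        # everything else, including pairs that disagree


print(depth_bin(3, 4, lbd=10, hbd=80))
```

Note that a pair with one shallow read and one deep read lands in the middle bin, because both reads must agree for the low or high bin.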

Specify a file to write the input kmer depth histogram.
Specify a file to write the output kmer depth histogram.
(histogramcolumns) Number of histogram columns, 2 or 3.
(printzerocoverage) Print lines in the histogram with zero coverage.
Max kmer depth displayed in histogram. Also affects statistics displayed, but does not affect normalization.

Write the peaks to this file. Default is stdout.
(h) Ignore peaks shorter than this.
(v) Ignore peaks with less area than this.
(w) Ignore peaks narrower than this.
(minp) Ignore peaks with an X-value below this.
(maxp) Ignore peaks with an X-value above this.
(maxpc) Print up to this many peaks (prioritizing height).
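The peak filters above combine as successive cutoffs followed by a cap on the number of peaks. A minimal sketch, assuming a hypothetical Peak record and keyword names; it does not reproduce BBNorm's peak-calling algorithm, only the filtering described in the text.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    x: int        # kmer depth at the peak centre
    height: int   # histogram count at the centre
    width: int    # span of the peak in depth units
    volume: int   # area under the peak


def filter_peaks(peaks, min_height=0, min_volume=0, min_width=0,
                 min_x=0, max_x=float("inf"), max_count=None):
    """Drop peaks failing any cutoff, then keep the tallest max_count peaks."""
    kept = [p for p in peaks
            if p.height >= min_height and p.volume >= min_volume
            and p.width >= min_width and min_x <= p.x <= max_x]
    kept.sort(key=lambda p: p.height, reverse=True)  # prioritize height
    if max_count is not None:
        kept = kept[:max_count]
    return sorted(kept, key=lambda p: p.x)           # report in depth order


peaks = [Peak(2, 50, 3, 100), Peak(30, 900, 12, 9000),
         Peak(60, 400, 15, 5000), Peak(120, 100, 20, 1500)]
print([p.x for p in filter_peaks(peaks, min_height=80, max_count=2)])
```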

This will set Java's memory usage, overriding autodetection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
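When choosing a value by hand, the flag can be derived from the machine's physical memory. The sketch below applies the 85% figure mentioned above as a default fraction; the function name is hypothetical, and the result is expressed with the m suffix, which Java interprets as units of 1024*1024 bytes.

```python
def xmx_flag(physical_bytes, fraction=0.85):
    """Build a -Xmx flag using a fraction of physical memory, in megabytes."""
    megabytes = int(physical_bytes * fraction) // (1024 * 1024)
    return f"-Xmx{megabytes}m"


# A machine with 16 GiB of RAM.
print(xmx_flag(16 * 1024 ** 3))
```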
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
Disable assertions.

Written by Brian Bushnell (Last modified October 19, 2017)

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems, or post at: http://seqanswers.com/forums/showthread.php?t=41057

This manpage was written by Andreas Tille for the Debian distribution and can be used for any other usage of the program.

April 2019 bbnorm.sh 38.43