rabema_build_gold_standard - RABEMA Gold Standard Builder
rabema_build_gold_standard [OPTIONS]
--out-gsi OUT.gsi --reference REF.fa
--in-bam PERFECT.{sam,bam}
This program builds a RABEMA gold standard. The input is a reference
FASTA file and a "perfect" SAM/BAM mapping (e.g. one created using
RazerS 3 in full-sensitivity mode).
The input SAM/BAM file must be sorted by coordinate. The
program will create a FASTA index file REF.fa.fai for fast random
access to the reference.
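The coordinate-sort requirement can be checked before running the program.
The following is a minimal Python sketch, not part of RABEMA; it assumes the
SAM header text has already been read into a string (for BAM, for instance
via an external tool). It inspects the SO tag of the @HD header line, which
the SAM format uses to declare the sort order. The function names are
hypothetical, and the check reflects only what the header claims, not the
actual order of the records.

```python
def sam_header_sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header, or None."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def is_coordinate_sorted(header_text):
    """True if the header declares coordinate sorting."""
    return sam_header_sort_order(header_text) == "coordinate"


if __name__ == "__main__":
    # Hypothetical header, for illustration only.
    example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n"
    print(is_coordinate_sorted(example_header))
```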
- -h, --help
- Display the help message.
- --version
- Display version information.
- -v, --verbose
- Enable verbose output.
- -vv, --very-verbose
- Enable even more verbose output.
- -o, --out-gsi
OUTPUT_FILE
- Path to write the resulting GSI file to. Valid filetype is:
.gsi[.*], where * may be gz for transparent (de)compression.
- -r, --reference
INPUT_FILE
- Path to load reference FASTA from. Valid filetypes are: .sam[.*],
.raw[.*], .gbk[.*], .frn[.*], .fq[.*],
.fna[.*], .ffn[.*], .fastq[.*], .fasta[.*],
.faa[.*], .fa[.*], .embl[.*], and .bam, where
* is any of the following extensions: gz, bz2, and
bgzf for transparent (de)compression.
- -b, --in-bam
INPUT_FILE
- Path to load the "perfect" SAM/BAM file from. Valid filetypes
are: .sam[.*] and .bam, where * is any of the following
extensions: gz, bz2, and bgzf for transparent
(de)compression.
- --oracle-mode
- Enable oracle mode. This is used for simulated data when the input SAM/BAM
file gives exactly one position that is considered as the true sample
position.
- --match-N
- When set, N matches all characters without penalty.
- --distance-metric
STRING
- Set distance metric. One of hamming and edit. Default: edit.
- -e, --max-error
INTEGER
- Maximal error rate, in percent, to build the gold standard for. This
parameter is an integer and relative to the read length. In oracle mode, the
error rate of the read at its sampling position is used, and the value given
here (RATE) acts as a cutoff threshold. Default: 0.
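To illustrate how --distance-metric, --match-N and --max-error interact, here
is a minimal Python sketch. It is not RABEMA's implementation: the function
names are hypothetical, the edit distance shown is the plain global
(Levenshtein) variant, and rounding the error budget down is an assumption,
since the text above only says that the rate is an integer percentage
relative to the read length.

```python
def chars_match(a, b, match_n=False):
    """Compare two characters; with match_n, N matches anything."""
    if match_n and "N" in (a, b):
        return True
    return a == b


def hamming_distance(s, t, match_n=False):
    """Number of mismatching positions; only substitutions are counted."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(not chars_match(a, b, match_n) for a, b in zip(s, t))


def edit_distance(s, t, match_n=False):
    """Levenshtein distance: substitutions, insertions and deletions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if chars_match(a, b, match_n) else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def max_errors(read_length, max_error_percent):
    """Error budget for one read. ASSUMPTION: rounded down."""
    return (read_length * max_error_percent) // 100
```

Under these assumptions, a 100bp read with --max-error 5 may carry up to 5
errors. A missing base counts as one error under edit distance, whereas
Hamming distance is only defined for sequences of equal length.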
A return value of 0 indicates success, any other value indicates
an error.
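The return value convention makes the program easy to drive from a script.
The Python sketch below assembles a command line from the options documented
above and treats a return value of 0 as success. The helper names and the
file names in the usage note are hypothetical, and the binary is assumed to
be on the PATH.

```python
import subprocess


def build_command(out_gsi, reference, in_bam, max_error=0,
                  distance_metric="edit", oracle_mode=False, match_n=False):
    """Assemble an argument list using only the documented options."""
    if distance_metric not in ("hamming", "edit"):
        raise ValueError("distance_metric must be 'hamming' or 'edit'")
    cmd = [
        "rabema_build_gold_standard",
        "--out-gsi", out_gsi,
        "--reference", reference,
        "--in-bam", in_bam,
        "--max-error", str(int(max_error)),
        "--distance-metric", distance_metric,
    ]
    if oracle_mode:
        cmd.append("--oracle-mode")
    if match_n:
        cmd.append("--match-N")
    return cmd


def run_builder(cmd):
    """Run the command; True on return value 0, False otherwise.

    Raises FileNotFoundError if the binary is not installed.
    """
    completed = subprocess.run(cmd)
    return completed.returncode == 0
```

For example, build_command("OUT.gsi", "REF.fa", "PERFECT.bam", max_error=5)
yields the argument list for a gold standard at 5% error rate, which can then
be passed to run_builder.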
From version 1.1, great care has been taken to keep the memory
requirements as low as possible. The memory required is two times the size
of the largest chromosome plus some constant memory for each match.
For example, the memory usage for 100bp human genome reads at 5%
error rate was 1.7GB. Of this, roughly 400MB came from the chromosome and
1.3GB from the matches.
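The memory rule above can be written as a small back-of-the-envelope
calculation. In the Python sketch below, the function name, the per-match
cost and the match count are hypothetical placeholders, not measured values;
they are chosen only so that the two terms come to about 0.4GB and 1.3GB and
add up to the 1.7GB total from the example.

```python
def estimate_memory_bytes(largest_chromosome_bytes, num_matches,
                          bytes_per_match):
    """Twice the largest chromosome plus a constant cost per match."""
    return 2 * largest_chromosome_bytes + num_matches * bytes_per_match


if __name__ == "__main__":
    # All three inputs are hypothetical, for illustration only.
    chromosome_part = estimate_memory_bytes(200_000_000, 0, 20)
    total = estimate_memory_bytes(200_000_000, 65_000_000, 20)
    print(chromosome_part, total - chromosome_part, total)
```

The chromosome term depends only on the reference. The match term grows with
the number of matches, so it is the one to watch when raising the error rate.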