rabema_evaluate - RABEMA Evaluation
rabema_evaluate [OPTIONS] --reference
REF.fa --in-gsi IN.gsi --in-bam
MAPPING.{sam,bam}
Compare the SAM/BAM output MAPPING.sam/MAPPING.bam
of any read mapper against the RABEMA gold standard previously built with
rabema_build_gold_standard. The input is a reference FASTA file, a
gold standard interval (GSI) file and the SAM/BAM input to evaluate.
The input SAM/BAM file must be sorted by queryname. The
program will create a FASTA index file REF.fa.fai for fast random
access to the reference.
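A typical invocation can be sketched as follows. The wrapper function
rabema_eval and the DRY_RUN variable are hypothetical helpers introduced
only for illustration; the file names are the placeholders from the
synopsis, and the options are the ones documented on this page.

```shell
# Hypothetical wrapper around rabema_evaluate.
# Set DRY_RUN=1 to print the command instead of executing it.
rabema_eval() {
    # $1 reference FASTA, $2 GSI file, $3 SAM/BAM mapping, $4 TSV report
    set -- rabema_evaluate --reference "$1" --in-gsi "$2" \
           --in-bam "$3" --out-tsv "$4"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"      # show what would be run
    else
        "$@"           # run the tool with the assembled arguments
    fi
}

# Print the command line without running it.
DRY_RUN=1 rabema_eval REF.fa IN.gsi MAPPING.bam MAPPING.rabema_report_tsv
```

Without DRY_RUN=1 the same call runs the tool, which requires
rabema_evaluate on the PATH and a mapping file sorted by queryname.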
- -h, --help
- Display the help message.
- --version
- Display version information.
- -v, --verbose
- Enable verbose output.
- -vv,
--very-verbose
- Enable even more verbose output.
- -r, --reference
INPUT_FILE
- Path to load reference FASTA from. Valid filetypes are: .sam[.*],
.raw[.*], .gbk[.*], .frn[.*], .fq[.*],
.fna[.*], .ffn[.*], .fastq[.*], .fasta[.*],
.faa[.*], .fa[.*], .embl[.*], and .bam, where
* is any of the following extensions: gz, bz2, and
bgzf for transparent (de)compression.
- -g, --in-gsi
INPUT_FILE
- Path to load gold standard intervals from. If compressed using gzip, the
file will be decompressed on the fly. Valid filetype is: .gsi[.*],
where * is any of the following extensions: gz for transparent
(de)compression.
- -b, --in-bam
INPUT_FILE
- Path to load the read mapper SAM or BAM output from. Valid filetypes are:
.sam[.*] and .bam, where * is any of the following
extensions: gz, bz2, and bgzf for transparent
(de)compression.
- --out-tsv
OUTPUT_FILE
- Path to write the statistics to as TSV. Valid filetype is:
.rabema_report_tsv.
- --dont-check-sorting
- Do not check sortedness (by name) of input SAM/BAM files. This is required
if the reads are not sorted by name in the original FASTQ files. Files
from the SRA and ENA generally are sorted.
- --oracle-mode
- Enable oracle mode. This is used for simulated data when the input GSI
file gives exactly one position that is considered as the true sample
position.
- --only-unique-reads
- Consider only reads that have a single alignment in the mapping result
file. Useful for precision computation.
- --match-N
- When set, N matches all characters without penalty.
- --distance-metric
STRING
- Set distance metric. One of hamming and edit. Default: edit.
- -e, --max-error
INTEGER
- Maximal error rate to build gold standard for, in percent. This parameter
is an integer and relative to the read length. The error rate is ignored
in oracle mode; there, the distance of the read at the sample position is
used, individually for each read. Default: 0.
- -c,
--benchmark-category STRING
- Set benchmark category. One of all, all-best, and any-best. Default:
all.
- --trust-NM
- When set, we trust the alignment and distance from SAM/BAM file and no
realignment is performed. Off by default.
- If the CIGAR string is absent, the missing alignment end position can be
provided by this BAM tag.
- --ignore-paired-flags
- When set, we ignore all SAM/BAM flags related to pairing. This is
necessary when analyzing SAM from SOAP's soap2sam.pl script.
- --DONT-PANIC
- Do not stop program execution if an additional hit was found that
indicates that the gold standard is incorrect.
- --show-missed-intervals
- Show details for each missed interval from the GSI.
- --show-invalid-hits
- Show details for invalid hits (with too high error rate).
- --show-additional-hits
- Show details for additional hits (low enough error rate but not in gold
standard).
- --show-hits
- Show details for hit intervals.
- --show-try-hit
- Show details for each alignment in SAM/BAM input.
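Because --distance-metric and --benchmark-category accept only the values
listed above and --max-error must be an integer, a calling script may
want to validate them before starting a long run. The following is a
minimal sketch; check_eval_options is a hypothetical helper, not part of
RABEMA.

```shell
# Hypothetical helper: validate option values against the lists documented
# above and print the corresponding option string on success.
check_eval_options() {
    metric=$1
    category=$2
    max_error=$3
    case "$metric" in
        hamming|edit) ;;
        *) echo "invalid --distance-metric: $metric" >&2; return 1 ;;
    esac
    case "$category" in
        all|all-best|any-best) ;;
        *) echo "invalid --benchmark-category: $category" >&2; return 1 ;;
    esac
    case "$max_error" in
        # reject empty strings and anything containing a non-digit
        ''|*[!0-9]*) echo "invalid --max-error: $max_error" >&2; return 1 ;;
    esac
    echo "--distance-metric $metric --benchmark-category $category --max-error $max_error"
}

check_eval_options edit all-best 4
```

The printed string can be spliced into a rabema_evaluate command line; a
non-zero return value means that one of the values was rejected.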
The occurrence of "invalid" hits in the read
mapper's output is not an error. If there are additional hits, however,
this shows an error in the gold standard.
A return value of 0 indicates success, any other value indicates
an error.
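In a script, the return value can be checked as sketched below.
run_checked is a hypothetical helper that works for any command; it
relies only on the convention stated above that 0 means success.

```shell
# Hypothetical helper: run a command and report on its exit status.
run_checked() {
    if "$@"; then
        echo "ok: $1"
    else
        status=$?      # exit status of the failed command
        echo "failed: $1 (exit status $status)" >&2
        return "$status"
    fi
}

# Usage, assuming rabema_evaluate is on the PATH and the files exist:
# run_checked rabema_evaluate --reference REF.fa --in-gsi IN.gsi \
#     --in-bam MAPPING.bam
```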
From version 1.1, great care has been taken to keep the memory
requirements as low as possible.
The evaluation step needs to store the whole reference sequence in
memory, but little else. So, for the human genome, the memory
requirements are below 4 GB, regardless of the size of the GSI or SAM/BAM
file.