crac v. 2.5.0 - Crac is a tool to analyse RNA-Seq data provided by
NGS.
crac [ options ] -i <index_file> -r
<reads_file1> [reads_file2] -k <int> -o
<output_file>
crac -h|--help
crac -f|--full-help
crac -v|--version
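When several runs are scripted, the synopsis above can be assembled programmatically. The following is a minimal sketch in Python and is not part of CRAC: the helper name build_crac_command and its arguments are hypothetical, and it only uses options listed in this page (-i, -r, -k, -o, -m, --nb-threads). It builds the argument list without running crac.

```python
from typing import List, Optional


def build_crac_command(index_prefix: str,
                       reads: List[str],
                       k: int,
                       output: str,
                       read_length: Optional[int] = None,
                       threads: int = 1) -> List[str]:
    """Return an argv list that follows the synopsis (hypothetical helper)."""
    # One reads file for single reads, two for paired reads.
    if not 1 <= len(reads) <= 2:
        raise ValueError("expected one reads file (single) or two (paired)")
    # The page requires k < m, where m is the read length.
    if read_length is not None and not k < read_length:
        raise ValueError("k must be smaller than the read length m")

    # The four mandatory parameters, in the order of the synopsis.
    cmd = ["crac", "-i", index_prefix, "-r", *reads, "-k", str(k), "-o", output]

    # Optional parameters are appended only when requested.
    if read_length is not None:
        cmd += ["-m", str(read_length)]
    if threads > 1:
        cmd += ["--nb-threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # File names and numeric values below are placeholders, not recommendations.
    argv = build_crac_command("my_index",
                              ["reads_1.fastq", "reads_2.fastq"],
                              k=22, output="out.sam",
                              read_length=100, threads=4)
    print(" ".join(argv))
```

The returned list can be passed as is to subprocess.run, which avoids shell quoting issues with file names.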
CRAC: an integrated approach to the analysis of
RNA-seq reads
Whatever the biological questions it addresses, each RNA-seq
analysis requires a computational prediction of either small scale
mutations, indels, splice junctions or fusion RNAs. This prediction is
currently performed using complex pipelines involving multiple tools for
mapping, coverage computation, and prediction at distinct steps. We propose
a novel way of analyzing reads that integrates genomic locations and local
coverage, and delivers all the above-mentioned predictions in a single step. Our
program, CRAC, uses a double k-mer profiling approach to detect candidate
mutations, indels, splice or fusion junctions in each single read. Compared
to existing tools, CRAC provides state-of-the-art sensitivity and improved
precision for all types of predictions, yielding high rates of true positive
candidates (99.5% for splice junctions). When applied to four breast cancer
libraries, CRAC recovered 74% of validated fusion RNAs and predicted
recurrent fusion junctions that were overlooked in previous studies.
Importantly, CRAC improves its predictive performance when supplied with
e.g. 200 nt reads and should fit future needs of read analyses.
Like much software, CRAC has many optional parameters, but only four
(-i, -r, -k and -o) are mandatory. This document is intended to help CRAC
users choose the most appropriate parameters for their needs:
- -h, --help
- print the main help page of CRAC
- -f, --full-help
- print the complete help page of CRAC
- -v, --version
- print the version of CRAC
All of the following flags must be set.
- -i
<index_file>
- is the name of the index previously built with the crac-index binary.
Note that crac-index constructs the structure <index_file.ssa> together
with its configuration <index_file.conf>, so only the prefix
<index_file> must be specified (without extension) for CRAC to load both
the structure and the configuration files
- -r <reads_file1>
[reads_file2]
- is the FASTA or FASTQ file(s) containing the reads. Note that the number
of files depends on whether the reads are single or paired. The input
files may also be compressed using gzip
- -k
<int>
- is the length of the k-mers used to map the reads on the reference
<index_file>. Note that k < m is required, where m is the read length:
reads (or both paired reads) are ignored if m < k. k must be chosen to
ensure (as much as possible) that a k-mer has a very high probability of
occurring only once on the genome
- -o, --sam
<output_file>
- is the output file in SAM format (see the Documentation of SAM format in
CRAC for more details); use "-o -" to print on STDOUT
- --stranded
- must be specified if the reads were produced by a stranded RNA-Seq
protocol (not stranded by default)
- --fr/--rf/--ff
- set the mates alignment orientation (--rf by default)
- -m, --reads-length
<int>
- should be specified for reads of fixed length. If the read length is
fixed, we strongly recommend specifying it with the -m parameter, because
CRAC then runs much faster. If --reads-length <int> is specified for
variable-length or longer reads, shorter reads are ignored and longer
reads are trimmed
- --treat-multiple
<int>
- display alignments with multiple locations (with a fixed limit) rather
than a single alignment per read in the SAM file
- --nb-threads
<int>
- is the number of threads used to run crac; computation time is almost
divided by the number of threads (one thread by default)
- --max-locs
<int>
- is the maximum number of occurrences retrieved in the index for a given
k-mer: smaller is faster, but with a small value you may miss some
locations that would help CRAC detect the right cause
- --no-ambiguity
<none>
- discard biological events (splice, snv, indel, chimera) which have several
matches on the reference index. Indeed, if crac has identified a
biological cause in the read that can match in different places of the
genome, this cause is classified as a biological undetermined event.
- --all
<base_filename>
- set the output base filename for all of the following causes. Note that
only a base filename must be specified; the appropriate file extension is
then added for each cause (SNP, chimera, splice, etc.)
- --normal
<output_file>
- save reads that do not contain any break
- --almost-normal
<output_file>
- save reads that do not contain any break but with a variable support
- --single
<output_file>
- save reads which are located in this way: at least
--min-percent-single-loc <float> of k-mers are located exactly once on the
reference index
- --duplicate
<output_file>
- save reads which are located in this way: at least
--min-percent-duplication-loc <float> of k-mers are located a few times on
the reference index (i.e. between --min-duplication <int> and
--max-duplication <int> locations)
- --multiple
<output_file>
- save reads which are located in this way: at least
--min-percent-multiple-loc <float> of k-mers are located many times on the
reference index (i.e. more than --max-duplication <int>
locations)
- --none
<output_file>
- save reads which are not located on the reference index
- --snv
<output_file>
- save reads that contain at least one SNV
- --indel
<output_file>
- save reads that contain at least one biological indel
- --splice
<output_file>
- save reads that contain at least one splicing junction
- --weak-splice
<output_file>
- save reads that contain at least one low coverage splicing junction
- --chimera
<output_file>
- save reads that contain at least one chimera junction (junction on
different chromosomes, strands or genes)
- --paired-end-chimera
<output_file>
- save paired-end reads that contain a chimera in the non-sequenced part of
the original fragment
- --biological
<output_file>
- save reads that contain a biological cause but for which there is not
enough information to be more specific. Note that the biological cause is
described for each read
- --errors
<output_file>
- save reads that contain at least one sequence error
- --repeat
<output_file>
- save reads that contain a repeated sequence: at least
--min-percent-repetition-loc <float> percent of the k-mers of a given
read have at least --min-repetition <int> occurrences on the
reference index
- --undetermined
<output_file>
- save reads that contain an undetermined error: some k-mers are not located
on the genome, but the reason for that could not be determined. Note that
the error is described for each read
- --nothing
<output_file>
- save reads that are unclassified
- --deep-snv
- must be specified to increase the sensitivity of SNV detection at the
cost of more computation (substitutions only, no indels yet). This process
searches for SNVs in borderline reads, which would otherwise be classified
as bioundetermined
- --stringent-chimera
- must be specified to increase the accuracy of chimera junction detection
at the expense of sensitivity and computation time
- --emt
- launch exact matching of reads on the index. Either the argument
specified with -k is equal to 0, which means that the entire read must map
perfectly on the genome, or only one factor of length k per read is mapped
(the first one with a location) and the rest is soft-clipped. With this
process, reads are not indexed, which keeps memory consumption low. Note
that this method is very useful for mapping DGE reads.
- --server
- launch a server to query a given read more precisely. This process is
useful for debugging. Note that the output arguments are not taken into
account. Use --input-name-server <string> to set the input fifo name
(classify.fifo by default) and --output-name-server <string> to set the
output fifo name (classify.out.fifo by default). The server can then be
queried through the crac-client client
- --detailed-sam
- add more information to the SAM output file. See the Documentation of
SAM format in CRAC for more details
- --min-percent-single-loc
<float>
- is the minimum proportion of k-mers that must be uniquely mapped on the
index for a read to be considered uniquely mapped (0.15 by default)
- --min-duplication
<int>
- is the minimum number of locations to consider a k-mer duplicated (2 by
default)
- --max-duplication
<int>
- is the maximum number of locations to consider a k-mer duplicated (9 by
default)
- --min-percent-duplication-loc
<float>
- is the minimum proportion of k-mers that must be duplicated on the index
for a read to be considered duplicated (0.15 by default)
- --min-percent-multiple-loc
<float>
- is the minimum proportion of k-mers that must be mapped multiple times
on the index for a read to be considered “multiple” (0.50 by
default)
- --min-repetition
<int>
- is the minimum number of locations to consider a repeated k-mer (20 by
default)
- --max-percent-repetition-loc
<float>
- is, for a given read, the minimum proportion of k-mers that are repeated
on the index to consider a repetition (0.20 by default)
- --max-splice-length
<int>
- is the threshold to consider a splice, i.e. a splice is reported if the
junction length is below --max-splice-length <int>; a chimera is
considered otherwise (300 kb by default)
- --max-bio-indel
<int>
- is the threshold to consider a biological indel, i.e. an indel is reported
if the gap length is below --max-bio-indel <int>; a splice is considered
otherwise (15 by default)
- --max-bases-retrieved
<int>
- is the number of nucleotides to display in the output file in case of an
insertion (15 by default)
- --min-support-no-cover
<float>
- is the minimum coverage required to report a biological cause. Note that
if only a single read contains a given substitution, it is difficult (if
not impossible) to distinguish a sequence error from a biological cause
(1.30 by default)
- --min-break-length
<int>
- is the minimal break length (as a fraction of k, the k-mer length)
required for a cause to be reported. Theoretically, for a given cause, the
break length is always >= (kmer_length - 1). A shorter break may be
merged with a close enough break, or is otherwise considered
undetermined (0.5 by default)
- --max-bases-randomly-matched
<int>
- A k-mer overlapping an exon-exon junction, for example, may still match on
the genome if the overlap is at the end of the read (without loss of
generality). This is due to the fact that the nucleotides starting the
second exon may be the same as the nucleotides starting the intron.
Theoretically, there is a 0.25 probability that we have the same
nucleotide at the first position of the intron and the exon. This option
specifies how many nucleotides may be matched randomly at most
- --max-extension-length
<int>
- is the maximum number of k-mers extended on each side of a read break.
For a given break, k-mers with false locations can generate false
biological causes, so consistency is checked on each side of the break to
discard false k-mers and readjust the boundaries of the break (10 by
default)
- --nb-tags-info-stored
<int>
- is the size of the buffer that stores information for each thread during
the computing phase (1000 by default). This value should be increased if
threads work below their real capabilities: with --nb-threads 15, for
example, CPU usage should be about 1400%
- --reads-index
<string>
- is the read index data structure used by CRAC. Available read indexes
are JELLYFISH and GKARRAYS (JELLYFISH by default).
- --nb-nucleotides-snp-comparison
<int>
- is the minimum k-mer length tolerated for the deep SNVs search (8 by
default)
- --max-number-of-merges
<int>
- is the maximum number of merges tolerated during the break merge process
for the chimera detection (4 by default)
- --min-score-chimera-stringent
<float>
- is the minimal score to consider a chimera event; otherwise the event is
classified as bioundetermined (0.6 by default)
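The threshold options above can be read as simple decision rules. The sketch below, in Python, illustrates that reading and is not CRAC's implementation: the function names are hypothetical, the default values are the ones listed above, 300 kb is taken to mean 300,000 nucleotides, and whether each bound is inclusive is an assumption. It also assumes that both sides of a junction lie on the same chromosome and strand, and it ignores repeats, coverage and break merging.

```python
from typing import List


def kmer_category(n_locations: int,
                  min_duplication: int = 2,
                  max_duplication: int = 9) -> str:
    """Label a k-mer from its number of locations on the index."""
    if n_locations == 0:
        return "none"
    if n_locations == 1:
        return "single"
    if min_duplication <= n_locations <= max_duplication:
        return "duplicate"
    if n_locations > max_duplication:
        return "multiple"
    return "other"  # only reachable with non-default bounds


def read_matches(categories: List[str], wanted: str, min_fraction: float) -> bool:
    """True if at least min_fraction of the k-mers of a read are 'wanted'."""
    if not categories:
        return False
    return categories.count(wanted) / len(categories) >= min_fraction


def gap_cause(gap_length: int,
              max_bio_indel: int = 15,
              max_splice_length: int = 300_000) -> str:
    """Label a break from its gap length (same chromosome and strand assumed)."""
    if gap_length < max_bio_indel:
        return "indel"
    if gap_length < max_splice_length:
        return "splice"
    return "chimera"


if __name__ == "__main__":
    # Made-up location counts for the k-mers of one read.
    counts = [1, 1, 1, 3, 12, 0]
    cats = [kmer_category(c) for c in counts]
    print(cats)
    # 0.15 is the default of --min-percent-single-loc.
    print(read_matches(cats, "single", 0.15))
    print(gap_cause(3), gap_cause(5000), gap_cause(1_000_000))
```

Keeping each threshold as a keyword argument mirrors the command-line options, so the effect of changing, for example, --max-duplication can be explored on toy data before running CRAC.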
The full documentation for crac is maintained as an org
manual. If the info and crac programs are properly installed
at your site, the command
- info crac
should give you access to the complete manual.
You can contact Nicolas PHILIPPE, Mikael SALSON, Jerome AUDOUX and
Alban MANCHERON by sending an e-mail to
<crac-bugs@lists.gforge.inria.fr>.
- Programming:
- Nicolas PHILIPPE <nphilippe.research@gmail.com>,
Mikaël SALSON <mikael.salson@lifl.fr>,
Jérome AUDOUX <jerome.audoux@gmail.com>,
with an additional contribution to the packaging by
Alban MANCHERON <alban.mancheron@lirmm.fr>