NAME

crac v. 2.5.0 - Crac is a tool to analyse RNA-Seq data provided by NGS.

SYNOPSIS

crac [ options ] -i <index_file> -r <reads_file1> [reads_file2] -k <int> -o <output_file>

crac -h|--help
crac -f|--full-help
crac -v|--version

DESCRIPTION

CRAC: an integrated approach to the analysis of RNA-seq reads

Whatever the biological question it addresses, each RNA-seq analysis requires a computational prediction of small-scale mutations, indels, splice junctions or fusion RNAs. This prediction is currently performed using complex pipelines involving multiple tools for mapping, coverage computation, and prediction at distinct steps. We propose a novel way of analyzing reads that integrates genomic locations and local coverage, and delivers all the above-mentioned predictions in a single step. Our program, CRAC, uses a double k-mer profiling approach to detect candidate mutations, indels, splice or fusion junctions in each single read. Compared to existing tools, CRAC provides state-of-the-art sensitivity and improved precision for all types of predictions, yielding high rates of true positive candidates (99.5% for splice junctions). When applied to four breast cancer libraries, CRAC recovered 74% of validated fusion RNAs and predicted recurrent fusion junctions that were overlooked in previous studies. Importantly, CRAC improves its predictive performance when supplied with longer reads (e.g. 200 nt) and should fit future needs of read analyses.

SPECIAL OPTIONS

Like many programs, CRAC has many optional parameters, but only a few are mandatory. This document is intended to help CRAC users choose the most appropriate parameters for their needs:

-h, --help: print the main help page of CRAC
-f, --full-help: print the complete help page of CRAC
-v, --version: print the version of CRAC

USUAL ARGUMENTS

Mandatory options

All these flags must be set.

-i <index_file>: is the name of the index previously built with the crac-index binary. Note that crac-index builds the structure <index_file.ssa> together with its configuration <index_file.conf>, so only the prefix <index_file> must be specified (without extension) for CRAC to load both the structure and the configuration files
-r <reads_file1> [reads_file2]: is the FASTA or FASTQ file(s) containing the reads. Note that the number of files depends on whether the reads are single or paired. The input files may also be compressed with gzip
-k <int>: is the length of the k-mers used to map the reads on the reference <index_file>. Note that, for a read of length m, the condition k < m is necessary: reads (or both paired reads) are ignored if m < k. The value of k must be chosen to ensure (as much as possible) that a k-mer has a very high probability of occurring only once in the genome
-o, --sam <output_file>: is the output file in SAM format (see the Documentation of SAM format in CRAC for more details); use "-o -" to print on STDOUT
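
As an illustration, the mandatory flags above can be assembled into an argument list before launching crac from a script. The following Python sketch only builds and prints the command, it does not run crac. The helper name build_crac_command, the file names and the value of k are hypothetical; only flags documented on this page are used.

```python
import shlex


def build_crac_command(index_prefix, reads, k, output, options=()):
    """Build an argv list for crac from the mandatory flags (hypothetical helper)."""
    if not 1 <= len(reads) <= 2:
        raise ValueError("expected one reads file (single) or two (paired)")
    # Optional flags come first, as in the SYNOPSIS.
    cmd = ["crac", *options]
    # -i takes the index prefix only, without the .ssa/.conf extensions.
    cmd += ["-i", index_prefix, "-r", *reads, "-k", str(k), "-o", output]
    return cmd


# Hypothetical file names and k value, for illustration only.
argv = build_crac_command(
    "GRCh38", ["reads_1.fastq.gz", "reads_2.fastq.gz"], 22, "out.sam",
    options=["--nb-threads", "4"],
)
print(shlex.join(argv))
```

The resulting list can be passed as is to a process launcher such as subprocess.run.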

Optional parameters

--stranded: must be specified if the reads are produced by a stranded RNA-Seq protocol (not stranded by default)
--fr/--rf/--ff: set the mates alignment orientation (--rf by default)
-m, --reads-length <int>: must be specified for reads of fixed length. If the read length is fixed, we strongly recommend specifying it with the -m parameter, as CRAC will then be much faster. If --reads-length <int> is specified for variable-length or longer reads, shorter reads are ignored and longer reads are trimmed
--treat-multiple <int>: display alignments with multiple locations (with a fixed limit) rather than a single alignment per read in the SAM file
--nb-threads <int>: is the number of threads used to run crac. Computation time is almost divided by the number of threads (one thread by default)
--max-locs <int>: is the maximum number of occurrences retrieved in the index for a given k-mer. Smaller is faster, but with a small value you may miss some locations that would help CRAC detect the right cause
--no-ambiguity <none>: discard biological events (splice, snv, indel, chimera) that have several matches on the reference index. Indeed, if crac identifies a biological cause in a read that can match in different places of the genome, this cause is classified as an undetermined biological event.
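
The effect of --reads-length on reads of variable length (shorter reads ignored, longer reads trimmed) can be pictured with a minimal Python sketch. The function name is hypothetical, and trimming at the 3' end is an assumption: this page does not say which end is trimmed.

```python
def apply_reads_length(read, m):
    """Return the read cut to m nucleotides, or None if it is shorter than m."""
    if len(read) < m:
        return None       # shorter reads are ignored
    # ASSUMPTION: longer reads are trimmed at their 3' end.
    return read[:m]


reads = ["ACGTACGT", "ACG", "ACGTA"]
kept = [r for r in (apply_reads_length(s, 5) for s in reads) if r is not None]
print(kept)
```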

Optional output arguments

--gz <none>: all output files specified after this argument are gzipped (including the SAM file if the -o/--sam argument is specified after it)
--bam <none>: the SAM output is encoded in binary format (BAM)
--summary <output_file>: save some statistics about mapping and classification
--show-progressbar <none>: show a progress bar for the processing on STDERR
--use-x-in-cigar <none>: use X cigar operator when CRAC identifies a mismatch

Optional output homemade file formats

--all <base_filename>: set the output base filename for all the causes listed below. Note that only a base filename must be specified; the appropriate file extension is then added for each cause (SNP, chimera, splice, etc.)
--normal <output_file>: save reads that do not contain any break
--almost-normal <output_file>: save reads that do not contain any break but with a variable support
--single <output_file>: save reads located as follows: at least --min-percent-single-loc <float> of the k-mers are located exactly once on the reference index
--duplicate <output_file>: save reads located as follows: at least --min-percent-duplication-loc <float> of the k-mers are located a few times on the reference index (i.e. between --min-duplication <int> and --max-duplication <int> locations)
--multiple <output_file>: save reads located as follows: at least --min-percent-multiple-loc <float> of the k-mers are located many times on the reference index (i.e. more than --max-duplication <int> locations)
--none <output_file>: save reads which are not located on the reference index
--snv <output_file>: save reads that contain at least one SNV
--indel <output_file>: save reads that contain at least one biological indel
--splice <output_file>: save reads that contain at least one splicing junction
--weak-splice <output_file>: save reads that contain at least one low-coverage splicing junction
--chimera <output_file>: save reads that contain at least one chimera junction (junction on different chromosomes, strands or genes)
--paired-end-chimera <output_file>: save paired-end reads that contain a chimera in the non-sequenced part of the original fragment
--biological <output_file>: save reads that contain a biological cause but for which there is not enough information to be more specific. Note that the biological cause is described for each read
--errors <output_file>: save reads that contain at least one sequence error
--repeat <output_file>: save reads that contain a repeated sequence: at least --min-percent-repetition-loc <float> of the k-mers of a given read have at least --min-repetition <int> occurrences on the reference index
--undetermined <output_file>: save reads that contain an undetermined error: some k-mers are not located on the genome, but the reason for that could not be determined. Note that the error is described for each read
--nothing <output_file>: save reads that are unclassified
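
The location-based classes above (--single, --duplicate, --multiple, --none) can be sketched in Python from the number of index locations of each k-mer of a read. This is only an illustration under stated assumptions, not CRAC's implementation: the function name is hypothetical, the defaults are those listed under "Additional settings for users", a read is treated as "none" when no k-mer has any location, and no priority between classes is assumed, so a read may satisfy several of them.

```python
def location_classes(kmer_locs, min_single=0.15, min_dup=2, max_dup=9,
                     min_pct_dup=0.15, min_pct_multiple=0.50):
    """Return the set of location-based classes satisfied by a read.

    kmer_locs holds, for each k-mer of the read, its number of locations
    on the reference index (the index query itself is not modelled here).
    """
    n = len(kmer_locs)
    # ASSUMPTION: "not located" means that no k-mer has any location.
    if n == 0 or all(c == 0 for c in kmer_locs):
        return {"none"}
    single = sum(c == 1 for c in kmer_locs) / n
    duplicate = sum(min_dup <= c <= max_dup for c in kmer_locs) / n
    multiple = sum(c > max_dup for c in kmer_locs) / n
    classes = set()
    if single >= min_single:
        classes.add("single")
    if duplicate >= min_pct_dup:
        classes.add("duplicate")
    if multiple >= min_pct_multiple:
        classes.add("multiple")
    return classes


print(location_classes([1, 1, 1, 1]))
print(location_classes([1, 3, 12, 12]))
```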

Optional process for specific research

--deep-snv: must be specified to increase sensitivity when searching for SNVs, at the cost of more computation (substitutions only, no indels yet). This process searches for SNVs in borderline reads, which would otherwise be classified as bioundetermined
--stringent-chimera: must be specified to increase accuracy when searching for chimera junctions, at the cost of sensitivity and computation time

Optional process launcher (only one may be selected)

--emt: launch an exact matching process of the reads on the index. Either the argument given with -k is equal to 0, which means that the entire read must map perfectly on the genome, or only one factor of length k per read is mapped (the first one with a location) and the rest is soft-clipped. With this process, reads are not indexed, which keeps memory consumption low. Note that this kind of method is very useful for mapping DGE reads.
--server: launch a server to query a given read more precisely. This process is useful for debugging. Note that the output arguments will not be taken into account. Use --input-name-server <string> to set the input fifo name (classify.fifo by default) and --output-name-server <string> to set the output fifo name (classify.out.fifo by default). The server can then be queried through the crac-client client
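
The two modes described for --emt can be pictured with a toy Python sketch. It is only an illustration under strong assumptions: Python's str.find stands in for a query on the CRAC index, only the forward strand is searched, and the function name and sequences are hypothetical.

```python
def exact_match(read, genome, k):
    """Toy forward-strand exact matching.

    Returns (0-based position, CIGAR string), or None if nothing matches.
    """
    if k == 0:
        # k = 0: the entire read must occur exactly in the genome.
        pos = genome.find(read)
        return (pos, f"{len(read)}M") if pos >= 0 else None
    # Otherwise, map only the first factor of length k that has a location.
    for start in range(len(read) - k + 1):
        pos = genome.find(read[start:start + k])
        if pos >= 0:
            tail = len(read) - start - k
            # The rest of the read, on either side, is soft-clipped.
            cigar = (f"{start}S" if start else "") + f"{k}M" + (f"{tail}S" if tail else "")
            return pos, cigar
    return None


genome = "ACGTTGCAAGGCTTAACC"  # hypothetical toy reference
print(exact_match("GCAAGG", genome, 0))
print(exact_match("TTTGCAAGG", genome, 5))
```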

Additional settings for users

--detailed-sam: more information is added to the SAM output file. See the Documentation of SAM format in CRAC for more details
--min-percent-single-loc <float>: is the minimum proportion of k-mers that must be uniquely mapped on the index to consider a given read as uniquely mapped (0.15 by default)
--min-duplication <int>: is the minimum number of locations to consider a k-mer as duplicated (2 by default)
--max-duplication <int>: is the maximum number of locations to consider a k-mer as duplicated (9 by default)
--min-percent-duplication-loc <float>: is the minimum proportion of k-mers that must be duplicated on the index to consider a given read as duplicated (0.15 by default)
--min-percent-multiple-loc <float>: is the minimum proportion of k-mers that must be multiply mapped on the index to consider a given read as “multiple” (0.50 by default)
--min-repetition <int>: is the minimum number of locations to consider a k-mer as repeated (20 by default)
--min-percent-repetition-loc <float>: is, for a given read, the minimum proportion of k-mers that must be repeated on the index to consider a repetition (0.20 by default)
--max-splice-length <int>: is the threshold to consider a splice, i.e. a splice is reported if the junction length is below max-splice-length <int>, and a chimera is considered otherwise (300 kb by default)
--max-bio-indel <int>: is the threshold to consider a biological indel, i.e. an indel is reported if the gap length is below max-bio-indel, and a splice is considered otherwise (15 by default)
--max-bases-retrieved <int>: is the number of nucleotides to display in the output file in case of insertion (15 by default)
--min-support-no-cover <float>: is the minimum coverage required to report a biological cause. Note that if a single read contains a given substitution, it is difficult (if not impossible) to distinguish a sequence error from a biological cause (1.30 by default)
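
The way --max-bio-indel and --max-splice-length partition a gap by its length can be sketched as follows. This is a minimal Python illustration, assuming that "below" means strictly smaller and that the 300 kb default equals 300,000 nucleotides; the function name is hypothetical, and the other criteria CRAC uses for chimeras (different chromosomes or strands) are not modelled.

```python
def classify_gap(gap_length, max_bio_indel=15, max_splice_length=300_000):
    """Classify a gap on the same chromosome and strand by its length only."""
    if gap_length < max_bio_indel:
        return "indel"
    if gap_length < max_splice_length:
        return "splice"
    return "chimera"


for length in (3, 15, 2_500, 300_000):
    print(length, classify_gap(length))
```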

Additional settings for advanced users

--min-break-length <int>

is the minimal break length (expressed as a proportion of k, the k-mer length) for a cause to be reported. Theoretically, for a given cause, the break length is always >= (kmer_length - 1). Otherwise, the break may be merged with a close enough break, or the break will be considered as undetermined (0.5 by default)

--max-bases-randomly-matched <int>

A k-mer overlapping an exon-exon junction, for example, may still match on the genome if the overlap is at the end of the read (without loss of generality). This is due to the fact that the nucleotides starting the second exon may be the same as the nucleotides starting the intron. Theoretically, there is a 0.25 probability that we have the same nucleotide at the first position of the intron and the exon. This option specifies how many nucleotides may be matched randomly at most
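
To give an order of magnitude for this setting, the 0.25 figure can be extended to several nucleotides. The Python sketch below assumes that the four nucleotides are equally likely and independent at each position, which is a simplification of real genomic sequence; the function name is hypothetical.

```python
def random_match_probability(n):
    """Probability that n consecutive nucleotides match by chance.

    ASSUMPTION: uniform and independent nucleotides at each position.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return 0.25 ** n


for n in range(1, 6):
    print(n, random_match_probability(n))
```

Under this assumption the probability drops quickly as n grows, which is why a small bound on the number of randomly matched nucleotides already covers most cases.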

--max-extension-length <int>

is the maximum number of k-mers extended at each side of a read break. In fact, for a given break, k-mers with false locations can generate false biological causes, so the consistency is checked for each side of the break to discard false k-mers and readjust the good boundaries of the break (10 by default)

--nb-tags-info-stored <int>

is the size of the buffer used to store information for each thread during the computing phase (1000 by default). This value must be increased if threads work below their real capabilities. For example, with --nb-threads 15, CPU usage should be about 1400%

--reads-index <string>

is the reads index data structure used by CRAC. Available reads indexes are JELLYFISH and GKARRAYS (JELLYFISH by default)

--nb-nucleotides-snp-comparison <int>

is the minimum k-mer length tolerated for the deep SNVs search (8 by default)

--max-number-of-merges <int>

is the maximum number of merges tolerated during the break merge process for the chimera detection (4 by default)

--min-score-chimera-stringent <float>

is the minimal score to consider a chimera event; otherwise it is classified as a bioundetermined event (0.6 by default)

AUTHOR

About the crac package.

You can contact Nicolas PHILIPPE, Mikael SALSON, Jerome AUDOUX and Alban MANCHERON by sending an e-mail to <crac-bugs@lists.gforge.inria.fr>.

Programming: Nicolas PHILIPPE <nphilippe.research@gmail.com>, Mikaël SALSON <mikael.salson@lifl.fr>, Jérome AUDOUX <jerome.audoux@gmail.com>
with an additional contribution to the packaging by:
Alban MANCHERON <alban.mancheron@lirmm.fr>

About the crac publication.

You may cite the following paper if you use our tool:

Gk-arrays: Querying large read collections in main memory: a versatile data structure.
Philippe N., Salson M., Lecroq T., Leonard M., Commes T., Rivals E.
BMC Bioinformatics 2011, 12:242.

Crac: An integrated RNA-Seq read analysis.
Philippe N., Salson M., Commes T., Rivals E.
Genome Biology 2013, 14:R30.

2018-09-13