Ray - assemble genomes in parallel using the message-passing
interface
mpiexec -n NUMBER_OF_RANKS Ray -k KMERLENGTH -p l1_1.fastq l1_2.fastq -p l2_1.fastq l2_2.fastq -o test
mpiexec -n NUMBER_OF_RANKS Ray Ray.conf # with commands in a file
The Ray genome assembler is built on top of RayPlatform, a generic
plugin-based distributed and parallel compute engine that communicates
through the message-passing interface (MPI).
Ray targets several applications:
- de novo genome assembly (with Ray vanilla)
- de novo meta-genome assembly (with Ray Meta)
- de novo transcriptome assembly (works, but not tested a lot)
- quantification of contig abundances
- quantification of microbiome consortia members (with Ray Communities)
- quantification of transcript expression
- taxonomy profiling of samples (with Ray Communities)
- gene ontology profiling of samples (with Ray Ontologies)
-help
- Displays this help page.
-version
- Displays Ray version and compilation options.
- Using a configuration file
- Ray can be launched with mpiexec -n 16 Ray Ray.conf. The
configuration file can include comments (starting with #).
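A minimal sketch of preparing such a file is given here. It assumes that the file holds the same options as the command line, one option per line; that layout, the helper names render_ray_conf and strip_comments, and the file names are illustrative assumptions, not documented behaviour.

```python
import shlex


def render_ray_conf(options):
    """Render (comment, option_string) pairs as configuration text."""
    lines = []
    for comment, option in options:
        if comment:
            lines.append("# " + comment)  # comments start with #
        lines.append(option)
    return "\n".join(lines) + "\n"


def strip_comments(text):
    """Return the option tokens of a configuration text, ignoring # comments."""
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            tokens.extend(shlex.split(line))
    return tokens


# Hypothetical content for Ray.conf, mirroring the command-line example.
conf_text = render_ray_conf([
    ("k-mer length (must be odd)", "-k 31"),
    ("first paired-end library", "-p l1_1.fastq l1_2.fastq"),
    ("output directory", "-o test"),
])
print(conf_text)
print(strip_comments(conf_text))
```

Writing conf_text to Ray.conf and launching with mpiexec -n 16 Ray Ray.conf would then correspond to the launch command above.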
- K-mer length
-k kmerLength
- Selects the length of k-mers. The default value is 21. It must be odd
because reverse-complement vertices are stored together. The maximum
length is defined at compilation by MAXKMERLENGTH. Larger k-mers utilise
more memory.
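The constraints above can be checked before launching a job. The sketch below is not Ray code: the function names are hypothetical, MAXKMERLENGTH is set to 32 as reported for this build, and treating that limit as inclusive is an assumption. It also illustrates one property of odd lengths that fits with storing a k-mer together with its reverse complement: with odd k, no k-mer is its own reverse complement.

```python
from itertools import product

MAXKMERLENGTH = 32  # compile-time limit reported by this build

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def is_valid_kmer_length(k, maximum=MAXKMERLENGTH):
    """Check that k is positive, odd and within the assumed inclusive limit."""
    return k > 0 and k % 2 == 1 and k <= maximum


def self_complementary_kmers(k):
    """List every k-mer that equals its own reverse complement."""
    kmers = ("".join(letters) for letters in product("ACGT", repeat=k))
    return [kmer for kmer in kmers if kmer == reverse_complement(kmer)]


print(is_valid_kmer_length(21))          # default value
print(is_valid_kmer_length(22))          # even lengths are rejected
print(len(self_complementary_kmers(3)))  # odd k: none exist
print(len(self_complementary_kmers(4)))  # even k: some exist
```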
- Inputs
-p leftSequenceFile rightSequenceFile [averageOuterDistance standardDeviation]
- Provides two files containing paired-end reads. averageOuterDistance and
standardDeviation are automatically computed if not provided.
-i interleavedSequenceFile [averageOuterDistance standardDeviation]
- Provides one file containing interleaved paired-end reads.
averageOuterDistance and standardDeviation are automatically computed if
not provided.
-s sequenceFile
- Provides a file containing single-end reads.
- Outputs
-o outputDirectory
- Specifies the directory for output files. The default is RayOutput.
- Assembly options (defaults work well)
-disable-recycling
- Disables read recycling during the assembly. Reads will be set free in 3
cases: 1. the distance did not match for a pair; 2. the read has not met
its mate; 3. the library population indicates a wrong placement. See
"Constrained traversal of repeats with paired sequences", Sebastien
Boisvert, Elenie Godzaridis, Francois Laviolette & Jacques Corbeil.
First Annual RECOMB Satellite Workshop on Massively Parallel Sequencing,
March 26-27 2011, Vancouver, BC, Canada.
-disable-scaffolder
- Disables the scaffolder.
-minimum-contig-length minimumContigLength
- Changes the minimum contig length. The default is 100 nucleotides.
-color-space
- Runs in color-space. Needs csfasta files. Activated automatically if
csfasta files are provided.
-use-maximum-seed-coverage maximumSeedCoverageDepth
- Ignores any seed with a coverage depth above this threshold. The default
is 4294967295.
-use-minimum-seed-coverage minimumSeedCoverageDepth
- Sets the minimum seed coverage depth. Any path with a coverage depth lower
than this will be discarded. The default is 0.
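The two thresholds together define an inclusive window of accepted coverage depths. The following sketch only restates that rule; keep_seed is a hypothetical helper, not part of Ray. The default maximum, 4294967295, equals 2**32 - 1.

```python
DEFAULT_MINIMUM_SEED_COVERAGE = 0
DEFAULT_MAXIMUM_SEED_COVERAGE = 4294967295  # equals 2**32 - 1


def keep_seed(coverage_depth,
              minimum=DEFAULT_MINIMUM_SEED_COVERAGE,
              maximum=DEFAULT_MAXIMUM_SEED_COVERAGE):
    """Keep a seed unless its depth is below the minimum or above the maximum."""
    return minimum <= coverage_depth <= maximum


print(keep_seed(50))              # defaults accept every depth in range
print(keep_seed(50, maximum=40))  # above the maximum: ignored
print(keep_seed(3, minimum=5))    # below the minimum: discarded
```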
- Distributed storage engine (all these values are for each MPI rank)
-bloom-filter-bits bits
- Sets the number of bits for the Bloom filter. The default is 268435456
bits; 0 bits disables the Bloom filter.
-hash-table-buckets buckets
- Sets the initial number of buckets. Must be a power of 2. Default value:
268435456.
-hash-table-buckets-per-group buckets
- Sets the number of buckets per group for sparse storage. Default value:
64. Must be >= 1 and <= 64.
-hash-table-load-factor-threshold threshold
- Sets the load factor threshold for real-time resizing. Default value:
0.75. Must be >= 0.5 and < 1.
-hash-table-verbosity
- Activates verbosity for the distributed storage engine
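The constraints on the storage engine options can be verified before a run. The sketch below is an illustration under stated assumptions: check_storage_options and bloom_filter_mebibytes are hypothetical helpers, and the memory figure counts only the raw bit array of the Bloom filter on each MPI rank.

```python
def is_power_of_two(n):
    """Return True when n is a positive power of 2."""
    return n > 0 and (n & (n - 1)) == 0


def check_storage_options(bloom_filter_bits=268435456,
                          hash_table_buckets=268435456,
                          buckets_per_group=64,
                          load_factor_threshold=0.75):
    """Return a list of violated constraints (empty when all values are valid)."""
    problems = []
    if bloom_filter_bits < 0:
        problems.append("-bloom-filter-bits must not be negative")
    if not is_power_of_two(hash_table_buckets):
        problems.append("-hash-table-buckets must be a power of 2")
    if not 1 <= buckets_per_group <= 64:
        problems.append("-hash-table-buckets-per-group must be >= 1 and <= 64")
    if not 0.5 <= load_factor_threshold < 1:
        problems.append("-hash-table-load-factor-threshold must be >= 0.5 and < 1")
    return problems


def bloom_filter_mebibytes(bits):
    """Size of the raw bit array in MiB (8 bits per byte, 2**20 bytes per MiB)."""
    return bits / 8 / 2**20


print(check_storage_options())            # defaults satisfy every constraint
print(bloom_filter_mebibytes(268435456))  # default Bloom filter, per rank
print(check_storage_options(hash_table_buckets=1000, load_factor_threshold=1.0))
```

With the default of 268435456 bits, the bit array alone takes 32 MiB on each rank.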
- Biological abundances
-search searchDirectory
- Provides a directory containing fasta files to be searched in the de
Bruijn graph. Biological abundances will be written to
RayOutput/BiologicalAbundances. See
Documentation/BiologicalAbundances.txt.
-one-color-per-file
- Sets one color per file instead of one per sequence. By default, each
sequence in each file has a different color. For files with large numbers
of sequences, using one single color per file may be more efficient.
- Taxonomic profiling with colored de Bruijn graphs
-with-taxonomy Genome-to-Taxon.tsv TreeOfLife-Edges.tsv Taxon-Names.tsv
- Provides a taxonomy. Computes and writes detailed taxonomic profiles. See
Documentation/Taxonomy.txt for details.
-gene-ontology OntologyTerms.txt Annotations.txt
- Provides an ontology and annotations. OntologyTerms.txt is fetched from
http://geneontology.org. Annotations.txt is a 2-column file (EMBL_CDS
handle & gene ontology identifier). See
Documentation/GeneOntology.txt.
- Other outputs
-enable-neighbourhoods
- Computes contig neighborhoods in the de Bruijn graph. Output file:
RayOutput/NeighbourhoodRelations.txt
-amos
- Writes the AMOS file called RayOutput/AMOS.afg. An AMOS file contains read
positions on contigs. It can be opened with software that has a graphical
user interface.
-write-kmers
- Writes the k-mer graph to RayOutput/kmers.txt. The resulting file is not
utilised by Ray and is very large.
-write-read-markers
- Writes read markers to disk.
-write-seeds
- Writes seed DNA sequences to
RayOutput/Rank<rank>.RaySeeds.fasta
-write-extensions
- Writes extension DNA sequences to
RayOutput/Rank<rank>.RayExtensions.fasta
-write-contig-paths
- Writes contig paths with coverage values to
RayOutput/Rank<rank>.RayContigPaths.txt
-write-marker-summary
- Writes marker statistics.
- Memory usage
-show-memory-usage
- Shows memory usage. Data is fetched from /proc on GNU/Linux. Needs
__linux__.
-show-memory-allocations
- Shows memory allocation events
- Algorithm verbosity
-show-extension-choice
- Shows the choice made (with other choices) during the extension.
-show-ending-context
- Shows the ending context of each extension. Shows the children of the
vertex where extension was too difficult.
-show-distance-summary
- Shows summary of outer distances used for an extension path.
-show-consensus
- Shows the consensus when a choice is made.
- Checkpointing
-write-checkpoints checkpointDirectory
- Write checkpoint files
-read-checkpoints checkpointDirectory
- Read checkpoint files
-read-write-checkpoints checkpointDirectory
- Read and write checkpoint files
- Message routing for large numbers of cores
-route-messages
- Enables the Ray message router. Disabled by default. Messages will be
routed so that any rank communicates directly with only a few others.
Without -route-messages, any rank can communicate directly with any other
rank. Files generated: Routing/Connections.txt, Routing/Routes.txt,
Routing/RelayEvents.txt and Routing/Summary.txt
-connection-type type
- Sets the connection type for routes. Accepted values are debruijn,
hypercube, polytope, group, random, kautz and complete. Default is
debruijn.
- debruijn: a full de Bruijn graph for a given alphabet and diameter
- hypercube: a hypercube, alphabet is {0,1} and the number of vertices is a
power of 2
- polytope: a convex regular polytope, alphabet is {0,1,...,B-1} and the
number of vertices is a power of B
- group: silly model where one representative per group can communicate
with outsiders
- random: Erdos-Renyi model
- kautz: a full Kautz graph, which is a subgraph of a de Bruijn graph
- complete: a full graph with all the possible connections
- With the type debruijn, the number of ranks must be a power of some
integer. Examples: 256 = 16*16, 512 = 8*8*8, 49 = 7*7, and so on.
Otherwise, use another connection type instead of debruijn.
- With the type kautz, the number of ranks n must be n=(k+1)*k^(d-1) for
some k and d.
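Whether a given number of ranks fits these connection types can be checked with a few lines of arithmetic. The sketch below is not taken from Ray: the function names are hypothetical, and it assumes that an exponent or diameter of at least 2 is wanted and that k is at least 2, because with an exponent or diameter of 1 every number of ranks would qualify trivially.

```python
def is_power_of_two(n):
    """hypercube: the number of vertices must be a power of 2."""
    return n > 0 and (n & (n - 1)) == 0


def perfect_powers(n):
    """debruijn: every (base, exponent) with base**exponent == n, exponent >= 2."""
    results = []
    base = 2
    while base * base <= n:
        value, exponent = base, 1
        while value < n:
            value *= base
            exponent += 1
        if value == n:
            results.append((base, exponent))
        base += 1
    return results


def kautz_parameters(n):
    """kautz: every (k, d) with n == (k + 1) * k**(d - 1), k >= 2 and d >= 2."""
    results = []
    k = 2
    while (k + 1) * k <= n:
        value, d = (k + 1) * k, 2
        while value < n:
            value *= k
            d += 1
        if value == n:
            results.append((k, d))
        k += 1
    return results


for ranks in (256, 512, 49, 48, 12):
    print(ranks, perfect_powers(ranks), kautz_parameters(ranks))
```

An empty list from perfect_powers means that the rank count is not suitable for debruijn under these assumptions; 48 is such a case, while 12 is not a perfect power but does satisfy the kautz formula.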
-routing-graph-degree degree
- Specifies the outgoing degree for the routing graph. See
Documentation/Routing.txt
- Hardware testing
-test-network-only
- Tests the network and returns.
-write-network-test-raw-data
- Writes one additional file per rank detailing the network test.
-exchanges NumberOfExchanges
- Sets the number of exchanges
-disable-network-test
- Skips the network test.
- Debugging
-verify-message-integrity
- Checks message data reliability for any non-empty message. Add '-D
CONFIG_SSE_4_2' in the Makefile to use the hardware instruction (SSE 4.2).
-run-profiler
- Runs the profiler as the code runs. By default, only shows granularity
warnings. Running the profiler increases running times.
-with-profiler-details
- Shows the number of messages sent and received in each method during each
time slice (epoch). Needs -run-profiler.
-show-communication-events
- Shows all messages sent and received.
-show-read-placement
- Shows read placement in the graph during the extension.
-debug-bubbles
- Debugs bubble code. Bubbles can be due to heterozygous sites or sequencing
errors or other (unknown) events
-debug-seeds
- Debugs seed code. Seeds are paths in the graph that are likely
unique.
-debug-fusions
- Debugs fusion code.
-debug-scaffolder
- Debugs the scaffolder.
FILES
- Input files
- Note: the file format is determined from the file extension.
- .fasta
- .fasta.gz (needs HAVE_LIBZ=y at compilation)
- .fasta.bz2 (needs HAVE_LIBBZ2=y at compilation)
- .fastq
- .fastq.gz (needs HAVE_LIBZ=y at compilation)
- .fastq.bz2 (needs HAVE_LIBBZ2=y at compilation)
- .sff (paired reads must be extracted manually)
- .csfasta (color-space reads)
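Because the format follows from the file extension, input names can be screened before a run. The sketch below simply encodes the list above as a lookup; describe_input is a hypothetical helper, and the case-sensitive suffix match is an assumption, not documented behaviour.

```python
# (suffix, description, compile-time requirement or None); longer suffixes first.
INPUT_FORMATS = [
    (".fasta.gz", "FASTA, gzip-compressed", "HAVE_LIBZ=y"),
    (".fasta.bz2", "FASTA, bzip2-compressed", "HAVE_LIBBZ2=y"),
    (".fastq.gz", "FASTQ, gzip-compressed", "HAVE_LIBZ=y"),
    (".fastq.bz2", "FASTQ, bzip2-compressed", "HAVE_LIBBZ2=y"),
    (".csfasta", "color-space reads", None),
    (".fasta", "FASTA", None),
    (".fastq", "FASTQ", None),
    (".sff", "SFF", None),
]


def describe_input(filename):
    """Return (description, requirement) for a listed extension, else None."""
    for suffix, description, requirement in INPUT_FORMATS:
        if filename.endswith(suffix):
            return description, requirement
    return None


for name in ("l1_1.fastq", "reads.fasta.gz", "run.csfasta", "notes.txt"):
    print(name, describe_input(name))
```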
- Outputted files
- Scaffolds
- RayOutput/Scaffolds.fasta
- The scaffold sequences in FASTA format
- RayOutput/ScaffoldComponents.txt
- The components of each scaffold
- RayOutput/ScaffoldLengths.txt
- The length of each scaffold
- RayOutput/ScaffoldLinks.txt
- Scaffold links
- Contigs
- RayOutput/Contigs.fasta
- Contiguous sequences in FASTA format
- RayOutput/ContigLengths.txt
- The lengths of contiguous sequences
- Summary
- RayOutput/OutputNumbers.txt
- Overall numbers for the assembly
- de Bruijn graph
- RayOutput/CoverageDistribution.txt
- The distribution of coverage values
- RayOutput/CoverageDistributionAnalysis.txt
- Analysis of the coverage distribution
- RayOutput/degreeDistribution.txt
- Distribution of ingoing and outgoing degrees
- RayOutput/kmers.txt
- k-mer graph, required option: -write-kmers
- The resulting file is not utilised by Ray. The resulting file is very
large.
- Assembly steps
- RayOutput/SeedLengthDistribution.txt
- Distribution of seed length
- RayOutput/Rank<rank>.OptimalReadMarkers.txt
- Read markers.
- RayOutput/Rank<rank>.RaySeeds.fasta
- Seed DNA sequences, required option: -write-seeds
- RayOutput/Rank<rank>.RayExtensions.fasta
- Extension DNA sequences, required option: -write-extensions
- RayOutput/Rank<rank>.RayContigPaths.txt
- Contig paths with coverage values, required option:
-write-contig-paths
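Files named Rank<rank> are written once per MPI rank. MPI numbers ranks from 0 to NUMBER_OF_RANKS - 1, so the expected names can be listed as in the sketch below. It assumes that <rank> is rendered as a plain decimal number; per_rank_files is a hypothetical helper.

```python
def per_rank_files(number_of_ranks, suffix, output_directory="RayOutput"):
    """List the expected per-rank file names for ranks 0 .. number_of_ranks - 1."""
    return ["%s/Rank%d.%s" % (output_directory, rank, suffix)
            for rank in range(number_of_ranks)]


for path in per_rank_files(4, "RaySeeds.fasta"):
    print(path)
```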
- Paired reads
- RayOutput/LibraryStatistics.txt
- Estimation of outer distances for paired reads
- RayOutput/Library<LibraryNumber>.txt
- Frequencies for observed outer distances (insert size + read lengths)
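The outer distance of a pair is the insert size plus the lengths of both reads. The sketch below shows how an average and a standard deviation could be derived from a table of observed frequencies. It uses the population standard deviation and made-up numbers; the estimator actually used by Ray may differ, and both function names are hypothetical.

```python
import math


def outer_distance(insert_size, left_read_length, right_read_length):
    """Outer distance = insert size + read lengths."""
    return insert_size + left_read_length + right_read_length


def summarize(frequencies):
    """Return (mean, standard deviation) for a {outer_distance: count} table."""
    total = sum(frequencies.values())
    mean = sum(d * c for d, c in frequencies.items()) / total
    variance = sum(c * (d - mean) ** 2 for d, c in frequencies.items()) / total
    return mean, math.sqrt(variance)


# Made-up frequencies for illustration only.
observed = {390: 1, 400: 2, 410: 1}
print(outer_distance(200, 100, 100))
print(summarize(observed))
```

Values obtained this way have the same meaning as the optional averageOuterDistance and standardDeviation arguments of -p and -i.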
- Partition
- RayOutput/NumberOfSequences.txt
- Number of reads in each file
- RayOutput/SequencePartition.txt
- Sequence partition
- Ray software
- RayOutput/RayVersion.txt
- The version of Ray
- RayOutput/RayCommand.txt
- The exact command that was provided
- AMOS
- RayOutput/AMOS.afg
- Assembly representation in AMOS format, required option: -amos
- Communication
- RayOutput/MessagePassingInterface.txt
- Number of messages sent
- RayOutput/NetworkTest.txt
- Latencies in microseconds
- RayOutput/Rank<rank>NetworkTestData.txt
- Network test raw data
DOCUMENTATION
- mpiexec -n 1 Ray -help|less (always up-to-date)
- This help page (always up-to-date)
- The directory Documentation/
- Manual (Portable Document Format): InstructionManual.tex (in Documentation)
- Mailing list archives:
http://sourceforge.net/mailarchive/forum.php?forum_name=denovoassembler-users
AUTHOR
- Written by Sebastien Boisvert.
REPORTING BUGS
- Report bugs to denovoassembler-users@lists.sourceforge.net.
- Home page: <http://denovoassembler.sourceforge.net/>
COPYRIGHT
- This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation, version 3 of the License.
- This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
- You have received a copy of the GNU General Public License along with this
program (see LICENSE).
Ray 2.1.0
License for Ray: GNU General Public License version 3
RayPlatform version: 1.1.0
License for RayPlatform: GNU Lesser General Public License version 3
MAXKMERLENGTH: 32
KMER_U64_ARRAY_SIZE: 1
Maximum coverage depth stored by CoverageDepth: 4294967295
MAXIMUM_MESSAGE_SIZE_IN_BYTES: 4000 bytes
FORCE_PACKING = n
ASSERT = n
HAVE_LIBZ = y
HAVE_LIBBZ2 = y
CONFIG_PROFILER_COLLECT = n
CONFIG_CLOCK_GETTIME = n
__linux__ = y
_MSC_VER = n
__GNUC__ = y
RAY_32_BITS = n
RAY_64_BITS = y
MPI standard version: MPI 2.1
MPI library: Open-MPI 1.4.2
Compiler: GNU gcc/g++ 4.4.5