soapdenovo(1) | USER COMMANDS | soapdenovo(1) |
soapdenovo - Short-read assembly method that can build a de novo draft assembly
soapdenovo_31mer soapdenovo_63mer soapdenovo_127mer
SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
1) Support large kmer up to 127 to utilize long reads. Three
versions are provided.
I. The 31mer version supports kmer <=31 only.
II. The 63mer version supports kmer <=63 only and consumes twice the
memory of the 31mer version, even when used with kmer <=31.
III. The 127mer version supports kmer <=127 only and consumes twice the
memory of the 63mer version, even when used with kmer <=63.
Please note that with a longer kmer the number of nodes decreases significantly, so memory consumption usually less than doubles when moving to the next larger version.
2) New parameter -a added in the "pregraph" module. This parameter sets the initial memory allocation to avoid further reallocation. The unit of the parameter is GB. Without further reallocation, SOAPdenovo runs faster and can use all of the memory on the machine. For example, if the workstation has 50 GB of free memory, use -a 50 in the pregraph step; a static 50 GB of memory is then allocated before the reads are processed. This also keeps the run from being interrupted by other users sharing the same machine.
3) Gap-filled bases are now represented by lowercase characters in the 'scafSeq' file.
4) SIMD instructions were introduced to boost performance.
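The choice among the three versions in item 1 can be scripted. The following is a minimal sketch, assuming the three executables named in the synopsis are installed under those names; the function pick_soapdenovo_binary is hypothetical and not part of SOAPdenovo. It picks the smallest version that supports the requested kmer, since each larger version doubles memory consumption.

```shell
# Hypothetical helper (not part of SOAPdenovo): print the name of the
# smallest-memory version that supports the requested kmer size.
# Argument: kmer size.
pick_soapdenovo_binary() {
    if [ "$1" -le 31 ]; then
        echo soapdenovo_31mer
    elif [ "$1" -le 63 ]; then
        echo soapdenovo_63mer
    elif [ "$1" -le 127 ]; then
        echo soapdenovo_127mer
    else
        echo "kmer size $1 exceeds the supported maximum of 127" >&2
        return 1
    fi
}

pick_soapdenovo_binary 45    # prints soapdenovo_63mer
```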
For big genome projects with deep sequencing, the data is usually organized as multiple read sequence files generated from multiple libraries. The configuration file tells the assembler where to find these files and the relevant information. “example.config” is an example of such a file.
The configuration file has a section for global information, and then multiple library sections. Right now only “max_rd_len” is included in the global information section. Any read longer than max_rd_len will be cut to this length.
The library information and the information of sequencing data generated from the library should be organized in the corresponding library section. Each library section starts with tag [LIB] and includes the following items:
The assembler accepts read files in two formats: FASTA or FASTQ. The mate-pair relationship can be indicated in two ways: two sequence files with reads in the same order belonging to a pair, or two adjacent reads in a single file (FASTA only) belonging to a pair.
In the configuration file, single-end files are indicated by “f=/path/filename” or “q=/path/filename” for FASTA or FASTQ format respectively. Paired reads in two FASTA sequence files are indicated by “f1=” and “f2=”, while paired reads in two FASTQ sequence files are indicated by “q1=” and “q2=”. Paired reads in a single FASTA sequence file are indicated by the “p=” item.
All the above items in each library section are optional. The assembler assigns default values for most of them. If you are not sure how to set a parameter, you can remove it from your configuration file.
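Because every library item is optional, a configuration for one paired-end FASTQ library can be very short. The following is a minimal sketch, not a recommended setup: the function print_minimal_config is hypothetical, and the values (max_rd_len=50, avg_ins=200 and so on) are copied from the first library of example.config as placeholders that should be replaced with values matching the real data.

```shell
# Hypothetical helper: print a one-library configuration to standard output.
# Arguments: FASTQ file for read 1, FASTQ file for read 2.
# All values are placeholders taken from example.config.
print_minimal_config() {
    cat <<EOF
max_rd_len=50
[LIB]
avg_ins=200
reverse_seq=0
asm_flags=3
rank=1
q1=$1
q2=$2
EOF
}

print_minimal_config /path/reads_1.fq /path/reads_2.fq
```

Redirecting the output to a file gives something that can be passed to the assembler with -s.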
Once the configuration file is available, a typical way to run the assembler is:
${bin} all -s config_file -K 63 -R -o graph_prefix
Users can also choose to run the assembly process step by step:
${bin} pregraph -s config_file -K 63 [-R -d -p -a] -o graph_prefix
${bin} contig -g graph_prefix [-R -M 1 -D]
${bin} map -s config_file -g graph_prefix [-p]
${bin} scaff -g graph_prefix [-F -u -G -p]
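The four steps share the same configuration file and graph prefix, so they are easy to generate from one place. The following is a minimal dry-run sketch; the function print_assembly_steps is hypothetical. It only prints the commands, uses only the required options of each step, and leaves out the bracketed optional flags.

```shell
# Hypothetical dry-run helper: print the four step-by-step commands
# without executing them. Optional flags are deliberately omitted.
# Arguments: binary, configuration file, kmer size, graph prefix.
print_assembly_steps() {
    echo "$1 pregraph -s $2 -K $3 -o $4"
    echo "$1 contig -g $4"
    echo "$1 map -s $2 -g $4"
    echo "$1 scaff -g $4"
}

print_assembly_steps soapdenovo_63mer example.config 63 graph_prefix
```

To run the steps rather than print them, the output can be piped to "sh -e", which stops at the first step that fails.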
These files are output as assembly results:
There are some other files that provide useful information for advanced users, which are listed in Appendix B.
The 31mer version accepts odd K values between 13 and 31; the 63mer and 127mer versions raise the upper limit to 63 and 127. Larger K-mers are more likely to be unique in the genome and make the graph simpler, but they require greater sequencing depth and longer reads to guarantee overlap at every genomic location.
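The constraint on K can be checked before a long run is started. The following is a minimal sketch; the function is_valid_kmer is hypothetical, and it assumes that the lower bound of 13 and the odd-number rule also apply to the 63mer and 127mer versions, which this page does not state explicitly.

```shell
# Hypothetical pre-flight check: succeed only if K is an odd number between
# 13 and the maximum supported by the chosen version (31, 63 or 127).
# Assumption: the lower bound of 13 holds for all three versions.
# Arguments: kmer size, maximum kmer size of the version in use.
is_valid_kmer() {
    [ "$1" -ge 13 ] && [ "$1" -le "$2" ] && [ $(( $1 % 2 )) -eq 1 ]
}

if is_valid_kmer 63 63; then
    echo "K=63 is accepted by the 63mer version"
fi
```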
SOAPdenovo uses the paired-end libraries in order of insert size, from smallest to largest, to construct scaffolds. Libraries with the same rank are used at the same time. For example, in a dataset of a human genome, we set five ranks for five libraries with insert sizes of 200 bp, 500 bp, 2 kb, 5 kb and 10 kb respectively. The pairs in each rank should provide adequate physical coverage of the genome.
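Rank values that follow this ordering can be derived from the insert sizes. The following is a minimal sketch; the function assign_ranks is hypothetical, and giving libraries with identical insert size the same rank is a design choice made here for illustration, not a rule stated by SOAPdenovo.

```shell
# Hypothetical helper: read one insert size (in bp) per line on standard
# input and print "avg_ins=<size> rank=<n>" lines, smallest insert first.
# Assumption: libraries with equal insert size share a rank.
assign_ranks() {
    sort -n | awk '
        NR == 1 || $1 != prev { rank++; prev = $1 }
        { print "avg_ins=" $1 " rank=" rank }
    '
}

# The five libraries from the human genome example, in arbitrary order.
printf '%s\n' 2000 200 10000 500 5000 | assign_ranks
```

For the five example libraries this assigns ranks 1 to 5 to the 200 bp, 500 bp, 2 kb, 5 kb and 10 kb libraries in that order.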
#maximal read length
max_rd_len=50
[LIB]
#average insert size
avg_ins=200
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 50 bps of each read
rd_len_cutoff=50
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (default 3)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (default 32)
map_len=32
#fastq file for read 1
q1=/path/**LIBNAMEA**/fastq_read_1.fq
#fastq file for read 2 always follows fastq file for read 1
q2=/path/**LIBNAMEA**/fastq_read_2.fq
#fasta file for read 1
f1=/path/**LIBNAMEA**/fasta_read_1.fa
#fasta file for read 2 always follows fasta file for read 1
f2=/path/**LIBNAMEA**/fasta_read_2.fa
#fastq file for single reads
q=/path/**LIBNAMEA**/fastq_read_single.fq
#fasta file for single reads
f=/path/**LIBNAMEA**/fasta_read_single.fa
#a single fasta file for paired reads
p=/path/**LIBNAMEA**/pairs_in_one_file.fa
[LIB]
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
# cutoff of pair number for a reliable connection
#(default 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location
#(default 35 for large insert size)
map_len=35
q1=/path/**LIBNAMEB**/fastq_read_1.fq
q2=/path/**LIBNAMEB**/fastq_read_2.fq
q=/path/**LIBNAMEB**/fastq_read_single.fq
f=/path/**LIBNAMEB**/fasta_read_single.fa
1. Output files from the command “pregraph”
2. Output files from the command “contig”
3. Output files from the command “map”
4. Output files from the command “scaff”
Olivier Sallou (olivier.sallou (at) irisa.fr) - Man page and packaging
July 30, 2012 | version 1.1.0 |