art_illumina - Simulation of Illumina sequencers
ART is a set of simulation tools to generate synthetic
next-generation sequencing reads. ART simulates sequencing reads by
mimicking real sequencing process with empirical error models or quality
profiles summarized from large recalibrated sequencing data.
art_illumina can be used for Simulation of Illumina sequencers
art_illumina [options] -sam -i
<seq_ref_file> -l <read_length> -f
<fold_coverage> -ss <sequencing_system> -o
<outfile_prefix>
art_illumina [options] -sam -i
<seq_ref_file> -l <read_length> -f
<fold_coverage> -o <outfile_prefix>
art_illumina [options] -sam -i
<seq_ref_file> -l <read_length> -c
<total_num_reads> -o <outfile_prefix>
art_illumina [options] -sam -i
<seq_ref_file> -l <read_length> -f
<fold_coverage> -m <mean_fragsize> -s
<std_fragsize> -o <outfile_prefix>
art_illumina [options] -sam -i
<seq_ref_file> -l <read_length> -c
<total_num_reads> -m <mean_fragsize> -s
<std_fragsize> -o <outfile_prefix>
- -1 --qprof1
- the first-read quality profile
- -2 --qprof2
- the second-read quality profile
-amp --amplicon amplicon sequencing
simulation
- -c --rcount
- total number of reads/read pairs to be generated [per amplicon if for
amplicon simulation](not be used together with -f/--fcov)
- -d --id
- the prefix identification tag for read ID
- -ef
--errfree
- indicate to generate the zero sequencing errors SAM file as well the
regular one
- NOTE: the reads in the zero-error SAM file have the same alignment
positions as those in the regular SAM file, but have no sequencing
errors
- -f --fcov
- the fold of read coverage to be simulated or number of reads/read pairs
generated for each amplicon
- -h --help
- print out usage information
- -i --in
- the filename of input DNA/RNA reference
- -ir
--insRate
- the first-read insertion rate (default: 0.00009)
-ir2 --insRate2 the second-read insertion rate
(default: 0.00015)
- -dr
--delRate
- the first-read deletion rate (default: 0.00011)
-dr2 --delRate2 the second-read deletion rate
(default: 0.00023)
- -l --len
- the length of reads to be simulated
- -m --mflen
- the mean size of DNA/RNA fragments for paired-end simulations
-mp --matepair indicate a mate-pair read
simulation
- -nf --maskN
- the cutoff frequency of 'N' in a window size of the read length for
masking genomic regions
- NOTE: default: '-nf 1' to mask all regions with 'N'. Use '-nf 0' to turn
off masking
- -na --noALN
- do not output ALN alignment file
- -o --out
- the prefix of output filename
- -p --paired
- indicate a paired-end read simulation or to generate reads from both ends
of amplicons
- NOTE: art will automatically switch to a mate-pair simulation if the given
mean fragment size >= 2000
- -q --quiet
- turn off end of run summary
- -qs
--qShift
- the amount to shift every first-read quality score by
- -qs2
--qShift2
- the amount to shift every second-read quality score by
- NOTE: For -qs/-qs2 option, a positive number will shift up quality
scores (the max is 93) that reduce substitution sequencing errors and a
negative number will shift down quality scores that increase sequencing
errors. If shifting scores by x, the error rate will be 1/(10^(x/10)) of
the default profile.
- -rs
--rndSeed
- the seed for random number generator (default: system time in second)
- NOTE: using a fixed seed to generate two identical datasets from different
runs
- -s --sdev
- the standard deviation of DNA/RNA fragment size for paired-end
simulations.
- -sam
--samout
- indicate to generate SAM alignment file
- -sp
--sepProf
- indicate to use separate quality profiles for different bases (ATGC)
- -ss
--seqSys
- The name of Illumina sequencing system of the built-in profile used for
simulation
- NOTE: sequencing system id names are:
- GA1 - Genome Analyzer I, GA2 -
Genome Analyzer II
- HS10 - HiSeq 1000, HS20 -
HiSeq 2000, HS25 - HiSeq 2500, MS - MiSeq
- -M --cigarM
- indicate to use CIGAR 'M' instead of '=/X' for alignment
match/mismatch
* ART by default selects a built-in quality score profile
according to the read length specified for the run.
* For single-end simulation, ART requires input sequence file,
outputfile prefix, read length, and read count/fold coverage.
* For paired-end simulation (except for amplicon sequencing), ART
also requires the parameter values of
- the mean and standard deviation of DNA/RNA fragment lengths
- 1) single-end read simulation
- art_illumina -sam -i reference.fa -l 150 -ss
HS25 -f 10 -o single_dat
- 2) paired-end read simulation
- art_illumina -sam -i reference.fa -p -l 150
-ss HS25 -f 20 -m 200 -s 10 -o
paired_dat
- 3) mate-pair read simulation
- art_illumina -sam -i reference.fa -mp -l 50
-f 20 -m 2500 -s 50 -o matepair_dat
- 4) amplicon sequencing simulation with 5' end single-end reads
- art_illumina -amp -sam -na -i amp_reference.fa
-l 50 -f 10 -o amplicon_5end_dat
- 5) amplicon sequencing simulation with paired-end reads
- art_illumina -amp -p -sam -na -i
amp_reference.fa -l 50 -f 10 -o
amplicon_pair_dat
- 6) amplicon sequencing simulation with matepair reads
- art_illumina -amp -mp -sam -na -i
amp_reference.fa -l 50 -f 10 -o
amplicon_mate_dat
- 7) generate an extra SAM file with zero-sequencing errors for a paired-end
read simulation
- art_illumina -ef -i reference.fa -p -l 50
-f 20 -m 200 -s 10 -o paired_twosam_dat
- 8) reduce the substitution error rate to one 10th of the default
profile
- art_illumina -i reference.fa -qs 10 -qs2 10 -l
50 -f 10 -p -m 500 -s 10 -sam -o
reduce_error
- 9) turn off the masking of genomic regions with unknown nucleotides
'N'
- art_illumina -nf 0 -sam -i reference.fa -p
-l 50 -f 20 -m 200 -s 10 -o
paired_nomask
- 10) masking genomic regions with >=5 'N's within the read length
50
- art_illumina -nf 5 -sam -i reference.fa -p
-l 50 -f 20 -m 200 -s 10 -o
paired_maskN5
This manpage was written by Andreas Tille for the Debian
distribution and can be used for any other usage of the program.