artfastqgenerator - outputs artificial FASTQ files derived from a
reference genome
artfastqgenerator -O <outputPath> -R
<referenceGenomePath> -S <startSequenceIdentifier>
-F1 <fastq1ForQualityScores> -F2
<fastq2ForQualityScores> -CMGCS
<coverageMeanGCcontentSpread> -CMP <coverageMeanPeak>
-CMPGC <coverageMeanPeakGCcontent> -CSD
<coverageSD> -E <endSequenceIdentifier> -GCC
<GCcontentBasedCoverage> -GCR <GCcontentRegionSize>
-L <logRegionStats> -N <nucleobaseBufferSize>
-OF <outputFormat> -RCNF <readsContainingNfilter>
-RL <readLength> -SE <simulateErrorInRead>
-TLM <templateLengthMean> -TLSD <templateLengthSD>
-URQS <useRealQualityScores> -X <xStart> -Y
<yStart>
ArtificialFastqGenerator takes the reference genome (in FASTA
format) as input and outputs artificial FASTQ files in the Sanger format. It
can accept Phred base quality scores from existing FASTQ files, and use them
to simulate sequencing errors. Since the artificial FASTQs are derived from
the reference genome, the reference genome provides a gold-standard for
calling variants (Single Nucleotide Polymorphisms (SNPs) and insertions and
deletions (indels)). This enables evaluation of a Next Generation Sequencing
(NGS) analysis pipeline which aligns reads to the reference genome and then
calls the variants.
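The relationship between Sanger-format quality characters, Phred scores, and error probabilities can be sketched as below. This is only an illustration of the standard Phred definition (a score Q is stored as the ASCII character with code Q + 33 and implies an error probability of 10^(-Q/10)). How artfastqgenerator applies these probabilities internally is not described here, and the function names are hypothetical.

```python
def decode_sanger_quality(quality_string):
    """Convert a Sanger-format quality string to a list of Phred scores."""
    # Sanger FASTQ stores each score Q as the ASCII character with code Q + 33.
    return [ord(ch) - 33 for ch in quality_string]


def phred_to_error_probability(q):
    """Return the base-call error probability implied by Phred score q."""
    return 10 ** (-q / 10)


# Hypothetical three-base quality string, used only for illustration.
scores = decode_sanger_quality("I5+")
for q in scores:
    print(q, phred_to_error_probability(q))
```

A higher score means a lower chance that the base call is wrong, so reads built from high-quality score strings would be expected to receive fewer simulated errors when -SE is enabled.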
- -h
- Print usage help.
- -O,
<outputPath>
- Path for the artificial fastq and log files, including their base name
(must be specified).
- -R,
<referenceGenomePath>
- Reference genome sequence file in FASTA format (must be specified).
- -S,
<startSequenceIdentifier>
- Prefix of the sequence identifier in the reference after which read
generation should begin (must be specified).
- -F1,
<fastq1ForQualityScores>
- First fastq file to use for real quality scores (must be specified if
useRealQualityScores = true).
- -F2,
<fastq2ForQualityScores>
- Second fastq file to use for real quality scores (must be specified if
useRealQualityScores = true).
- -CMGCS,
<coverageMeanGCcontentSpread>
- The spread of coverage mean given GC content (default = 0.22).
- -CMP,
<coverageMeanPeak>
- The peak coverage mean for a region (default = 37.7).
- -CMPGC,
<coverageMeanPeakGCcontent>
- The GC content for regions with peak coverage mean (default = 0.45).
- -CSD,
<coverageSD>
- The coverage standard deviation divided by the mean (default = 0.2).
- -E,
<endSequenceIdentifier>
- Prefix of the sequence identifier in the reference where read generation
should stop (default = end of file).
- -GCC,
<GCcontentBasedCoverage>
- Whether nucleobase coverage is biased by GC content (default = true).
- -GCR,
<GCcontentRegionSize>
- Region size in nucleobases for which to calculate GC content (default =
150).
- -L,
<logRegionStats>
- The region size, as a multiple of the nucleobase buffer size (-N), for
which summary coverage statistics are recorded (default = 2).
- -N,
<nucleobaseBufferSize>
- The number of reference sequence nucleobases to buffer in memory (default
= 5000).
- -OF,
<outputFormat>
- 'default': standard fastq output; 'debug_nucleobases(_nuc|read_ids)':
debugging.
- -RCNF,
<readsContainingNfilter>
- Which N-containing reads to filter out: none (0), "all-N" reads (1), or
"at-least-1-N" reads (2) (default = 0).
- -RL,
<readLength>
- The length of each read (default = 76).
- -SE,
<simulateErrorInRead>
- Whether to simulate errors in the read based on the quality scores
(default = false).
- -TLM,
<templateLengthMean>
- The mean DNA template length (default = 210).
- -TLSD,
<templateLengthSD>
- The standard deviation of the DNA template length (default = 60).
- -URQS,
<useRealQualityScores>
- Whether to use real quality scores from existing fastq files or set all
scores to the maximum (default = false).
- -X, <xStart>
- The first read's X coordinate (default = 1000).
- -Y, <yStart>
- The first read's Y coordinate (default = 1000).
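The options above can be combined into a command line. The following is a minimal sketch that only assembles and prints the argument list; it does not run the program. All file paths and the start identifier are hypothetical placeholders, `build_command` is an illustrative helper, and it is an assumption that boolean options take the literal values `true` and `false`.

```python
import shlex


def build_command(output_path, reference, start_id,
                  fastq1=None, fastq2=None, read_length=76):
    """Assemble an artfastqgenerator argument list from documented options."""
    argv = [
        "artfastqgenerator",
        "-O", output_path,        # base path for the fastq and log files
        "-R", reference,          # reference genome sequence file
        "-S", start_id,           # identifier prefix where generation begins
        "-RL", str(read_length),  # read length (documented default is 76)
    ]
    if fastq1 and fastq2:
        # Assumption: boolean options are passed as the strings true/false.
        argv += ["-F1", fastq1, "-F2", fastq2, "-URQS", "true", "-SE", "true"]
    return argv


# Hypothetical inputs; substitute real paths and an identifier prefix
# that matches a sequence header in your own reference file.
argv = build_command("out/sim", "ref.fasta", "chr1",
                     "real_1.fastq", "real_2.fastq")
print(shlex.join(argv))
```

Keeping the arguments in a list rather than a single string avoids quoting problems if a path contains spaces. The same list could be handed to `subprocess.run` once the printed command looks right.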
Any bugs should be reported to Matthew.Frampton@icr.ac.uk
This manpage was written by Andreas Tille for the Debian
distribution and can be used for any other usage of the program.