NAME

artfastqgenerator - outputs artificial FASTQ files derived from a reference genome

SYNOPSIS

artfastqgenerator -O <outputPath> -R <referenceGenomePath> -S <startSequenceIdentifier> -F1 <fastq1ForQualityScores> -F2 <fastq2ForQualityScores> -CMGCS <coverageMeanGCcontentSpread> -CMP <coverageMeanPeak> -CMPGC <coverageMeanPeakGCcontent> -CSD <coverageSD> -E <endSequenceIdentifier> -GCC <GCcontentBasedCoverage> -GCR <GCcontentRegionSize> -L <logRegionStats> -N <nucleobaseBufferSize> -OF <outputFormat> -RCNF <readsContainingNfilter> -RL <readLength> -SE <simulateErrorInRead> -TLM <templateLengthMean> -TLSD <templateLengthSD> -URQS <useRealQualityScores> -X <xStart> -Y <yStart>

DESCRIPTION

ArtificialFastqGenerator takes a reference genome in FASTA format as input and outputs artificial FASTQ files in the Sanger format. It can also take Phred base quality scores from existing FASTQ files and use them to simulate sequencing errors. Because the artificial FASTQs are derived from the reference genome, the reference genome serves as a gold standard for calling variants: Single Nucleotide Polymorphisms (SNPs) and insertions and deletions (indels). This enables evaluation of a Next Generation Sequencing (NGS) analysis pipeline that aligns reads to the reference genome and then calls variants.
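
To illustrate how Phred scores in a Sanger-format quality string relate to sequencing error, here is a minimal Python sketch. It relies only on two standard facts: Sanger FASTQ encodes a Phred score Q as the ASCII character with code Q + 33, and Q corresponds to an error probability of 10^(-Q/10). The helper names and the uniform substitution model in simulate_errors are illustrative assumptions, not the program's actual implementation.

```python
import random

SANGER_OFFSET = 33  # Sanger FASTQ stores Phred score Q as chr(Q + 33)


def decode_sanger(quality_string):
    """Convert a Sanger-encoded quality string to a list of Phred scores."""
    return [ord(c) - SANGER_OFFSET for c in quality_string]


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def simulate_errors(read, quality_string, rng):
    """Hypothetical helper: substitute each A/C/G/T base with a different
    base, with probability given by that base's Phred score.
    This is an assumed error model, not the tool's own."""
    out = []
    for base, q in zip(read, decode_sanger(quality_string)):
        if base in "ACGT" and rng.random() < phred_to_error_prob(q):
            # Pick uniformly among the three other bases (assumption).
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    print(decode_sanger("I5!"))
    print(simulate_errors("ACGTACGT", "IIII!!!!", rng))
```

Under this sketch, a base with quality character "I" (Q = 40) is altered about once in 10,000 reads, while a base with "!" (Q = 0) is always altered.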

OPTIONS

-h: Print usage help.
-O, <outputPath>: Path for the artificial fastq and log files, including their base name (must be specified).
-R, <referenceGenomePath>: Reference genome sequence file (must be specified).
-S, <startSequenceIdentifier>: Prefix of the sequence identifier in the reference after which read generation should begin (must be specified).
-F1, <fastq1ForQualityScores>: First fastq file to use for real quality scores (must be specified if useRealQualityScores = true).
-F2, <fastq2ForQualityScores>: Second fastq file to use for real quality scores (must be specified if useRealQualityScores = true).
-CMGCS, <coverageMeanGCcontentSpread>: The spread of coverage mean given GC content (default = 0.22).
-CMP, <coverageMeanPeak>: The peak coverage mean for a region (default = 37.7).
-CMPGC, <coverageMeanPeakGCcontent>: The GC content for regions with peak coverage mean (default = 0.45).
-CSD, <coverageSD>: The coverage standard deviation divided by the mean (default = 0.2).
-E, <endSequenceIdentifier>: Prefix of the sequence identifier in the reference where read generation should stop (default = end of file).
-GCC, <GCcontentBasedCoverage>: Whether nucleobase coverage is biased by GC content (default = true).
-GCR, <GCcontentRegionSize>: Region size in nucleobases for which to calculate GC content (default = 150).
-L, <logRegionStats>: The region size, as a multiple of <nucleobaseBufferSize>, for which summary coverage statistics are recorded (default = 2).
-N, <nucleobaseBufferSize>: The number of reference sequence nucleobases to buffer in memory (default = 5000).
-OF, <outputFormat>: 'default': standard fastq output; 'debug_nucleobases(_nuc|read_ids)': debugging.
-RCNF, <readsContainingNfilter>: Which N-containing reads to filter out: none (0), "all-N" reads (1), or "at-least-1-N" reads (2) (default = 0).
-RL, <readLength>: The length of each read (default = 76).
-SE, <simulateErrorInRead>: Whether to simulate error in the read based on the quality scores (default = false).
-TLM, <templateLengthMean>: The mean DNA template length (default = 210).
-TLSD, <templateLengthSD>: The standard deviation of the DNA template length (default = 60).
-URQS, <useRealQualityScores>: Whether to use real quality scores from existing fastq files or set all to the maximum (default = false).
-X, <xStart>: The first read's X coordinate (default = 1000).
-Y, <yStart>: The first read's Y coordinate (default = 1000).
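
EXAMPLES

The following Python sketch shows one way to assemble a command line from the options documented above, for example when scripting several runs. The build_command helper, all file paths, and the sequence identifier ">chr1" are hypothetical; the sketch also assumes that boolean options are written as "true" or "false", as in the defaults listed above. It only builds and prints the command and does not run the program.

```python
import shlex


def build_command(output_path, reference_path, start_id, **options):
    """Hypothetical helper: build an artfastqgenerator argument list.

    The three positional arguments map to the mandatory -O, -R and -S
    options; keyword arguments map to the remaining documented flags
    (e.g. RL=100 becomes "-RL 100").
    """
    cmd = ["artfastqgenerator", "-O", output_path, "-R", reference_path, "-S", start_id]
    for flag, value in options.items():
        # Assumption: booleans are passed as lowercase "true"/"false".
        text = str(value).lower() if isinstance(value, bool) else str(value)
        cmd += ["-" + flag, text]
    return cmd


cmd = build_command(
    "out/sample",        # hypothetical output path and base name
    "ref/genome.fasta",  # hypothetical reference genome
    ">chr1",             # hypothetical identifier, assumed to match the FASTA header
    RL=100,
    TLM=300,
    TLSD=30,
    URQS=True,
    F1="real_1.fastq",   # hypothetical; needed because URQS is true
    F2="real_2.fastq",   # hypothetical; needed because URQS is true
    SE=True,
)

# shlex.join quotes arguments such as ">chr1" so the line is safe to paste into a shell.
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run would invoke the program; building a list rather than a single string avoids shell quoting problems with identifiers that contain characters such as ">".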

BUGS

Any bugs should be reported to Matthew.Frampton@icr.ac.uk

AUTHOR

This manpage was written by Andreas Tille for the Debian distribution and may be used for any other purpose related to the program.

February 2016

artfastqgenerator 0.0.20150519