pilon - automated genome assembly improvement and variant
detection tool
pilon --genome genome.fasta [--frags frags.bam]
[--jumps jumps.bam] [--unpaired unpaired.bam] [...other options...]
Pilon is a software tool which can be used to:
- Automatically improve draft assemblies
- Find variation among strains, including large event detection
Pilon requires as input a FASTA file of the genome along with one
or more BAM files of reads aligned to the input FASTA file. Pilon uses read
alignment analysis to identify inconsistencies between the input genome and
the evidence in the reads. It then attempts to make improvements to the
input genome, including:
- Single base differences
- Small indels
- Larger indel or block substitution events
- Gap filling
- Identification of local misassemblies, including optional opening of new
gaps
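The inputs described above (one genome FASTA plus one or more BAM files) map directly onto the command line shown in the synopsis. The following is a minimal sketch of assembling such a command in Python. The helper name build_pilon_command and the output prefix "improved" are hypothetical; the sketch only builds and prints the argument list and does not run Pilon.

```python
import shlex


def build_pilon_command(genome, frags=(), jumps=(), unpaired=(),
                        output=None, extra=()):
    """Build an argv list for a Pilon run (hypothetical helper).

    Each BAM option may be given more than once, so every path in
    frags/jumps/unpaired gets its own flag.
    """
    argv = ["pilon", "--genome", genome]
    for flag, paths in (("--frags", frags),
                        ("--jumps", jumps),
                        ("--unpaired", unpaired)):
        for path in paths:
            argv += [flag, path]
    if output is not None:
        argv += ["--output", output]
    # Any further switches (for example --changes or --vcf) pass through as-is.
    argv += list(extra)
    return argv


cmd = build_pilon_command("genome.fasta", frags=["frags.bam"],
                          output="improved", extra=["--changes"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a system where the pilon executable is installed.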
- --help
- Print details of all options.
- --genome genome.fasta
- The input genome we are trying to improve, which must be the reference
used for the bam alignments. At least one of --frags or
--jumps must also be given.
- --frags frags.bam
- A bam file consisting of fragment paired-end alignments, aligned to the
--genome argument using bwa or bowtie2. This argument may be
specified more than once.
- --jumps jumps.bam
- A bam file consisting of jump (mate pair) paired-end alignments, aligned
to the --genome argument using bwa or bowtie2. This argument may be
specified more than once.
- --unpaired unpaired.bam
- A bam file consisting of unpaired alignments, aligned to the
--genome argument using bwa or bowtie2. This argument may be
specified more than once.
- --bam any.bam
- A bam file of unknown type; Pilon will scan it and attempt to classify it
as one of the above bam types.
- --output prefix
- Prefix for output files
- --outdir directory
- Use this directory for all output files.
- --changes
- If specified, a file listing changes in the <output>.fasta will be
generated.
- --vcf
- If specified, a vcf file will be generated
- --vcfqe
- If specified, the VCF will contain a QE (quality-weighted evidence) field
rather than the default QP (quality-weighted percentage of evidence)
field.
- --tracks
- This option writes many track files (*.bed, *.wig) suitable for
viewing in a genome browser.
- --variant
- Sets up heuristics for variant calling, as opposed to assembly
improvement; equivalent to "--vcf --fix all,breaks".
- --chunksize
- Input FASTA elements larger than this will be processed in smaller pieces
not to exceed this size (default 10000000).
- --diploid
- Sample is from a diploid organism; will eventually affect calling of
heterozygous SNPs.
- --fix fixlist
- A comma-separated list of categories of issues to try to fix:
- "snps": try to fix individual base errors; "indels":
try to fix small indels; "gaps": try to fill gaps;
"local": try to detect and fix local misassemblies;
"all": all of the above (default); "bases": shorthand
for "snps" and "indels" (for back compatibility);
"none": none of the above; new fasta file will not be
written.
- The following are experimental fix types:
- "amb": fix ambiguous bases in fasta output (to most likely
alternative); "breaks": allow local reassembly to open new gaps
(with "local"); "circles": try to close circular
elements when used with long corrected reads; "novel": assemble
novel sequence from unaligned non-jump reads.
- --dumpreads
- Dump reads for local re-assemblies.
- --duplicates
- Use reads marked as duplicates in the input BAMs (ignored by
default).
- --iupac
- Output IUPAC ambiguous base codes in the output FASTA file when
appropriate.
- --nonpf
- Use reads which failed sequencer quality filtering (ignored by
default).
- --targets targetlist
- Only process the specified target(s). Targets are comma-separated, and
each target is a FASTA element name optionally followed by a base range.
Example: "scaffold00001,scaffold00002:10000-20000" would result in
processing all of scaffold00001 and coordinates 10000-20000 of
scaffold00002. If "targetlist" is the name of a file, each line
will be treated as a target specification.
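To illustrate the target syntax, here is a minimal Python sketch that splits a targetlist into (name, start, end) tuples. It is not Pilon's own parser: the function names are hypothetical, and the handling of blank lines in a target file and of names containing a colon are assumptions.

```python
import os


def parse_target(spec):
    """Parse 'name' or 'name:start-end' into (name, start, end).

    start and end are None when no base range is given.
    """
    spec = spec.strip()
    name, sep, span = spec.rpartition(":")
    if sep:
        start, dash, end = span.partition("-")
        if dash and start.isdigit() and end.isdigit():
            return (name, int(start), int(end))
    # No colon, or the text after the colon is not a range:
    # assume the whole string is the element name.
    return (spec, None, None)


def parse_target_list(targetlist):
    """Expand a targetlist value into a list of parsed targets."""
    if os.path.isfile(targetlist):
        # File form: one target specification per line (blank lines skipped).
        with open(targetlist) as handle:
            specs = [line for line in handle if line.strip()]
    else:
        specs = targetlist.split(",")
    return [parse_target(spec) for spec in specs]


print(parse_target_list("scaffold00001,scaffold00002:10000-20000"))
```

For the example string from the option description, this yields one whole-element target for scaffold00001 and one ranged target (10000 to 20000) for scaffold00002.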
- --threads
- Degree of parallelism to use for certain processing (default 1).
Experimental.
- --verbose
- More verbose output.
- --debug
- Debugging output (implies verbose).
- --version
- Print version string and exit.
- --defaultqual qual
- Assume bases are of this quality if quality values are not present in the
input BAMs (default 15).
- --flank nbases
- Controls how much of the well-aligned reads will be used; this many bases
at each end of the good reads will be ignored (default 10).
- --gapmargin
- Closed gaps must be within this number of bases of the true size to be
closed (default 100000).
- --K
- Kmer size used by internal assembler (default 47).
- --mindepth depth
- Variants (snps and indels) will only be called if there is coverage of
good pairs at this depth or more; if this value is >= 1, it is an
absolute depth, if it is a fraction < 1, then minimum depth is computed
by multiplying this value by the mean coverage for the region, with a
minimum value of 5 (default 0.1: min depth to call is 10% of mean coverage
or 5, whichever is greater).
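The threshold rule above can be sketched as a small Python function. This is an illustration of the documented behaviour, not Pilon's source: the function name is hypothetical, and truncating the fractional product to an integer is an assumption.

```python
def effective_min_depth(mindepth=0.1, mean_coverage=0.0, floor=5):
    """Return the depth needed to call a variant, per the --mindepth rule.

    mindepth >= 1 is an absolute depth; mindepth < 1 is a fraction of the
    region's mean coverage, never dropping below `floor` (5).
    """
    if mindepth >= 1:
        return int(mindepth)
    # Assumption: the scaled depth is truncated to a whole number.
    return max(floor, int(mindepth * mean_coverage))


print(effective_min_depth(0.1, mean_coverage=200))  # fraction of mean coverage
print(effective_min_depth(0.1, mean_coverage=30))   # floor of 5 applies
print(effective_min_depth(8, mean_coverage=200))    # absolute depth
```

With the default of 0.1, a region with mean coverage 200 needs a depth of 20, while a region with mean coverage 30 falls back to the minimum of 5.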
- --mingap
- Minimum size for unclosed gaps (default 10)
- --minmq
- Minimum alignment mapping quality for a read to count in pileups (default
0)
- --minqual
- Minimum base quality to consider for pileups (default 0)
- --nostrays
- Skip making a pass through the input BAM files to identify stray pairs,
that is, those pairs in which both reads are aligned but not marked valid
because they have inconsistent orientation or separation. Identifying
stray pairs can help fill gaps and assemble larger insertions, especially
of repeat content. However, doing so sometimes consumes considerable
memory.
This manpage was written by Andreas Tille for the Debian
distribution and can be used for any other usage of the program.