PBSIM(1)
pbsim - simulator for PacBio sequencing reads
pbsim [options] <reference.fasta>
The pbsim command produces simulated PacBio reads for the reference FASTA sequence <reference.fasta>.
Model files (arguments for the --model_qc option) can be found in the /usr/share/pbsim/models directory.
The options for pbsim can be divided into general, sampling-based and model-based simulation options.
--prefix
--data-type
--depth
--length-min
--length-max
--accuracy-min
--accuracy-max
--difference-ratio
--seed
--sample-fastq
--sample-profile-id
--model_qc
--length-mean
--length-sd
--accuracy-mean
--accuracy-sd
To run model-based simulation:
pbsim --data-type CLR \
--depth 20 \
--model_qc /usr/share/pbsim/models/model_qc_clr \
reference.fasta
In the example above, simulated read sequences are randomly sampled from a reference sequence ("reference.fasta"), and differences (errors) are then introduced into the sampled reads. The data type is CLR and the coverage depth is 20. If the reference is a multi-FASTA file, simulated data is created for each sequence in it, and three output files are written per sequence. "sd_0001.ref" is a single-FASTA file copied from the reference sequence. "sd_0001.fastq" is the simulated read dataset in FASTQ format. "sd_0001.maf" is a list of alignments between the reference sequence and the simulated reads in MAF format. The length and accuracy of the reads are simulated based on our model of PacBio reads.
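The output naming described above can be previewed before a run. The following is a minimal sketch, not part of pbsim: the helper name expected_outputs is hypothetical, and it assumes that the prefix is "sd" and that sequences are numbered from 0001 in file order, as the example file names suggest.

```shell
# List the output files pbsim is expected to write for a reference FASTA.
# Assumption: the prefix is "sd" and sequences are numbered from 0001 in
# file order, as the file names in the example suggest.
expected_outputs() {
    ref=$1
    prefix=${2:-sd}
    # Each FASTA record starts with a '>' header line.
    n=$(grep -c '^>' "$ref")
    i=1
    while [ "$i" -le "$n" ]; do
        for ext in ref fastq maf; do
            printf '%s_%04d.%s\n' "$prefix" "$i" "$ext"
        done
        i=$((i + 1))
    done
}
```

Calling expected_outputs reference.fasta on a reference with two sequences would list six names, three for each sequence.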
To run sampling-based simulation:
pbsim --data-type CLR \
--depth 20 \
--sample-fastq sample.fastq \
reference.fasta
In sampling-based simulation, the read length and quality scores of each simulated read are the same as those of a read taken at random from the sample PacBio dataset ("sample.fastq").
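To illustrate what taking a read at random from the sample means, the sketch below draws one record from a FASTQ file and prints its length and quality string. It is an illustration only, not pbsim's implementation: the helper name sample_read and its seed argument are hypothetical (the seed is unrelated to pbsim's --seed), and it assumes four lines per FASTQ record.

```shell
# Pick one record at random from a FASTQ file and print its read length
# and quality string, separated by a tab.
# Assumption: 4 lines per record (sequences are not wrapped).
sample_read() {
    fastq=$1
    seed=${2:-1}
    awk -v seed="$seed" '
        NR % 4 == 2 { seq[++n] = $0 }     # sequence line of record n
        NR % 4 == 0 { qual[n] = $0 }      # quality line of the same record
        END {
            if (n == 0) exit 1            # no records found
            srand(seed)
            k = int(rand() * n) + 1       # uniform index in 1..n
            printf "%d\t%s\n", length(seq[k]), qual[k]
        }
    ' "$fastq"
}
```

For example, sample_read sample.fastq 42 prints the length and quality string of one randomly chosen read. The same seed gives the same choice when run with the same awk implementation.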
pbsim is available under the terms of the GNU General Public License, version 2 (GPL-2).
Michiaki Hamada (<mhamada@k.u-tokyo.ac.jp>), Yukiteru Ono