obisample - description of obisample
obisample randomly resamples sequence records with or
without replacement.
- -s ###, --sample-size ###
Specifies the size of the generated sample.
- without the -a option, sample size is expressed as the exact number
of sequence records to be sampled (default: number of sequence records in
the input file).
- with the -a option, sample size is expressed as a fraction of the
number of sequence records in the input file (a number between 0
and 1).
Example:
> obisample -s 1000 seq1.fasta > seq2.fasta
Randomly samples 1000 sequence records from the seq1.fasta
file, with replacement, and saves them in the seq2.fasta file.
- -a, --approx-sampling
Switches the resampling algorithm to an approximate
one, useful for large files.
The default algorithm selects exactly the number of sequence
records specified with the -s option. When the -a option is
set, each sequence record is instead selected with a probability that
depends on the count attribute of the sequence record and on the -s
fraction.
Example:
> obisample -s 0.5 -a seq1.fastq > seq2.fastq
Randomly samples approximately half of the sequence records of the
seq1.fastq file, without replacement, and saves them in the
seq2.fastq file.
- -w, --without-replacement
Asks for sampling without replacement.
Example:
> obisample -s 1000 -w seq1.fasta > seq2.fasta
Randomly samples 1000 sequence records from the seq1.fasta
file, without replacement (the input file must contain at least 1000
sequences), and saves them in the seq2.fasta file.
- --skip <N>
- The first N sequence records of the file are discarded from the analysis
and not written to the output file.
- --only <N>
- Only the next N sequence records of the file are analyzed. The remaining
sequences in the file are neither analyzed nor written to the output
file. This option can be combined with the --skip option.
- --embl
- Input file is in embl format.
- --fasta
- Input file is in fasta format (including OBITools fasta extensions).
- --sanger
- Input file is in Sanger fastq format (standard fastq used by HiSeq/MiSeq
sequencers).
- --solexa
- Input file is in fastq format produced by Solexa (Ga IIx) sequencers.
- --ecopcr
- Input file is in ecoPCR format.
- --nuc
- Input file contains nucleic sequences.
- --prot
- Input file contains protein sequences.
- --DEBUG
- Sets logging in debug mode.
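The three sampling modes described for the -s, -a and -w options can be illustrated with a short Python sketch. This is not the obisample source code: the function names sample_exact and sample_approx are hypothetical, records are plain list items, and the approximate mode assumes every record has a count of 1, so the selection probability is simply the -s fraction.

```python
import random


def sample_exact(records, size=None, replace=True, rng=random):
    """Draw exactly `size` records (hypothetical helper, not obisample code).

    With replace=True a record may be drawn several times, as in the
    default mode; with replace=False each record is drawn at most once,
    as with -w.
    """
    records = list(records)
    if size is None:
        # Default sample size: the number of records in the input.
        size = len(records)
    if replace:
        return [rng.choice(records) for _ in range(size)]
    if size > len(records):
        raise ValueError("sample size exceeds the number of records")
    return rng.sample(records, size)


def sample_approx(records, fraction, rng=random):
    """Keep each record independently with probability `fraction`.

    Assumption: every record has count 1. The result size is only
    approximately fraction * number of records.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    return [record for record in records if rng.random() < fraction]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    records = [f"seq{i}" for i in range(2000)]
    print(len(sample_exact(records, 1000, rng=rng)))                 # like -s 1000
    print(len(sample_exact(records, 1000, replace=False, rng=rng)))  # like -s 1000 -w
    print(len(sample_approx(records, 0.5, rng=rng)))                 # like -s 0.5 -a
```

The approximate mode makes a single pass and never needs the total number of records in advance, which is one plausible reason such a scheme suits large files; the price is that the output size varies from run to run.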
The OBITools Development Team - LECA
2019 - 2015, OBITool Development Team