obiselect - description of obiselect
The obiselect command selects a subset of sequence
records from a sequence file by describing sequence record groups and
defining how many and which sequence records from each group must be
retrieved.
In each group, as defined by a set of -c options, sequence
records are ordered according to a score function. The first N
sequences (N is set with the -n option) are kept in the resulting
subset of sequence records.
By default, the score function is a random function and one
sequence record is retrieved per group. As a result, one sequence is
selected at random from each group.
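The grouping and selection logic described above can be sketched in Python. This is a minimal illustration and not obiselect's actual implementation: sequence records are represented as plain dictionaries, and the names `select_per_group`, `keys`, `n`, `score` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def select_per_group(records, keys, n=1, score=None, seed=None):
    """Keep the n best-scoring records of each group (illustrative only).

    records: list of dicts standing in for sequence records (assumption)
    keys:    attribute names defining a group, like repeated -c options
    n:       records kept per group, like the -n option
    score:   function record -> number; random when None (the default)
    """
    rng = random.Random(seed)
    if score is None:
        score = lambda record: rng.random()

    # Group records by the tuple of their category attribute values.
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record)

    selected = []
    for members in groups.values():
        # Order each group by score; slicing returns the whole group
        # when it holds fewer than n records.
        selected.extend(sorted(members, key=score)[:n])
    return selected


records = [
    {"id": "s1", "sample": "A", "seq_length": 100},
    {"id": "s2", "sample": "A", "seq_length": 100},
    {"id": "s3", "sample": "A", "seq_length": 120},
    {"id": "s4", "sample": "B", "seq_length": 100},
]
subset = select_per_group(records, keys=["sample", "seq_length"], n=1, seed=0)
print(sorted(r["id"] for r in subset))
```

With the default random score, the call keeps one record from each of the three sample/length groups, so either s1 or s2 is kept along with s3 and s4.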
- -c <KEY>,
--category-attribute=<KEY>
Attribute used to categorize the sequence records.
Several -c options can be combined.
TIP:
The <KEY> can be simply the key of an
attribute, or a Python expression, as with the -p option of
obigrep.
Example:
> obiselect -c sample -c seq_length seq.fasta
This command randomly selects one sequence record per combination of
sample and sequence length from the sequence records in the seq.fasta
file. The selected sequence records are printed to the screen.
- -n <INTEGER>,
--number=<INTEGER>
Indicates how many sequence records per group have to be
retrieved. If the group contains fewer records than this number, the
whole group is retrieved.
Example:
> obiselect -n 2 -c sample -c seq_length seq.fasta
This command has the same effect as the previous example, except
that two sequences are retrieved per sample/length class.
- --merge=<KEY>
- Attribute to merge.
Example:
> obiselect -c seq_length -n 2 --merge=sample seq1.fasta > seq2.fasta
This command keeps two sequences per sequence length, and records
how many times they were observed for each sample in the new attribute
merged_sample.
- --merge-ids
- Adds a merged attribute containing the list of sequence record ids
merged within this group.
- -m, --min
- Sets the function used for scoring sequence records within a group to the
minimum function. The minimum function is applied to the values used to
define categories (see option -c). Sequences will be ordered
according to the distance of their values to the minimum value.
- -M, --max
- Sets the function used for scoring sequence records within a group to the
maximum function. The maximum function is applied to the values used to
define categories (see option -c). Sequences will be ordered
according to the distance of their values to the maximum value.
- -a, --mean
- Sets the function used for scoring sequence records within a group to the
mean function. The mean function is applied to the values used to define
categories (see option -c). Sequences will be ordered according to
the distance of their values to the mean value.
- --median
- Sets the function used for scoring sequence records within a group to the
median function. The median function is applied to the values used to
define categories (see option -c). Sequences will be ordered
according to the distance of their values to the median value.
- -f FUNCTION,
--function=FUNCTION
- Sets the function used for scoring sequence records within a group to a
user-defined function. The user-defined function is declared using Python
syntax. Attribute keys can be used as variables. An extra sequence
variable representing the full sequence record is available. If an option for
loading a taxonomy database is provided, a taxonomy variable is
also available. The function is evaluated for each sequence record, and the
minimum value of this function is computed in each group. Sequences are
ordered in each group according to the distance between their function value
and the minimum value of their group.
- --skip <N>
- The first N sequence records of the file are discarded from the analysis
and not reported to the output file.
- --only <N>
- Only the next N sequence records of the file are analyzed. The following
sequences in the file are neither analyzed nor reported to the output
file. This option can be used together with the --skip option.
- --embl
- Input file is in embl format.
- --fasta
- Input file is in fasta format (including OBITools fasta extensions).
- --sanger
- Input file is in Sanger fastq format (standard fastq used by HiSeq/MiSeq
sequencers).
- --solexa
- Input file is in fastq format produced by Solexa (Ga IIx) sequencers.
- --ecopcr
- Input file is in ecoPCR format.
- --nuc
- Input file contains nucleic sequences.
- --prot
- Input file contains protein sequences.
- --fasta-output
- Output sequences in OBITools fasta format
- --ecopcrdb-output=<PREFIX_FILENAME>
- Creates an ecoPCR database from sequence records results
- --uppercase
- Print sequences in upper case (default is lower case)
- --DEBUG
- Sets logging in debug mode.
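The ordering used by the --min, --max, --mean and --median options above can be sketched in Python. This is only an illustration under stated assumptions, not obiselect's code: "distance" is taken to mean the absolute difference, the scored value is assumed to be a single numeric attribute, and the name `rank_by_distance` is hypothetical.

```python
import statistics


def rank_by_distance(values, reference=min):
    """Order the indices of values by distance to a reference statistic.

    reference: a function of the group's values, e.g. min, max,
    statistics.mean or statistics.median (mirroring --min, --max,
    --mean and --median). Ties keep their input order.
    """
    target = reference(values)
    # Assumption: distance is the absolute difference to the target.
    return sorted(range(len(values)), key=lambda i: abs(values[i] - target))


lengths = [98, 100, 101, 120]  # hypothetical seq_length values of one group

print(rank_by_distance(lengths, min))                # closest to 98 first
print(rank_by_distance(lengths, max))                # closest to 120 first
print(rank_by_distance(lengths, statistics.mean))    # closest to the mean first
print(rank_by_distance(lengths, statistics.median))  # closest to the median first
```

Under the same assumptions, the -f option would correspond to computing a user-supplied value for each record and then ranking those values with `reference=min`.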
- class
- distance
- merged
- class
- merged_*
- select
The OBITools Development Team - LECA
2019 - 2015, OBITool Development Team