obiannotate - description of obiannotate
obiannotate adds, modifies, or removes annotation attributes attached to
sequence records.
Once such attributes are added, they can be used by the other
OBITools commands for filtering or for computing statistics.
Example 1:
> obiannotate -S short:'len(sequence)<100' seq1.fasta > seq2.fasta
The above command adds an attribute named short which has a
boolean value indicating whether the sequence length is less than 100bp.
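The expression given to -S is ordinary Python evaluated for each record. As a rough illustration only (this is not OBITools code: the annotate_short helper is hypothetical and plain strings stand in for sequence records), the value stored in short can be reproduced as follows:

```python
# Illustrative sketch only: mimics the value that the expression
# 'len(sequence)<100' would produce. annotate_short is a hypothetical
# helper and plain strings stand in for OBITools sequence records.

def annotate_short(sequence, threshold=100):
    """Return the boolean that would be stored in the 'short' attribute."""
    return len(sequence) < threshold

# Two made-up records: 40 and 120 nucleotides long.
records = {"seq_a": "acgt" * 10, "seq_b": "acgt" * 30}
short = {name: annotate_short(seq) for name, seq in records.items()}
print(short)  # {'seq_a': True, 'seq_b': False}
```

A sequence of exactly 100 nucleotides gets the value False, since the comparison is strict.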
Example 2:
> obiannotate --seq-rank seq1.fasta | \
obiannotate -C --set-identifier '"FungA_%05d" % seq_rank' \
> seq2.fasta
The above command adds a new attribute whose value is the sequence
record entry number in the file. Then it clears all the sequence record
attributes and sets the identifier to a string beginning with FungA_
followed by a suffix with 5 digits containing the sequence entry number.
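The identifier is built with Python's printf-style string formatting. The sketch below shows what that expression yields for a few made-up seq_rank values; make_identifier is a hypothetical helper, and whether obiannotate starts numbering at 0 or 1 is not stated here, so the ranks are assumptions.

```python
# Illustrative sketch: the formatting expression used with
# --set-identifier, applied to hypothetical seq_rank values.

def make_identifier(seq_rank, prefix="FungA"):
    # %05d pads the rank with leading zeros to a minimum width of 5.
    return "%s_%05d" % (prefix, seq_rank)

identifiers = [make_identifier(rank) for rank in (1, 42, 12345)]
print(identifiers)  # ['FungA_00001', 'FungA_00042', 'FungA_12345']
```

The width of 5 is a minimum: a rank above 99999 would simply produce a longer suffix.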
Example 3:
> obiannotate -d my_ecopcr_database \
--with-taxon-at-rank=genus seq1.fasta > seq2.fasta
The above command adds taxonomic information at the genus
rank to the sequence records.
Example 4:
> obiannotate -S 'new_seq:str(sequence).replace("a","t")' \
seq1.fasta | obiannotate --set-sequence new_seq > seq2.fasta
The overall aim of the above command is to edit the
sequence object itself, by replacing all nucleotides a by
nucleotides t. First, a new attribute named new_seq is
created, which contains the modified sequence, and then the former sequence
is replaced by the modified one.
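The substitution itself is a plain Python string operation. A minimal sketch, using a hypothetical helper and a plain string in place of a sequence record:

```python
# Illustrative sketch of the expression str(sequence).replace("a","t").
# replace_a_by_t is a hypothetical helper; a plain string stands in
# for the sequence record.

def replace_a_by_t(sequence):
    # str.replace substitutes every occurrence and is case sensitive.
    return str(sequence).replace("a", "t")

new_seq = replace_a_by_t("acgtaa")
print(new_seq)  # tcgttt
```

Because str.replace is case sensitive, this expression only affects lower-case a characters; an upper-case sequence would be left unchanged by it.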
- --seq-rank
- Adds a new attribute named seq_rank to the sequence record
indicating its entry number in the sequence file.
- --delete-tag=<KEY>
- Deletes the attribute named <KEY>. When this attribute is
missing, the sequence record is skipped and the next one is examined.
- --tag-list=<FILENAME>
- <FILENAME> points to a file containing attribute names and values to
modify for specified sequence records.
- -C, --clear
- Clears all attributes associated with the sequence records.
- --length
- Adds an attribute with seq_length as key and the sequence length as
value.
- -m <MCLFILE>,
--mcl=<MCLFILE>
- Creates a new attribute containing the number of the cluster the sequence
record was assigned to, as indicated in file <MCLFILE>.
- --uniq-id
- Forces sequence record ids to be unique.
- -s <REGULAR_PATTERN>,
--sequence=<REGULAR_PATTERN>
Regular expression pattern to be tested against the
sequence itself. The pattern is case insensitive.
Examples:
> obigrep -s 'GAATTC' seq1.fasta > seq2.fasta
Selects only the sequence records that contain an EcoRI
restriction site.
> obigrep -s 'A{10,}' seq1.fasta > seq2.fasta
Selects only the sequence records that contain a stretch of at
least 10 consecutive A nucleotides.
> obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta
Selects only the sequence records that do not contain ambiguous
nucleotides.
- -D <REGULAR_PATTERN>,
--definition=<REGULAR_PATTERN>
Regular expression pattern to be tested against the
definition of the sequence record. The pattern is case sensitive.
Example:
> obigrep -D '[Cc]hloroplast' seq1.fasta > seq2.fasta
Selects only the sequence records whose definition contains
chloroplast or Chloroplast.
- -I <REGULAR_PATTERN>,
--identifier=<REGULAR_PATTERN>
Regular expression pattern to be tested against the
identifier of the sequence record. The pattern is case sensitive.
Example:
> obigrep -I '^GH' seq1.fasta > seq2.fasta
Selects only the sequence records whose identifier begins with
GH.
- --id-list=<FILENAME>
<FILENAME> points to a text file containing
the list of sequence record identifiers to be selected. The file format
consists of a single identifier per line.
Example:
> obigrep --id-list=my_id_list.txt seq1.fasta > seq2.fasta
Selects only the sequence records whose identifier is present in
the my_id_list.txt file.
- -a <KEY>:<REGULAR_PATTERN>,
--attribute=<KEY>:<REGULAR_PATTERN>
Regular expression pattern matched against the attributes
of the sequence record. The value of this option is of the form
key:regular_pattern. The pattern is case sensitive. Several -a options
can be used on the same command line; in that case, the selected
sequence records match all constraints.
Example:
> obigrep -a 'family_name:Asteraceae' seq1.fasta > seq2.fasta
Selects the sequence records containing an attribute whose key is
family_name and value is Asteraceae.
- -A <KEY>,
--has-attribute=<KEY>
Selects sequence records having an attribute whose key is
<KEY>.
Example:
> obigrep -A taxid seq1.fasta > seq2.fasta
Selects only the sequence records having a taxid attribute
defined.
- -p <PYTHON_EXPRESSION>,
--predicat=<PYTHON_EXPRESSION>
Python boolean expression to be evaluated for each
sequence record. The attribute keys defined for each sequence record can be
used in the expression as variable names. An extra variable named
‘sequence’ refers to the sequence record itself. Several -p
options can be used on the same command line; in that case, the
selected sequence records match all constraints.
Example:
> obigrep -p '(forward_error<2) and (reverse_error<2)' \
seq1.fasta > seq2.fasta
Selects only the sequence records whose forward_error and
reverse_error attributes have a value smaller than two.
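The way attribute keys become variable names in a predicate can be pictured with Python's built-in eval. This is only a sketch of the idea, not the OBITools implementation: matches_all is a hypothetical helper, a dict stands in for the record's attributes, and the extra sequence variable is left out.

```python
# Illustrative sketch: evaluating one or more -p style expressions
# against the attributes of a record. matches_all is hypothetical;
# a plain dict stands in for the attributes of a sequence record.

def matches_all(attributes, predicates):
    # Each expression sees the attribute keys as local variable names.
    # A record is kept only if every expression evaluates to true.
    return all(bool(eval(expr, {}, dict(attributes))) for expr in predicates)

record = {"forward_error": 1, "reverse_error": 0}

kept = matches_all(record, ["(forward_error<2) and (reverse_error<2)"])
rejected = matches_all(record, ["forward_error<2", "reverse_error>0"])
print(kept, rejected)  # True False
```

The second call passes two expressions, mirroring several -p options on one command line: the record fails because one of the constraints is not met. Since eval runs arbitrary code, a sketch like this should only be fed expressions from a trusted source.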
- -L <##>,
--lmax=<##>
Keeps sequence records whose sequence length is equal to or
shorter than lmax.
Example:
> obigrep -L 100 seq1.fasta > seq2.fasta
Selects only the sequence records that have a sequence length
equal to or shorter than 100bp.
- -l <##>,
--lmin=<##>
Selects sequence records whose sequence length is equal to
or longer than lmin.
Example:
> obigrep -l 100 seq1.fasta > seq2.fasta
Selects only the sequence records that have a sequence length
equal to or longer than 100bp.
- -v,
--inverse-match
Inverts the sequence record selection.
Example:
> obigrep -v -l 100 seq1.fasta > seq2.fasta
Selects only the sequence records that have a sequence length
shorter than 100bp.
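The pattern and length options of this group, together with -v, can be pictured with the following sketch. It is not taken from the OBITools source: the selected function and its parameter names are hypothetical, plain strings stand in for sequence records, and combining several criteria with a logical AND before -v inverts the result is an assumption of this sketch (the text above states it only for -a and -p).

```python
import re

# Illustrative sketch of how -s, -l, -L and -v might combine.
# 'selected' and its parameters are hypothetical names.

def selected(sequence, pattern=None, lmin=None, lmax=None, inverse=False):
    keep = True
    if pattern is not None:
        # -s: the pattern is tested case-insensitively against the sequence.
        keep = keep and re.search(pattern, sequence, re.IGNORECASE) is not None
    if lmin is not None:
        # -l: length equal to or longer than lmin.
        keep = keep and len(sequence) >= lmin
    if lmax is not None:
        # -L: length equal to or shorter than lmax.
        keep = keep and len(sequence) <= lmax
    # -v: invert the final decision.
    return (not keep) if inverse else keep

print(selected("ttgaattcaa", pattern="GAATTC"))      # True
print(selected("acgtn", pattern="^[ACGT]+$"))        # False
print(selected("a" * 100, lmin=100))                 # True
print(selected("a" * 99, lmin=100, inverse=True))    # True
```

The last call corresponds to the -v -l 100 example: a record of 99 nucleotides fails the length test and is therefore kept once the selection is inverted.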
- --skip
<N>
- The first N sequence records of the file are discarded from the analysis
and are not reported to the output file.
- --only
<N>
- Only the next N sequence records of the file are analyzed. The following
sequences in the file are neither analyzed nor reported to the output
file. This option can be used together with the --skip
option.
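Used together, --skip and --only select a window of records. A minimal sketch of that behaviour, assuming that --only counts records after the skipped ones (the restrict helper and the record names are hypothetical):

```python
from itertools import islice

# Illustrative sketch of --skip and --only as a window over the input.
# 'restrict' is a hypothetical helper; strings stand in for records.

def restrict(records, skip=0, only=None):
    # Discard the first 'skip' records, then keep at most 'only' records.
    stop = None if only is None else skip + only
    return list(islice(records, skip, stop))

records = ["r1", "r2", "r3", "r4", "r5", "r6"]
window = restrict(records, skip=2, only=3)
print(window)  # ['r3', 'r4', 'r5']
```

With skip=2 and only=3, records r1 and r2 are discarded, r3 to r5 are analyzed, and r6 is ignored.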
- --embl
- Input file is in embl format.
- --fasta
- Input file is in fasta format (including OBITools fasta extensions).
- --sanger
- Input file is in Sanger fastq format (standard fastq used by HiSeq/MiSeq
sequencers).
- --solexa
- Input file is in fastq format produced by Solexa (Ga IIx) sequencers.
- --ecopcr
- Input file is in ecoPCR format.
- --nuc
- Input file contains nucleic sequences.
- --prot
- Input file contains protein sequences.
- --fasta-output
- Output sequences in OBITools fasta format.
- --ecopcrdb-output=<PREFIX_FILENAME>
- Creates an ecoPCR database from sequence records results.
- --uppercase
- Print sequences in upper case (default is lower case).
- --DEBUG
- Sets logging in debug mode.
- seq_length
- seq_rank
- cluster
- scientific_name
- taxid
- rank
- family
- family_name
- genus
- genus_name
- order
- order_name
- species
- species_name
The OBITools Development Team - LECA
2019 - 2015, OBITools Development Team