obiaddtaxids - description of obiaddtaxids
The obiaddtaxids command annotates sequence records with a
taxid based on a taxon scientific name stored in the sequence record
header.
Taxonomic information linking a taxid to a taxon scientific
name is stored in a database formatted as an ecoPCR database (see
obitaxonomy) or an NCBI taxdump (see the NCBI FTP site).
The taxon scientific name can be extracted from the sequence
record header in one of three ways:
- By default, the sequence identifier is used. Underscore characters
(_) are replaced by spaces before the taxon scientific name is
looked up in the taxonomic database.
- If the input file is an OBITools extended fasta format, the
-k option specifies the attribute containing the taxon scientific
name.
- If the input file is a fasta file imported from the UNITE or
SILVA web sites, the -f option specifies this source so that the
associated taxonomic information is parsed correctly.
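The first two extraction modes can be illustrated with a short Python
sketch. It is not the OBITools implementation: the function
extract_taxon_name and its parameters are hypothetical names chosen for
this illustration, and the record header is assumed to be already parsed
into an identifier and a dictionary of attributes.

```python
def extract_taxon_name(identifier, attributes, key=None):
    """Return the candidate scientific name for one sequence record.

    identifier -- sequence identifier taken from the record header
    attributes -- dict of key/value annotations from the record header
    key        -- attribute name as given to -k, or None for the default
    """
    if key is not None:
        # -k mode: read the name from the named attribute.
        # Returns None when the attribute is absent (an assumption;
        # the documentation does not describe this case).
        return attributes.get(key)
    # Default mode: use the identifier, with underscores turned into spaces.
    return identifier.replace("_", " ")
```

For example, a record whose identifier is Homo_sapiens yields the name
"Homo sapiens" in the default mode.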
For each sequence record, obiaddtaxids tries to match the
extracted taxon scientific name with those stored in the taxonomic
database.
- If a match is found, the sequence record is annotated with the
corresponding taxid.
Otherwise,
- If the -g option is set and the taxon name is composed of two words
and only the first one is found in the taxonomic database at the
‘genus’ rank, obiaddtaxids considers that it found
the genus associated with this sequence record and it stores this sequence
record in the file specified by the -g option.
- If the -u option is set and no taxonomic information is retrieved
from the scientific taxon name, the sequence record is stored in the file
specified by the -u option.
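The decision rules above can be summarised in a small Python sketch. It
only mirrors the behaviour described in this page and is not the
OBITools code: the function classify, the two lookup structures, and the
destination labels are hypothetical, and the taxid in the usage note is
a made-up number. What happens to a record when neither -g nor -u
applies is not stated here, so the sketch simply labels it "unrouted".

```python
def classify(name, name_to_taxid, genus_names,
             genus_file=False, unidentified_file=False):
    """Return (destination, taxid) for one extracted taxon name.

    name_to_taxid     -- dict mapping scientific names to taxids
    genus_names       -- set of names present at the 'genus' rank
    genus_file        -- True if the -g option is set
    unidentified_file -- True if the -u option is set
    """
    taxid = name_to_taxid.get(name)
    if taxid is not None:
        # Exact match: the record is annotated with the taxid.
        return "identified", taxid
    words = name.split()
    if genus_file and len(words) == 2 and words[0] in genus_names:
        # Two-word name whose first word is a known genus.
        return "genus", None
    if unidentified_file:
        # No taxonomic information retrieved from the name.
        return "unidentified", None
    # Not covered by the documentation; label chosen for this sketch.
    return "unrouted", None
```

With name_to_taxid = {"Abies alba": 1001} and genus_names = {"Abies"},
the name "Abies alba" is identified, while "Abies unknownensis" goes to
the genus file when -g is set.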
Example
> obiaddtaxids -k species_name -g genus_identified.fasta \
-u unidentified.fasta -d my_ecopcr_database \
my_sequences.fasta > identified.fasta
This command tries to match the value associated with the species_name
key of each sequence record from the my_sequences.fasta file with a
taxon name from the ecoPCR database my_ecopcr_database.
- If there is an exact match, the sequence record is stored in the
identified.fasta file.
- If not, and the species_name value is composed of two words,
obiaddtaxids considers the first word as a genus name and tries to
find it in the taxonomic database.
- If a genus is found, the sequence record is stored in the
genus_identified.fasta file.
- Otherwise the sequence record is stored in the unidentified.fasta
file.
- -f <FORMAT>,
--format=<FORMAT>
- Format of the sequence file. Possible formats are:
- raw: for regular OBITools extended fasta files (default
value).
- UNITE: for fasta files downloaded from the UNITE web
site.
- SILVA: for fasta files downloaded from the SILVA web
site.
- -k <KEY>,
--key-name=<KEY>
- Key of the attribute containing the taxon name in sequence files in the
OBITools extended fasta format.
- -a <ANCESTOR>,
--restricting_ancestor=<ANCESTOR>
- Restricts the taxid search to taxa under the specified
ancestor.
<ANCESTOR> can be a taxid (integer) or a
key (string).
- If it is a taxid, this taxid is used to restrict the search
for all the sequence records.
- If it is a key, obiaddtaxids looks for the ancestor taxid in
the corresponding attribute. This allows having a different ancestor
restriction for each sequence record.
- --DEBUG
- Sets logging in debug mode.
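The two interpretations of <ANCESTOR> described for the -a option can be
sketched in Python as follows. This is an illustration under stated
assumptions, not the OBITools implementation: resolve_ancestor is a
hypothetical helper, the attribute value is assumed to hold an integer
taxid, and returning None when the attribute is missing is a choice made
for this sketch because the documentation does not cover that case.

```python
def resolve_ancestor(ancestor, attributes):
    """Return the restricting ancestor taxid for one sequence record.

    ancestor   -- value given to -a: a taxid (integer) or a key (string)
    attributes -- dict of key/value annotations from the record header
    """
    try:
        # Integer value: the same taxid applies to every record.
        return int(ancestor)
    except ValueError:
        # Otherwise treat it as a key and read the taxid from the record.
        value = attributes.get(ancestor)
        return int(value) if value is not None else None
```

Here family_taxid in resolve_ancestor("family_taxid", record_attributes)
is an example attribute name, not one defined by OBITools.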
The OBITools Development Team - LECA
2019 - 2015, OBITools Development Team