ecotag - description of ecotag
ecotag is the tool that assigns sequences to a taxon based
on sequence similarity. The program first searches the reference database
for the reference sequence(s) (hereafter referred to as ‘primary
reference sequence(s)’) showing the highest similarity with the query
sequence. Then it looks for all other reference sequences (hereafter
referred to as ‘secondary reference sequences’) whose
similarity with the primary reference sequence(s) is equal to or higher than
the similarity between the primary reference and the query sequences.
Finally, it assigns the query sequence to the most recent common ancestor of
the primary and secondary reference sequences.
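The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ecotag's implementation: the `identity` function is a naive position-by-position comparison standing in for ecotag's own alignment score, and the `PARENT` taxonomy, its taxids and the toy sequences are invented for the example.

```python
# Toy taxonomy (hypothetical taxids): taxid -> parent taxid; the root is its own parent.
PARENT = {1: 1, 10: 1, 20: 10, 21: 10, 30: 20, 31: 20, 32: 21}


def identity(a, b):
    """Fraction of matching positions; a stand-in for a real alignment score."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def lineage(taxid):
    """Path from a taxid up to the root, the taxid itself first."""
    path = [taxid]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxids):
    """Deepest taxid shared by the lineages of all given taxids."""
    taxids = list(taxids)
    shared = set(lineage(taxids[0]))
    for taxid in taxids[1:]:
        shared &= set(lineage(taxid))
    # Walking up from any taxid, the first shared node is the deepest one.
    return next(t for t in lineage(taxids[0]) if t in shared)


def assign(query, references):
    """references: list of (sequence, taxid). Returns (taxid, best identity)."""
    scores = [identity(query, seq) for seq, _ in references]
    best = max(scores)
    # Step 1: primary references are those with the highest similarity to the query.
    primary = [ref for ref, score in zip(references, scores) if score == best]
    selected = set()
    for primary_seq, primary_taxid in primary:
        selected.add(primary_taxid)
        # Step 2: secondary references are at least as similar to the primary
        # reference as the primary reference is to the query.
        for seq, taxid in references:
            if identity(primary_seq, seq) >= best:
                selected.add(taxid)
    # Step 3: assign the query to the most recent common ancestor.
    return lowest_common_ancestor(selected), best


QUERY = "ACGTACGTAC"
REFERENCES = [("ACGTACGTAA", 30), ("ACGTACGTTA", 31), ("ACGTTTTTTT", 32)]
print(assign(QUERY, REFERENCES))  # (20, 0.9)
```

Here the first reference is the primary one (identity 0.9 with the query), the second is a secondary reference (identity 0.9 with the primary), and the third is too distant to be considered, so the query is assigned to taxid 20, the parent shared by taxids 30 and 31.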
As input, ecotag requires the sequences to be assigned, a
reference database in fasta format, where each sequence is associated with a
taxon identified by a unique taxid, and a taxonomy database where
taxonomic information is stored for each taxid.
Example:
> ecotag -d embl_r113 -R ReferenceDB.fasta \
--sort=count -m 0.95 -r seq.fasta > seq_tag.fasta
The above command specifies that each sequence stored in
seq.fasta is compared to those in the reference database called
ReferenceDB.fasta for taxonomic assignment. In the output file
seq_tag.fasta, the sequences are sorted from highest to lowest
counts. When there is no reference sequence with a similarity equal to or
higher than 0.95 for a given sequence, no taxonomic information is provided
for this sequence in seq_tag.fasta.
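The combined effect of -m 0.95 and --sort=count -r in the command above can be pictured with the following sketch. It is not ecotag code: records are plain dictionaries, the `hit` field (a precomputed best taxid and identity) and the function name `tag_records` are hypothetical, and the choice to leave out every taxonomic key below the threshold is an assumption drawn from the description.

```python
def tag_records(records, minimum_identity):
    """Tag records that reach the identity threshold, then sort by count."""
    output = []
    for record in records:
        taxid, identity = record["hit"]
        tagged = {"id": record["id"], "count": record["count"]}
        # Below the threshold the record is kept but carries no taxonomic tags.
        if identity >= minimum_identity:
            tagged["taxid"] = taxid
            tagged["best_identity"] = identity
        output.append(tagged)
    # Equivalent of --sort=count together with -r: highest count first.
    return sorted(output, key=lambda r: r["count"], reverse=True)


records = [
    {"id": "s1", "count": 3, "hit": (30, 0.99)},
    {"id": "s2", "count": 12, "hit": (31, 0.90)},
]
for tagged in tag_records(records, minimum_identity=0.95):
    print(tagged)
```

The record s2 comes first because of its higher count, and it is the only one left without a taxid because its best identity (0.90) is below 0.95.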
- -m FLOAT, --minimum-identity=FLOAT
- When the best match with the reference database presents an identity level
below FLOAT, the taxonomic assignment for the sequence record is not
computed. The sequence record is nevertheless included in the output file.
FLOAT must lie in the [0,1] interval.
- --minimum-circle=FLOAT
- Minimum identity considered for the assignment circle. FLOAT must lie
in the [0,1] interval.
- -u, --uniq
- When this option is specified, the program first dereplicates the sequence
records to work on unique sequences only. This option greatly improves the
program’s speed, especially for highly redundant datasets.
- --sort=<KEY>
- The output is sorted based on the values of the relevant attribute.
- -r, --reverse
- The output is sorted in reverse order (should be used with the
--sort option). The option is accepted even when --sort is not set,
but the key on which the output is then sorted is not documented.
- -E FLOAT, --errors=FLOAT
- FLOAT is the fraction of reference sequences that will be ignored when
looking for the lowest common ancestor. This option is useful when a
non-negligible proportion of reference sequences is expected to be
assigned to the wrong taxon, for example because of taxonomic
misidentification. FLOAT is included in a [0,1] interval.
- -M INTEGER, --min-matches=INTEGER
- Defines the minimum number of congruent assignments. If this minimum is
reached and the -E option is set, the lowest common ancestor algorithm
tolerates that some sequences do not provide the same taxonomic annotation
(see the -E option).
- --cache-size=INTEGER
- A cache for computed similarities is maintained by ecotag. The
default size of this cache is 1,000,000 scores. This option changes the
cache size.
- --skip <N>
- The first N sequence records of the file are discarded from the analysis
and not reported to the output file.
- --only <N>
- Only the next N sequence records of the file are analyzed. The following
sequences in the file are neither analyzed nor reported to the output
file. This option can be used together with the --skip option.
- --embl
- Input file is in embl format.
- --fasta
- Input file is in fasta format (including OBITools fasta extensions).
- --sanger
- Input file is in Sanger fastq format (standard fastq used by HiSeq/MiSeq
sequencers).
- --solexa
- Input file is in fastq format produced by Solexa (Ga IIx) sequencers.
- --ecopcr
- Input file is in ecoPCR format.
- --nuc
- Input file contains nucleic sequences.
- --prot
- Input file contains protein sequences.
- --fasta-output
- Output sequences in OBITools fasta format.
- --ecopcrdb-output=<PREFIX_FILENAME>
- Creates an ecoPCR database from sequence records results.
- --uppercase
- Print sequences in upper case (default is lower case).
- --DEBUG
- Sets logging in debug mode.
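Among the options above, -E is the one that changes the ancestor search itself. One possible reading of "the fraction of reference sequences that will be ignored" is sketched below: the assignment goes to the deepest taxon that still covers at least a fraction (1 - FLOAT) of the selected reference sequences. This is an assumption made for illustration and may differ from what ecotag actually does; the `PARENT` taxonomy, its taxids and the function name `tolerant_lca` are invented, and the interaction with -M is not modelled.

```python
import math
from collections import Counter

# Toy taxonomy (hypothetical taxids): taxid -> parent taxid; the root is its own parent.
PARENT = {1: 1, 10: 1, 20: 10, 21: 10, 30: 20, 31: 20, 32: 21}


def lineage(taxid):
    """Path from a taxid up to the root, the taxid itself first."""
    path = [taxid]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def tolerant_lca(taxids, errors=0.0):
    """Deepest taxon whose subtree holds at least (1 - errors) of the taxids."""
    taxids = list(taxids)
    needed = math.ceil((1.0 - errors) * len(taxids))
    support = Counter()
    for taxid in taxids:
        # Each reference sequence supports its own taxon and all its ancestors.
        support.update(lineage(taxid))
    candidates = [t for t, n in support.items() if n >= needed]
    # A deeper taxon has a longer lineage; the root is always a candidate.
    return max(candidates, key=lambda t: len(lineage(t)))


selected = [30, 30, 31, 32]  # taxids of the selected reference sequences
print(tolerant_lca(selected))               # 10
print(tolerant_lca(selected, errors=0.25))  # 20
```

With no tolerance, the single reference annotated with taxid 32 pulls the assignment up to taxid 10. Ignoring a quarter of the references lets the assignment stay at taxid 20, which covers the three remaining sequences.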
Sequence attributes added by ecotag:
- best_identity
- best_match
- family
- family_name
- genus
- genus_name
- id_status
- order
- order_name
- rank
- scientific_name
- species
- species_list
- species_name
- taxid
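In OBITools fasta output, attributes such as those listed above are stored on the title line of each record as `key=value;` pairs after the sequence identifier. The sketch below reads them back with the standard library only. The helper name `parse_header` is hypothetical, the header is a made-up example rather than real ecotag output, and values are returned as plain strings without any type conversion.

```python
import re

# One "key=value;" pair; values are assumed not to contain a semicolon.
ATTRIBUTE = re.compile(r"(\w+)=([^;]*);")


def parse_header(header):
    """Split an OBITools-style fasta title line into (identifier, attributes)."""
    seq_id, _, rest = header.lstrip(">").partition(" ")
    attributes = {key: value.strip() for key, value in ATTRIBUTE.findall(rest)}
    return seq_id, attributes


# Invented header, for illustration only.
HEADER = (">seq1 count=12; taxid=9606; scientific_name=Homo sapiens; "
          "rank=species; best_identity=0.98;")

seq_id, attributes = parse_header(HEADER)
print(seq_id, attributes["taxid"], attributes["scientific_name"])
```

A record that received no assignment (see the -m option) would simply lack the taxonomic keys, so `attributes.get("taxid")` is a convenient way to test for it.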
The OBITools Development Team - LECA
2019 - 2015, OBITool Development Team