cdbfasta <fastafile> [-o <index_file>] [-r <record_delimiter>]
   [-z <compressed_db>] [-i] [-m|-n <numkeys>|-f<LIST>|-c|-C]
   [-w <stopwords_list>] [-s <stripendchars>] [{-Q|-G}] [-v]
Creates an index file for records from a multi-fasta file. By default
(without -m/-n/-c/-C option), only the first space-delimited token
from the defline is used as a key.
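
The default keying rule above can be sketched in Python as follows. This is only an illustration, not cdbfasta's own code: the helper name `default_key` is hypothetical, and treating any whitespace (not just spaces) as the token separator is an assumption.

```python
def default_key(defline: str, delimiter: str = ">") -> str:
    """Return the first whitespace-delimited token of a defline.

    Hypothetical helper that mirrors the default rule described above.
    """
    # Drop the record delimiter if the line starts with it.
    if defline.startswith(delimiter):
        defline = defline[len(delimiter):]
    # Assumption: split on any run of whitespace, not only single spaces.
    tokens = defline.split()
    return tokens[0] if tokens else ""
```

For a defline such as `>seq1 hypothetical protein`, only `seq1` would become a key under this rule.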
<fastafile> is the multi-fasta file to index;
-o the index file will be named <index_file>; if not given,
   the index filename is database name plus the suffix '.cidx'
-r <record_delimiter> a string of characters at the beginning of line
   marking the start of a record (default: '>')
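
To illustrate what "a string of characters at the beginning of line marking the start of a record" means in practice, here is a minimal Python sketch that finds the byte offset of every record start. The function name `record_offsets` is hypothetical, and the idea that an index would store such offsets is an assumption, not a statement about the '.cidx' format.

```python
def record_offsets(data: bytes, delimiter: bytes = b">") -> list[int]:
    """Return the byte offset of every line that begins with `delimiter`."""
    offsets = []
    pos = 0
    # keepends=True keeps the newline bytes so that `pos` stays a true byte offset.
    for line in data.splitlines(keepends=True):
        if line.startswith(delimiter):
            offsets.append(pos)
        pos += len(line)
    return offsets
```

Note that a delimiter character in the middle of a line does not start a record; only a line that begins with it does.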
-Q treat input as fastq format, i.e. with '@' as record delimiter
   and with records expected to have at least 4 lines
-z database is compressed into the file <compressed_db>
   before indexing (<fastafile> can be "-" or "stdin"
   in order to get the input records from stdin)
-s strip extraneous characters from *around* the space delimited
   tokens, for the multikey options below (-m,-n,-f);
   Default <stripendchars> set is: '",`.(){}/[]!:;~|><+-
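
The effect of stripping characters from *around* a token (and not from inside it) can be sketched in Python as below. The names `DEFAULT_STRIP_CHARS` and `strip_token` are hypothetical; the character set is copied literally from the usage text above, on the assumption that the leading apostrophe is part of the set.

```python
# Default set as printed in the usage text, taken literally (assumption:
# the leading apostrophe belongs to the set rather than being a quote mark).
DEFAULT_STRIP_CHARS = "'\",`.(){}/[]!:;~|><+-"


def strip_token(token: str, strip_chars: str = DEFAULT_STRIP_CHARS) -> str:
    """Remove any of `strip_chars` from both ends of `token`.

    Characters inside the token are left untouched.
    """
    return token.strip(strip_chars)
```

With this sketch, `[ABC-1]` becomes `ABC-1`: the brackets are removed, while the inner hyphen stays because it is not at either end.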
-m ("multi-key" option) create hash entries pointing to
   the same record for all tokens found in the defline
-n <numkeys> same as -m, but only takes the first <numkeys>
   tokens from the defline; when used with -a option (see below),
   only collects the first <numkeys> accessions from each defline
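
The -m and -n behaviour described above might look roughly like this in Python. It is a sketch under assumptions: `multi_keys` is a hypothetical name, whitespace splitting stands in for "tokens found in the defline", and the interaction with -a is not modelled.

```python
from typing import Optional


def multi_keys(defline: str, numkeys: Optional[int] = None) -> list[str]:
    """Return the keys taken from a defline.

    numkeys=None mimics -m (all tokens); an integer mimics -n <numkeys>.
    """
    text = defline[1:] if defline.startswith(">") else defline
    tokens = text.split()
    if numkeys is not None:
        tokens = tokens[:numkeys]
    # Keep the first occurrence of each token and preserve order.
    return list(dict.fromkeys(tokens))


# Every key points to the same record, here represented by a byte offset.
index = {key: 0 for key in multi_keys(">seq1 human kinase", numkeys=2)}
```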
-f indexes *space* delimited tokens (fields) in the defline as given
   by LIST of fields or field ranges (the same syntax as UNIX 'cut')
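
For readers unfamiliar with the 'cut' LIST syntax, the following Python sketch expands a LIST such as `1,3-4` and picks the matching defline tokens. The function names are hypothetical, and support for open ranges (`N-`, `-M`) is assumed from 'cut' itself rather than confirmed for cdbfasta.

```python
def parse_field_list(spec: str, nfields: int) -> list[int]:
    """Expand a cut-style LIST into sorted 1-based field numbers."""
    selected = set()
    for part in spec.split(","):
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start = int(start_s) if start_s else 1       # "-M" means 1..M
            end = int(end_s) if end_s else nfields       # "N-" means N..last
        else:
            start = end = int(part)
        # Fields beyond the last token are silently ignored.
        selected.update(range(start, min(end, nfields) + 1))
    return sorted(n for n in selected if n >= 1)


def select_fields(defline: str, spec: str) -> list[str]:
    """Return the defline tokens chosen by a cut-style LIST."""
    text = defline[1:] if defline.startswith(">") else defline
    tokens = text.split()
    return [tokens[n - 1] for n in parse_field_list(spec, len(tokens))]
```

As with 'cut', the selected fields come back in defline order regardless of the order in which they appear in LIST.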
-w <stopwordslist> exclude from indexing all the words found
   in the file <stopwordslist> (for options -m, -n and -k)
-i do case insensitive indexing (i.e. create additional keys for
   all-lowercase tokens used for indexing from the defline)
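
The -w and -i entries above both act on the list of candidate keys, which the Python sketch below tries to make concrete. The helper name `filter_and_fold` is hypothetical. Two things are assumptions: that stop words are matched exactly (case-sensitively), and that the stop words have already been loaded into a set, since the layout of the <stopwordslist> file is not described here.

```python
def filter_and_fold(tokens, stopwords=frozenset(), case_insensitive=False):
    """Drop stop words and optionally add an all-lowercase copy of each key."""
    keys = []
    for tok in tokens:
        if tok in stopwords:          # assumption: exact, case-sensitive match
            continue
        keys.append(tok)
        if case_insensitive and tok.lower() != tok:
            # -i creates an *additional* key; the original key is kept.
            keys.append(tok.lower())
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(keys))
```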
-c for deflines in the format: db1|accession1|db2|accession2|...,
   only the first db-accession pair ('db1|accession1') is taken as key
-C like -c, but also subsequent db|accession constructs are indexed,
   along with the full (default) token; additionally, all nrdb
   concatenated accessions found in the defline are parsed and stored
   (assuming 0x01 or '^|^' as separators)
-a accession mode: like -C but indexes only the 'accession' part
   for all 'db|accession' constructs found, plus the default first tokens
-A like -a and -C together (both accessions and 'db|accession'
   constructs are used as keys)
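
The differences between -c, -C, -a and -A are easier to see on a concrete token. The Python sketch below is one reading of the descriptions above, not a verified reimplementation: `db_accession_keys` is a hypothetical name, the order of the returned keys is arbitrary, falling back to the whole token when no pair is found is an assumption, and the nrdb concatenation separators (0x01, '^|^') are not handled.

```python
def db_accession_keys(token: str, mode: str) -> list[str]:
    """Derive keys from a 'db1|accession1|db2|accession2|...' token.

    `mode` is one of 'c', 'C', 'a', 'A', mirroring the option letters.
    """
    fields = token.split("|")
    # Pair up fields as (db, accession); skip pairs with an empty half.
    pairs = [(fields[i], fields[i + 1])
             for i in range(0, len(fields) - 1, 2)
             if fields[i] and fields[i + 1]]
    joined = [f"{db}|{acc}" for db, acc in pairs]
    accessions = [acc for _, acc in pairs]

    if mode == "c":
        keys = joined[:1] or [token]   # assumption: fall back to the token
    elif mode == "C":
        keys = joined + [token]
    elif mode == "a":
        keys = accessions + [token]
    elif mode == "A":
        keys = joined + accessions + [token]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(keys))
```

For the token `gi|12345|gb|AAA123.1|`, mode 'c' yields only `gi|12345`, while mode 'a' yields `12345`, `AAA123.1` and the full token.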
-D index each pipe ('|') delimited token found in the record identifier
   (e.g. >key1|key2|key3|.. )
-d same as -D but using a custom key delimiter <kdelim> instead of the
   pipe character '|'
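
Splitting a record identifier on a key delimiter, as -D and -d describe, can be sketched in Python like this. `delimited_keys` is a hypothetical helper; dropping empty pieces (for example after a trailing '|') is an assumption.

```python
def delimited_keys(identifier: str, kdelim: str = "|") -> list[str]:
    """Split a record identifier into keys on `kdelim`.

    The default '|' corresponds to -D; any other value to -d <kdelim>.
    """
    ident = identifier[1:] if identifier.startswith(">") else identifier
    # Assumption: empty pieces, e.g. from a trailing delimiter, are not keys.
    return [key for key in ident.split(kdelim) if key]
```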
-G FASTA records are treated as large genomic sequences (e.g. full
   chromosomes/contigs) and their formatting is checked for suitability
   for fast range queries (i.e. uniform line length within each record)
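
Why uniform line length matters for range queries can be shown with a small Python sketch: when every sequence line except possibly the last has the same width, the position of any base in the file follows from simple arithmetic. Both function names are hypothetical, and the exact criterion that cdbfasta applies under -G is an assumption here.

```python
def has_uniform_line_length(seq_lines: list[str]) -> bool:
    """True if all lines but the last share one length and the last is not longer."""
    if len(seq_lines) <= 1:
        return True
    width = len(seq_lines[0])
    if any(len(line) != width for line in seq_lines[1:-1]):
        return False
    return len(seq_lines[-1]) <= width


def base_offset(pos0: int, line_width: int, newline_len: int = 1) -> int:
    """Offset of the 0-based base `pos0` from the start of the sequence text.

    Only valid when the record passes has_uniform_line_length().
    """
    full_lines = pos0 // line_width
    return full_lines * (line_width + newline_len) + pos0 % line_width
```

With a line width of 4, base 5 (0-based) sits on the second line, so its offset is one full line plus newline (5 bytes) plus 1.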
-v show program version and exit