NAME

cdbfasta - Creates an index file for records from a multi-fasta file.

DESCRIPTION

Creates an index file for records from a multi-fasta file. By default (without -m/-n/-c/-C option), only the first space-delimited token from the defline is used as a key.

<fastafile> is the multi-fasta file to index.

-o the index file will be named <index_file>; if not given, the index filename is the database name plus the suffix '.cidx'
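
The two defaults described above can be sketched in Python as follows. This is only an illustration, not cdbfasta's own code: the function names `default_key` and `default_index_name` are hypothetical, and the sketch assumes the usual '>' record delimiter and treats any whitespace as the token separator.

```python
def default_key(defline: str) -> str:
    """Return the first whitespace-delimited token of a defline.

    Assumption: a leading '>' is the record delimiter and is not part of the key.
    """
    text = defline[1:] if defline.startswith(">") else defline
    tokens = text.split()
    return tokens[0] if tokens else ""


def default_index_name(fastafile: str) -> str:
    """Default index filename: the database name plus the suffix '.cidx'."""
    return fastafile + ".cidx"
```

For example, `default_key(">seq1 Homo sapiens")` gives `"seq1"`, and `default_index_name("db.fa")` gives `"db.fa.cidx"`.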

-r <record_delimiter> a string of characters at the beginning of line

-Q treat input as fastq format, i.e. with '@' as record delimiter

-z database is compressed into the file <compressed_db> before indexing (<fastafile> can be "-" or "stdin" in order to get the input records from stdin)

-s strip extraneous characters from *around* the space delimited tokens, for the multikey options below (-m,-n,-f); Default <stripendchars> set is: '",`.(){}/[]!:;~|><+-
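
A minimal Python sketch of what stripping characters from around a token means for the -s option. It is not taken from the cdbfasta source. The names `DEFAULT_STRIP_CHARS`, `strip_token` and `stripped_tokens` are hypothetical, the default set is copied literally from the text above (assuming the leading apostrophe is a member of the set), and dropping tokens that become empty is an assumption.

```python
# Default set as printed in the usage text, taken literally.
# Assumption: the leading apostrophe belongs to the set.
DEFAULT_STRIP_CHARS = "'\",`.(){}/[]!:;~|><+-"


def strip_token(token: str, chars: str = DEFAULT_STRIP_CHARS) -> str:
    """Remove the given characters from both ends of a token only."""
    return token.strip(chars)


def stripped_tokens(defline: str, chars: str = DEFAULT_STRIP_CHARS) -> list:
    """Split a defline on whitespace and strip each token.

    Assumption: tokens that become empty after stripping are dropped.
    """
    stripped = (strip_token(tok, chars) for tok in defline.split())
    return [tok for tok in stripped if tok]
```

Characters inside a token are left alone: `strip_token("(BRCA1),")` gives `"BRCA1"`, while `strip_token("alpha-beta")` is unchanged.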

-m ("multi-key" option) create hash entries pointing to

-n <numkeys> same as -m, but only takes the first <numkeys> tokens from the defline; when used with -a option (see below), only collects the first <numkeys> accessions from each defline
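
As a rough illustration of the -n behaviour, the Python sketch below builds an in-memory index from the first <numkeys> tokens of each defline. The function name `index_first_n` is hypothetical, record numbers stand in for whatever the real index stores, and keeping every record under a repeated key is an assumption.

```python
def index_first_n(deflines, numkeys):
    """Map each of the first `numkeys` tokens of every defline to its record number.

    Assumptions: a leading '>' is the record delimiter, tokens are
    whitespace-delimited, and a key seen in several records keeps all of them.
    """
    index = {}
    for recno, defline in enumerate(deflines):
        text = defline[1:] if defline.startswith(">") else defline
        for key in text.split()[:numkeys]:
            index.setdefault(key, []).append(recno)
    return index
```

With `numkeys` set to 2, the defline `>s1 geneA human` contributes the keys `s1` and `geneA`, and `human` is ignored.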

-f indexes *space* delimited tokens (fields) in the defline as given

-w <stopwordslist> exclude from indexing all the words found

-i do case insensitive indexing (i.e. create additional keys for

-c for deflines in the format: db1|accession1|db2|accession2|...,

-C like -c, but also subsequent db|accession constructs are indexed, along with the full (default) token; additionally, all nrdb concatenated accessions found in the defline are parsed and stored (assuming 0x01 or '^|^' as separators)
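
The key set described for -C can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the program's parser: the function name `db_accession_keys` is hypothetical, fields are paired strictly left to right, and the nrdb separators (0x01 or '^|^') are not handled.

```python
def db_accession_keys(token: str) -> list:
    """Return the full token plus every 'db|accession' pair found in it.

    Assumptions: fields are paired left to right, pairs with an empty
    half are skipped, and duplicates of an existing key are not repeated.
    """
    fields = token.split("|")
    keys = [token]  # the full (default) token is always a key
    for i in range(0, len(fields) - 1, 2):
        db, accession = fields[i], fields[i + 1]
        pair = f"{db}|{accession}"
        if db and accession and pair not in keys:
            keys.append(pair)
    return keys
```

For the token `gi|12345|ref|NP_000001.1|` this yields the full token followed by `gi|12345` and `ref|NP_000001.1`.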

-a accession mode: like -C but indexes only the 'accession' part for all

-A like -a and -C together (both accessions and 'db|accession'

-D index each pipe ('|') delimited token found in the record identifier

-d same as -D but using a custom key delimiter <kdelim> instead of the pipe
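
A short Python sketch of the -D and -d idea, splitting a record identifier into keys on a delimiter. The function name `delimited_keys` is hypothetical, and skipping empty tokens is an assumption.

```python
def delimited_keys(identifier: str, kdelim: str = "|") -> list:
    """Split a record identifier into keys on `kdelim` (pipe by default).

    Assumption: empty tokens, e.g. from a trailing delimiter, are skipped.
    """
    return [tok for tok in identifier.split(kdelim) if tok]
```

Here `delimited_keys("key1|key2|key3")` mirrors -D, and `delimited_keys("a:b:c", ":")` mirrors -d with ':' as the custom delimiter.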

-G FASTA records are treated as large genomic sequences (e.g. full chromosomes/contigs) and their formatting is checked for suitability for fast range queries (i.e. uniform line length within each record)
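
Why uniform line length matters for range queries can be sketched in Python as follows. With a fixed line width, the position of any base in the file follows from simple arithmetic, so no scanning is needed. Both function names are hypothetical, the last line of a record is assumed to be allowed to be shorter, and the code is an illustration rather than the check cdbfasta actually performs.

```python
def has_uniform_lines(record_lines) -> bool:
    """Check that all sequence lines of one record share the same length.

    Assumption: the last sequence line may be shorter than the others.
    """
    seq = [ln.rstrip("\r\n") for ln in record_lines if not ln.startswith(">")]
    if len(seq) <= 1:
        return True
    width = len(seq[0])
    return all(len(ln) == width for ln in seq[:-1]) and len(seq[-1]) <= width


def base_offset(pos: int, width: int, eol_len: int = 1) -> int:
    """Offset of 0-based base `pos`, counted from the start of the sequence data.

    Each full line occupies `width` bases plus `eol_len` end-of-line bytes.
    """
    return (pos // width) * (width + eol_len) + pos % width
```

With 4 bases per line and a one-byte line ending, base 5 sits on the second line at column 1, so `base_offset(5, 4)` is 6.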

-v show program version and exit

September 2022

cdbfasta version 1.00