dnaindex - index dna file for use with ANFO
dnaindex builds an index for a dna file. Dna files must be
indexed to be usable with anfo(1); it is possible to have multiple
indices for the same dna file.
- -V, --version
- Print version number and exit.
- -o file, --output file
- Write output to file. file customarily ends in .idx.
Default is genomename_wordsize.idx.
- -g file, --genome file
- Read the genome from file. This file name is also stored in the
resulting index so it can be found automatically whenever the index is
used. It is therefore best if file is just a file name without a
path.
- -G dir, --genome-dir dir
- Add dir to the genome search path. This is useful if the genome to
be indexed is not yet in the place where it will later be used.
- -d text, --description text
- Add text as description to the index. This is purely informative.
- -s size, --wordsize size
- Set the wordsize to size. A smaller wordsize increases precision at
the expense of higher computational investment. The default is 12, which
with a stride of 8 yields a good compromise.
- -S num, --stride num
- Set the stride to num. Only one out of num possible words of
dna is actually indexed. A smaller stride increases precision at the
expense of a bigger index. The default is 8, which in conjunction with a
wordsize of 12 yields a good compromise.
- -l lim, --limit lim
- Prevents the indexing of words that occur more often than lim
times. This can be used to ignore repetitive seeds and save the space to
store them. A good default depends on the size of the genome being
indexed; something like 500 works for the human genome with wordsize 12
and stride 8.
- -h, --histogram
- Produce a histogram of word frequencies. This can be used to get an idea
of the frequency distribution in order to select an appropriate value for
--limit.
- -v, --verbose
- Print a progress indicator during operation.
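The interplay of --wordsize, --stride, --limit and --histogram can be illustrated with a small Python sketch. This is not the actual dnaindex implementation or its index format; the function names build_index and frequency_histogram are hypothetical, and the choice of indexing start positions 0, stride, 2*stride, ... is an assumption made for illustration.

```python
from collections import Counter, defaultdict


def build_index(genome, wordsize=12, stride=8, limit=None):
    """Map each indexed word to the list of positions where it starts."""
    index = defaultdict(list)
    # Only one out of `stride` possible start positions is indexed.
    for pos in range(0, len(genome) - wordsize + 1, stride):
        index[genome[pos:pos + wordsize]].append(pos)
    if limit is not None:
        # Drop words that occur more often than `limit` times.
        return {w: p for w, p in index.items() if len(p) <= limit}
    return dict(index)


def frequency_histogram(index):
    """Count how many words occur once, twice, and so on."""
    return Counter(len(positions) for positions in index.values())
```

Inspecting the histogram of an unlimited index shows how many words would be discarded for a given limit, which is the kind of information --histogram is meant to provide.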
dnaindex is limited to genomes no longer than 4 gigabases
due to its use of 32-bit indices. The index is quite large, so depending on
parameters, a 64-bit platform is needed for genomes in the gigabase
range.
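The arithmetic behind the 4 gigabase limit can be checked with a short Python sketch. The size estimate assumes one 4-byte entry per indexed position and nothing else, which is a simplification and not a description of the real index layout; both function names are hypothetical.

```python
# An unsigned 32-bit integer can hold 2**32 = 4294967296 distinct values.
MAX_POSITIONS = 2 ** 32


def fits_in_32_bits(genome_length):
    """True if every position of the genome is addressable with 32 bits."""
    return genome_length <= MAX_POSITIONS


def estimated_position_bytes(genome_length, stride=8, bytes_per_position=4):
    """Rough lower bound for the position list alone.

    Assumption: one entry of `bytes_per_position` bytes per indexed
    position, with no hash table or other overhead.
    """
    return (genome_length // stride) * bytes_per_position
```

Under these assumptions a genome of 3.2 gigabases indexed with stride 8 needs about 1.6 GB for positions alone, which suggests why a 32-bit address space becomes tight.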
If a genome contains IUPAC ambiguity codes, the affected seeds
need to be expanded. If there are many ambiguity codes in a small region,
this results in an unacceptably large index.
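Why ambiguity codes inflate the index can be seen from a minimal Python sketch of seed expansion. It uses the standard IUPAC nucleotide codes; the function name expand_seed is hypothetical, and how dnaindex itself performs the expansion is not stated here.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_seed(seed):
    """Return every unambiguous word that the seed could stand for."""
    choices = (IUPAC[base] for base in seed.upper())
    return ["".join(word) for word in product(*choices)]
```

The number of expanded words is the product of the choices per position, so a seed with six N codes already expands to 4096 words.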
- ANFO_PATH
- Colon-separated list of directories searched for genome files.
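A lookup along a colon-separated search path could work roughly as in the following Python sketch. The function find_genome is hypothetical, and the search order (directories given explicitly first, then ANFO_PATH) is an assumption; the actual order used by ANFO is not documented here.

```python
import os


def find_genome(name, extra_dirs=(), env=None):
    """Return the first existing path for `name`, or None if not found."""
    env = os.environ if env is None else env
    # Assumed order: explicitly given directories, then ANFO_PATH entries.
    dirs = list(extra_dirs)
    dirs += [d for d in env.get("ANFO_PATH", "").split(":") if d]
    for directory in dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Because only the stored file name is joined to each directory, this kind of lookup works best when the index records a bare file name without a path.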
/etc/popt
The system-wide configuration file for popt(3).
dnaindex identifies itself as "dnaindex" to popt.
~/.popt
Per-user configuration file for popt(3).
Udo Stenzel <udo_stenzel@eva.mpg.de>