unikmer - Toolkit for nucleic acid k-mer analysis
unikmer - Toolkit for k-mer with taxonomic information
unikmer is a toolkit for nucleic acid k-mer analysis, providing
functions including set operations on k-mers, optionally with TaxIds but
without count information.
K-mers are either encoded (k<=32) or hashed (arbitrary k) into
'uint64', and serialized in binary files with the extension '.unik'.
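To illustrate why k<=32 fits in a 64-bit integer, here is a minimal sketch of 2-bit k-mer encoding. The base-to-bits mapping (A=0, C=1, G=2, T=3) and the function names encode_kmer and decode_kmer are assumptions made for illustration; they are not taken from the unikmer source, and hashing of longer k-mers is not shown.

```python
# Assumed 2-bit code per base; chosen for illustration, not read from unikmer.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer (1 <= k <= 32) into an integer that fits in 64 bits."""
    k = len(kmer)
    if not 1 <= k <= 32:
        raise ValueError("k must be in [1, 32] to fit in 64 bits")
    code = 0
    for base in kmer.upper():
        # Shift previous bases left by 2 bits and append the new base.
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode_kmer(code: int, k: int) -> str:
    """Unpack an integer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[code & 3])  # lowest 2 bits = last base
        code >>= 2
    return "".join(reversed(bases))
```

With 2 bits per base, 32 bases use exactly 64 bits, which is why longer k-mers need to be hashed instead.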
TaxIds can be assigned when counting k-mers from genome sequences,
and LCA (Lowest Common Ancestor) is computed during set operations including
computing union, intersection, set difference, unique and repeated
k-mers.
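As a rough sketch of the LCA idea described above, the following computes the lowest common ancestor of two TaxIds from a child-to-parent map and applies it when the same k-mer appears in both inputs of a union. The toy taxonomy PARENT and the helper names lineage, lca and union_with_lca are hypothetical; unikmer's own implementation may differ.

```python
def lineage(taxid, parent):
    """Return the path from taxid up to the root (root is its own parent)."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of TaxIds a and b."""
    ancestors_of_a = set(lineage(a, parent))
    for node in lineage(b, parent):  # walk upward from b
        if node in ancestors_of_a:
            return node
    raise ValueError("no common ancestor")


def union_with_lca(set_a, set_b, parent):
    """Union of two {kmer: taxid} maps; shared k-mers get the LCA."""
    merged = dict(set_a)
    for kmer, taxid in set_b.items():
        if kmer in merged:
            merged[kmer] = lca(merged[kmer], taxid, parent)
        else:
            merged[kmer] = taxid
    return merged


# Hypothetical toy taxonomy: node 1 is the root, as in NCBI nodes.dmp.
PARENT = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}
```

For example, a k-mer tagged with TaxId 3 in one file and TaxId 4 in another would be assigned TaxId 2 in this toy taxonomy.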
Version: v0.19.0
Author: Wei Shen <shenwei356@gmail.com>
Documents  : https://bioinf.shenwei.me/unikmer
Source code: https://github.com/shenwei356/unikmer
Dataset (optional):
- Manipulating k-mers with TaxIds needs taxonomy files from e.g., the NCBI
Taxonomy database. Please extract "nodes.dmp", "names.dmp",
"delnodes.dmp" and "merged.dmp" from
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
into ~/.unikmer/ or some other directory, which you can later refer to
using the flag --data-dir or the environment variable UNIKMER_DB.
- For GTDB, use 'taxonkit create-taxdump' to create NCBI-style taxonomy dump
files, or download from:
- https://github.com/shenwei356/gtdb-taxonomy
- Note that TaxIds are represented using uint32 and stored in 4 or fewer
bytes; all TaxIds should be in the range [1, 4294967295].
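The "4 or fewer bytes" note can be made concrete with a small calculation: the number of bytes needed depends on the largest TaxId that must be representable. This is a sketch of the arithmetic only; the function name taxid_byte_width is hypothetical and the actual on-disk layout used by unikmer is not shown here.

```python
def taxid_byte_width(max_taxid: int) -> int:
    """Smallest number of bytes that can hold every TaxId up to max_taxid."""
    if not 1 <= max_taxid <= 0xFFFFFFFF:
        raise ValueError("TaxIds must be in the range [1, 4294967295]")
    # Round the bit length up to a whole number of bytes.
    return (max_taxid.bit_length() + 7) // 8
```

Under this arithmetic, a maximum TaxId below 16777216 (1<<24) would need only 3 bytes per TaxId, while the full uint32 range needs 4.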
- autocompletion  Generate shell autocompletion script (bash|zsh|fish|powershell)
- common          Find k-mers shared by most of multiple binary files
- concat          Concatenate multiple binary files without removing duplicates
- count           Generate k-mers (sketch) from FASTA/Q sequences
- decode          Decode encoded integer to k-mer text
- diff            Set difference of multiple binary files
- dump            Convert plain k-mer text to binary format
- encode          Encode plain k-mer text to integer
- filter          Filter out low-complexity k-mers (experimental)
- grep            Search k-mers from binary files
- head            Extract the first N k-mers
- info            Information of binary files
- inter           Intersection of multiple binary files
- locate          Locate k-mers in genome
- merge           Merge k-mers from sorted chunk files
- num             Quickly inspect number of k-mers in binary files
- rfilter         Filter k-mers by taxonomic rank
- sample          Sample k-mers from binary files
- sort            Sort k-mers in binary files to reduce file size
- split           Split k-mers into sorted chunk files
- tsplit          Split k-mers according to taxid
- union           Union of multiple binary files
- uniqs           Mapping k-mers back to genome and find unique subsequences
- version         Print version information and check for update
- view            Read and output binary format to plain text
- -c, --compact
  write compact binary file with little loss of speed
- --compression-level int
  compression level (default -1)
- --data-dir string
  directory containing NCBI Taxonomy files, including nodes.dmp, names.dmp,
  merged.dmp and delnodes.dmp (default "/home/nilesh/.unikmer")
- -h, --help
  help for unikmer
- -I, --ignore-taxid
  ignore taxonomy information
- -i, --infile-list string
  file of input files list (one file per line); if given, they are appended
  to files from CLI arguments
- --max-taxid uint32
  for smaller TaxIds, less space can be used to store them. The default
  value is 1<<32-1, which is enough for NCBI Taxonomy TaxIds (default
  4294967295)
- -C, --no-compress
  do not compress binary file (not recommended)
- --nocheck-file
  do not check binary file, when using process substitution or a named pipe
- -j, --threads int
  number of CPUs to use (default 4)
- --verbose
  print verbose information
Use "unikmer [command] --help" for more
information about a command.