meryl - in- and out-of-core kmer counting and utilities
meryl -P -m kmersize [-c #] [-p] -s seq.fasta
meryl -P -m kmersize [-c #] [-p] -n mercount
meryl -B -m kmersize [-c #] [-p] [-v] [-f|-r|-C] [-L minoccurrence] [-U maxoccurrence]
      [-threads n | {-segments segments | -memory megabytes} [-configbatch [-sge jobname]]]
      -s seq.fasta -o tblprefix
meryl -countbatch number [-sgebuild "qsuboptionstring"] -o tblprefix
meryl -mergebatch number [-sgemerge "qsuboptionstring"] -o tblprefix
meryl -M operation [-v] -s tblprefix [-s tblprefix2 ...] -o output
meryl -Dh -s tblprefix
meryl -Dt -n mincount -s tblprefix
meryl computes the kmer content of genomic sequences. Kmer
content is represented as a list of kmers and the number of times each
occurs in the input sequences. Counting can be restricted to the forward
kmer only, the reverse kmer only, or the canonical kmer (the
lexicographically smaller of the forward and reverse kmer at each
location). meryl can report the histogram of counts, report the list of
kmers and their counts, or perform mathematical and set operations on
the processed data files.
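The three counting modes described above can be illustrated with a short Python sketch. This is not meryl's implementation. The function names are hypothetical, "reverse kmer" is assumed to mean the reverse complement, the input is assumed to be uppercase, and windows containing anything other than A, C, G or T are simply skipped.

```python
from collections import Counter

# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def count_kmers(sequence, k, mode="canonical"):
    """Count kmers in one sequence.

    mode is "forward", "reverse" or "canonical" (hypothetical names that
    mirror the -f, -r and -C build options).
    """
    if mode not in ("forward", "reverse", "canonical"):
        raise ValueError("unknown mode: " + mode)
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        forward = sequence[start:start + k]
        # Assumption: skip windows that contain non-ACGT characters.
        if any(base not in "ACGT" for base in forward):
            continue
        reverse = reverse_complement(forward)
        if mode == "forward":
            counts[forward] += 1
        elif mode == "reverse":
            counts[reverse] += 1
        else:
            # Canonical: the lexicographically smaller of the two.
            counts[min(forward, reverse)] += 1
    return counts
```

For the sequence ACGTT with k = 2, the forward kmers AC and GT are reverse complements of each other, so canonical counting folds them into a single entry, AC, with a count of 2.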
The output of meryl is a pair of binary files (.mcdat and .mcidx), together
called a meryl database, which can be quickly dumped to provide a histogram of
counts or the actual counts. A C++ library is supplied for direct access to the
files.
- -P
- Estimate memory requirements. Given a sequence file (-s) or an
upper limit on the number of mers in the file (-n), compute the
table size (-t in build) that minimizes memory usage. This mode
recognizes the following options:
- -m #
- size of a mer (required)
- -c #
- homopolymer compression (optional)
- -p
- enable positions
- -s seq.fasta
- Sequence file to be scanned to determine the number of mers
- -n #
- Compute parameters assuming a file with this many mers in it
Only one of -s and -n needs to be specified. If both are
given, -s takes priority.
- -B
- Compute the mer-count tables for a sequence file (-s), using the
parameters below. By default, both strands are processed.
- -f
- only build for the forward strand
- -r
- only build for the reverse strand
- -C
- use canonical mers (assumes both strands)
- -L #
- DON'T save mers that occur fewer than # times
- -U #
- DON'T save mers that occur more than # times
- -m #
- size of a mer (required)
- -c #
- homopolymer compression (optional)
- -p
- enable positions
- -s seq.fasta
- sequence to build the table for
- -o tblprefix
- output table prefix
- -v
- entertain the user
The meryl process can run in one large memory batch, in
many small memory batches, or under SGE control, all with or without using
multiple CPU cores. By default, the computation is done as one large
sequential process. Multi-threaded operation is possible, at additional
memory expense, as is segmented operation, at additional I/O expense.
- Threaded operation
- Split the counting into n almost equally sized pieces (-threads n). This
uses an extra h MB (from -P) per thread.
- Segmented, sequential operation
- Split the counting into pieces that each fit in no more than m MB of
memory (-memory m), or into n equal-sized pieces (-segments n). Each piece
is computed sequentially, and the results are merged at the end. Only one
of -memory and -segments is needed.
- Segmented, batched operation
- Same as segmented, sequential operation, except that each segment can be
executed manually, in parallel (-configbatch, -countbatch, -mergebatch).
Only one of -memory and -segments is needed. Also see the EXAMPLE section
on this page.
- -M
- Given a list of tables, perform a math, logical, or threshold operation.
Unless otherwise specified, an operation takes any number of databases. Math
operations are:
- min
- count is the minimum count for all databases. If the mer does NOT exist in
all databases, the mer has a zero count, and is NOT in the output.
- minexist
- count is the minimum count for all databases that contain the mer
- max
- count is the maximum count for all databases
- add
- count is the sum of the counts for all databases
- sub
- count is the first minus the second (binary only)
- abs
- count is the absolute value of the first minus the second (binary
only)
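The math operations above can be sketched in Python by modelling each database as a dict that maps a mer to its count. This is an illustration only, not the code meryl runs. The function name merge_counts is hypothetical, and dropping results whose count is zero or negative (which can happen with sub) is an assumption that generalizes the rule stated for min.

```python
def merge_counts(operation, databases):
    """Combine a list of {mer: count} dicts with one math operation."""
    if operation in ("sub", "abs") and len(databases) != 2:
        raise ValueError(operation + " needs exactly two databases")
    result = {}
    # Visit every mer that appears in at least one database.
    for mer in set().union(*databases):
        present = [db[mer] for db in databases if mer in db]
        if operation == "min":
            # A mer missing from any database has a zero count.
            count = min(present) if len(present) == len(databases) else 0
        elif operation == "minexist":
            count = min(present)
        elif operation == "max":
            count = max(present)
        elif operation == "add":
            count = sum(present)
        elif operation == "sub":
            count = databases[0].get(mer, 0) - databases[1].get(mer, 0)
        elif operation == "abs":
            count = abs(databases[0].get(mer, 0) - databases[1].get(mer, 0))
        else:
            raise ValueError("unknown operation: " + operation)
        # Assumption: mers with a non-positive count are not output.
        if count > 0:
            result[mer] = count
    return result
```

The difference between min and minexist shows up for a mer that is missing from one database: min drops it, while minexist keeps it with the smallest count among the databases that do contain it.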
Logical operations are:
- and
- outputs mer iff it exists in all databases
- nand
- outputs mer iff it exists in at least one, but not all, databases
- or
- outputs mer iff it exists in at least one database
- xor
- outputs mer iff it exists in an odd number of databases
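The logical operations depend only on how many databases contain a mer, which the following Python sketch makes explicit. The function name logical_mers is hypothetical. Each database is modelled as any container of mers, and only the set of output mers is returned, because the count attached to an output mer is not described here.

```python
def logical_mers(operation, databases):
    """Return the set of mers selected by one logical operation."""
    total = len(databases)
    selected = set()
    for mer in set().union(*databases):
        # Number of databases that contain this mer (always >= 1 here).
        hits = sum(1 for db in databases if mer in db)
        if operation == "and":
            keep = hits == total
        elif operation == "nand":
            keep = 1 <= hits < total
        elif operation == "or":
            keep = hits >= 1
        elif operation == "xor":
            keep = hits % 2 == 1
        else:
            raise ValueError("unknown operation: " + operation)
        if keep:
            selected.add(mer)
    return selected
```

With three databases, a mer present in all three is output by both and and xor, since three is an odd number. A mer present in exactly two is output by nand and or only.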
Threshold operations are:
- lessthan x
- outputs mer iff it has count < x
- lessthanorequal x
- outputs mer iff it has count <= x
- greaterthan x
- outputs mer iff it has count > x
- greaterthanorequal x
- outputs mer iff it has count >= x
- equal x
- outputs mer iff it has count == x
Threshold operations work on exactly one database.
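A threshold operation is a filter over a single database. A minimal Python sketch, using a hypothetical function name and a dict of mer counts in place of a real meryl database:

```python
import operator

# Map each threshold operation name to its comparison.
_COMPARISONS = {
    "lessthan": operator.lt,
    "lessthanorequal": operator.le,
    "greaterthan": operator.gt,
    "greaterthanorequal": operator.ge,
    "equal": operator.eq,
}


def threshold(operation, x, database):
    """Keep the mers of one {mer: count} dict whose count passes the test."""
    compare = _COMPARISONS[operation]
    return {mer: count for mer, count in database.items() if compare(count, x)}
```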
- -s tblprefix
- use tblprefix as a database
- -o tblprefix
- create this output
- -v
- entertain the user
- -D
- Dump table (not all of these work)
- -Dd
- Dump a histogram of the distance between the same mers.
- -Dt
- Dump mers with a count >= a threshold. Use -n to specify the threshold.
- -Dc
- Count the number of mers, distinct mers and unique mers.
- -Dh
- Dump (to stdout) a histogram of mer counts.
- -s
- Read the count table from here (leave off the .mcdat or .mcidx).
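What the -Dh, -Dc and -Dt dumps compute can be sketched in Python over a dict of mer counts. The function names are hypothetical, and the text format that meryl actually prints is not reproduced. The sketch also assumes that "distinct" means the number of different mers and "unique" means the number of mers that occur exactly once.

```python
from collections import Counter


def count_histogram(database):
    """Return sorted (count, number of mers with that count) pairs."""
    return sorted(Counter(database.values()).items())


def summarize(database):
    """Return (total mers, distinct mers, unique mers)."""
    total = sum(database.values())
    distinct = len(database)
    # Assumption: a unique mer is one that occurs exactly once.
    unique = sum(1 for count in database.values() if count == 1)
    return total, distinct, unique


def dump_at_least(database, mincount):
    """Return the mers whose count is >= mincount."""
    return {mer: count for mer, count in database.items() if count >= mincount}
```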
Initialize the computation with -configbatch, which needs all
the build options. Run every -countbatch job, then run
-mergebatch to complete.
meryl -configbatch -B [options] -o file
meryl -countbatch 0 -o file
meryl -countbatch 1 -o file
...
meryl -countbatch N -o file
meryl -mergebatch N -o file
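The command sequence above can be generated by a small Python helper, for example to hand the count jobs to a scheduler. This is a sketch: the function and its parameters are hypothetical, it only builds argument lists and runs nothing, and it mirrors the listing literally (count batches 0 through N, then a merge with N). How meryl reports the value of N is not covered here.

```python
def batch_commands(build_options, prefix, last_batch):
    """Build the argument lists for a segmented, batched run.

    build_options: list of build option strings, e.g. ["-m", "20", ...]
    prefix:        output table prefix passed to -o
    last_batch:    the N of the listing above
    """
    commands = [["meryl", "-configbatch", "-B", *build_options, "-o", prefix]]
    # The count jobs are independent of each other and may run in parallel.
    for batch in range(last_batch + 1):
        commands.append(["meryl", "-countbatch", str(batch), "-o", prefix])
    # The merge must run only after every count job has finished.
    commands.append(["meryl", "-mergebatch", str(last_batch), "-o", prefix])
    return commands
```

Each returned list can be passed unchanged to subprocess.run, or joined with spaces to produce a job script.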