sumaclust - star clustering of genetic sequences
sumaclust [options] <dataset>
With the development of next-generation sequencing, efficient
tools are needed to handle millions of sequences in reasonable amounts of
time. Sumaclust is a program developed by the LECA. Sumaclust aims to
cluster sequences in a way that is fast and exact at the same time. This
tool has been developed to be adapted to the type of data generated by DNA
metabarcoding, i.e. entirely sequenced, short markers. Sumaclust clusters
sequences using the same clustering algorithm as UCLUST and CD-HIT. This
algorithm is mainly useful to detect the 'erroneous' sequences created
during amplification and sequencing protocols, deriving from 'true'
sequences.
- -h
- Help: print this help.
- -l
- Reference sequence length is the shortest.
- -L
- Reference sequence length is the largest.
- -a
- Reference sequence length is the alignment length (default).
- -n
- Score is normalized by reference sequence length (default).
- -r
- Raw score, not normalized.
- -d
- Score is expressed in distance (default: score is expressed in
similarity).
- -t ##.##
- Score threshold for clustering. If the score is normalized and
expressed in similarity (default), it is an identity, e.g. 0.95 for an
identity of 95%. If the score is normalized and expressed in distance, it
is (1.0 - identity), e.g. 0.05 for an identity of 95%. If the score is not
normalized and expressed in similarity, it is the length of the Longest
Common Subsequence (LCS). If the score is not normalized and expressed in
distance, it is (reference length - LCS length). Only sequences with a
similarity above ##.## to the center sequence of a cluster are assigned to
that cluster. Default: 0.97.
- -e
- Exact option: A sequence is assigned to the cluster with the center
sequence presenting the highest similarity score > threshold, as
opposed to the default 'fast' option where a sequence is assigned to the
first cluster found with a center sequence presenting a score >
threshold.
- -R ##
- Maximum ratio between the counts of two sequences so that the less
abundant one can be considered as a variant of the more abundant one.
Default: 1.0.
- -p ##
- Multithreading with ## threads using openMP.
- -s ####
- Sorting by ####. Must be 'None' for no sorting, or a key present in the
FASTA header of each sequence; the count is the exception, since it can be
computed (default: sorting by count).
- -o
- Sorting is in ascending order (default : descending).
- -g
- n's are replaced with a's (default: sequences with n's are
discarded).
- -B ###
- Output of the OTU table in BIOM format is activated, and written to file
###.
- -O ###
- Output of the OTU map (observation map) is activated, and written to file
###.
- -F ###
- Output in FASTA format is written to file ### instead of standard
output.
- -f
- Output in FASTA format is deactivated.
Argument: the nucleotide dataset to cluster.
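The four readings of the -t threshold, the difference between the default
'fast' assignment and -e, and the -R abundance ratio can be illustrated with
a small Python sketch. This is not sumaclust's code: the function names
(lcs_length, score, assign_cluster, passes_ratio) are hypothetical, the LCS
is computed by a plain dynamic program, the alignment-length reference (-a)
is not modelled, the ratio is assumed to be (lower count / higher count)
compared with "<=", and sequences and counts are assumed to be non-empty
and positive.

```python
from typing import Optional, Sequence


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (simple dynamic program)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score(a: str, b: str, reference: str = "largest",
          normalized: bool = True, distance: bool = False) -> float:
    """Score two sequences following the -l/-L, -n/-r and -d descriptions."""
    lcs = lcs_length(a, b)
    if reference == "shortest":        # as described for -l
        ref_len = min(len(a), len(b))
    elif reference == "largest":       # as described for -L
        ref_len = max(len(a), len(b))
    else:
        raise ValueError("alignment-length reference (-a) is not modelled here")
    if normalized:                     # as described for -n
        similarity = lcs / ref_len
        return 1.0 - similarity if distance else similarity
    # Raw score, as described for -r: LCS length, or (reference length - LCS).
    return ref_len - lcs if distance else lcs


def assign_cluster(similarities: Sequence[float], threshold: float = 0.97,
                   exact: bool = False) -> Optional[int]:
    """Pick a cluster index from similarities to each cluster center.

    Fast mode returns the first center above the threshold; exact mode
    returns the center with the highest similarity above the threshold.
    None means that no center qualifies.
    """
    best = None
    for i, s in enumerate(similarities):
        if s > threshold:
            if not exact:
                return i
            if best is None or s > similarities[best]:
                best = i
    return best


def passes_ratio(count_a: int, count_b: int, max_ratio: float = 1.0) -> bool:
    """Assumed reading of -R: lower count / higher count must not exceed it."""
    low, high = sorted((count_a, count_b))
    return low / high <= max_ratio
```

For instance, "ACGTACGT" and "ACGTTCGT" share an LCS of length 7, so with
the largest length as reference the normalized similarity is 0.875, the
normalized distance 0.125, the raw similarity 7 and the raw distance 1.
With center similarities [0.98, 0.99, 0.5] and the default threshold, the
fast rule picks the first center and the exact rule the second.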
http://metabarcoding.org/sumatra