COLLAPSESEQ.PY(1) User Commands COLLAPSESEQ.PY(1)

CollapseSeq.py - Removes duplicate sequences from FASTA/FASTQ files

usage: CollapseSeq.py [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                      [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                      [--outname OUT_NAME] [--log LOG_FILE] [--failed] [--fasta]
                      [--delim DELIMITER DELIMITER DELIMITER] [-n MAX_MISSING]
                      [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
                      [--cf COPY_FIELDS [COPY_FIELDS ...]]
                      [--act {min,max,sum,set} [{min,max,sum,set} ...]]
                      [--inner] [--keepmiss] [--maxf MAX_FIELD | --minf MIN_FIELD]

Removes duplicate sequences from FASTA/FASTQ files

--version
show program's version number and exit
-h, --help
show this help message and exit

-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to process. (default: None)
-o OUT_FILES [OUT_FILES ...]
Explicit output file name(s). Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s). (default: None)
--outdir OUT_DIR
Specify to change the output directory to the location specified. The input file directory is used if this is not specified. (default: None)
--outname OUT_NAME
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files. (default: None)
--log LOG_FILE
Specify to write verbose logging to a file. May not be specified with multiple input files. (default: None)
--failed
If specified create files containing records that fail processing. (default: False)
--fasta
Specify to force output as FASTA rather than FASTQ. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively. (default: ('|', '=', ','))
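
As an illustration of the three default delimiters, the following Python sketch splits a sequence header into its ID and annotation fields. The function name parse_header and the example header are hypothetical, and this is not the program's own parser; it only assumes that the first block is the sequence ID and that every later block has the form FIELD=VALUE.

```python
def parse_header(header, delimiter=("|", "=", ",")):
    """Split a header into (sequence ID, {field: [values]}).

    Assumption: the first annotation block is the sequence ID and every
    remaining block looks like FIELD=VALUE or FIELD=V1,V2.
    """
    block_sep, field_sep, value_sep = delimiter
    blocks = header.split(block_sep)
    seq_id, fields = blocks[0], {}
    for block in blocks[1:]:
        # partition() splits on the first field delimiter only.
        name, _, value = block.partition(field_sep)
        fields[name] = value.split(value_sep)
    return seq_id, fields


# Hypothetical header with two annotation fields.
seq_id, fields = parse_header("SEQ1|PRIMER=IGHM,IGHG|CONSCOUNT=3")
print(seq_id, fields)
```

Passing a different three-element tuple as delimiter mirrors what the --delim argument changes.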

-n MAX_MISSING
Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides. (default: 0)
--uf UNIQ_FIELDS [UNIQ_FIELDS ...]
Specifies a set of annotation fields that must match for sequences to be considered duplicates. (default: None)
--cf COPY_FIELDS [COPY_FIELDS ...]
Specifies a set of annotation fields to copy into the unique sequence output. (default: None)
--act {min,max,sum,set} [{min,max,sum,set} ...]
List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions "min", "max", "sum" perform the corresponding mathematical operation on numeric annotations. The action "set" collapses annotations into a comma delimited list of unique values. (default: None)
--inner
If specified, exclude consecutive missing characters at either end of the sequence. (default: False)
--keepmiss
If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file. (default: False)
--maxf MAX_FIELD
Specify the field whose maximum value determines the retained sequence; mutually exclusive with --minf. (default: None)
--minf MIN_FIELD
Specify the field whose minimum value determines the retained sequence; mutually exclusive with --maxf. (default: None)
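
The interaction of -n and --inner can be sketched in Python as follows. This is a simplified illustration rather than the program's implementation: the helper names are hypothetical, and only the character N is treated as missing, which is an assumption based on the description of the collapse-undetermined output.

```python
def count_missing(seq, inner=False, missing="N"):
    """Count missing characters; with inner=True, ignore terminal runs."""
    if inner:
        # Drop consecutive missing characters at both ends of the sequence.
        seq = seq.strip(missing)
    return sum(seq.count(char) for char in missing)


def is_undetermined(seq, max_missing=0, inner=False):
    """True if the sequence has more missing characters than allowed."""
    return count_missing(seq, inner=inner) > max_missing


read = "NNACGTNACGTN"
print(count_missing(read))              # counts every N in the read
print(count_missing(read, inner=True))  # counts only the internal N
```

With these assumptions, the example read would exceed a threshold of 1 unless the terminal N characters are excluded.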

collapse-unique
unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user-defined criteria.
collapse-duplicate
raw reads which are duplicates of the sequences retained in the collapse-unique file.
collapse-undetermined
raw reads which were excluded from consideration due to having too many N characters in the sequence.
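
How reads could end up in the three output categories is sketched below in Python. The sketch is deliberately simplified and is not the program's algorithm: it assumes duplicates are exact string matches, keeps the first record seen as the representative, counts only N as missing, and ignores --uf, --maxf and --minf. The function name collapse and the example reads are hypothetical.

```python
def collapse(records, max_missing=0):
    """Partition (id, sequence) pairs into unique, duplicate, undetermined.

    Assumptions: duplicates are exact string matches, the first record seen
    is retained as the representative, and only N counts as missing.
    """
    unique, duplicate, undetermined = {}, [], []
    for rec_id, seq in records:
        if seq.count("N") > max_missing:
            undetermined.append((rec_id, seq))
        elif seq in unique:
            # Another copy of a sequence that already has a representative.
            unique[seq]["DUPCOUNT"] += 1
            duplicate.append((rec_id, seq))
        else:
            unique[seq] = {"ID": rec_id, "DUPCOUNT": 1}
    return unique, duplicate, undetermined


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "ACNT"), ("r4", "TTGA")]
unique, duplicate, undetermined = collapse(reads)
print(unique)
print(duplicate)
print(undetermined)
```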

DUPCOUNT
total number of sequences within the set of duplicates for each retained unique sequence, i.e. the copy number of each unique sequence within the data file.
<user defined>
annotation fields specified by the --cf parameter.
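
The way a --act action might merge the values of one --cf field across a set of duplicates can be sketched in Python. The function name combine is hypothetical, and two details are assumptions rather than documented behavior: the "set" action keeps values in first-seen order, and whole-number results are printed without a decimal point.

```python
def combine(values, action):
    """Merge the values of one copy field across a set of duplicates."""
    if action == "set":
        # Unique values in first-seen order, joined with the value delimiter.
        return ",".join(dict.fromkeys(values))
    numbers = [float(v) for v in values]
    result = {"min": min, "max": max, "sum": sum}[action](numbers)
    # Render whole numbers without a trailing ".0".
    return str(int(result)) if result.is_integer() else str(result)


print(combine(["3", "1", "5"], "sum"))
print(combine(["IGHM", "IGHG", "IGHM"], "set"))
```

Each field listed with --cf would be paired with one such action from --act.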


This manpage was written by Andreas Tille for the Debian distribution and
can be used for any other usage of the program.

May 2020 CollapseSeq.py 0.6.0