CollapseSeq.py - removes duplicate sequences from FASTA/FASTQ files
usage: CollapseSeq.py [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR] [--outname OUT_NAME]
[--log LOG_FILE] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] [-n MAX_MISSING]
[--uf UNIQ_FIELDS [UNIQ_FIELDS ...]] [--cf COPY_FIELDS [COPY_FIELDS ...]]
[--act {min,max,sum,set} [{min,max,sum,set} ...]] [--inner] [--keepmiss]
[--maxf MAX_FIELD | --minf MIN_FIELD]
Removes duplicate sequences from FASTA/FASTQ files
- -s SEQ_FILES [SEQ_FILES ...]
- A list of FASTA/FASTQ files containing sequences to process. (default:
None)
- -o OUT_FILES [OUT_FILES ...]
- Explicit output file name(s). Note, this argument cannot be used with the
--failed, --outdir, or --outname arguments. If
unspecified, then the output filename will be based on the input
filename(s). (default: None)
- --outdir OUT_DIR
- Changes the output directory to the location specified. The
input file directory is used if this is not specified. (default:
None)
- --outname OUT_NAME
- Changes the prefix of the successfully processed output file to the string
specified. May not be specified with multiple input files. (default:
None)
- --log LOG_FILE
- Specify to write verbose logging to a file. May not be specified with
multiple input files. (default: None)
- --failed
- If specified, create files containing records that fail processing.
(default: False)
- --fasta
- Specify to force output as FASTA rather than FASTQ. (default: None)
- --delim DELIMITER DELIMITER DELIMITER
- A list of the three delimiters that separate annotation blocks, field
names and values, and values within a field, respectively. (default: ('|',
'=', ','))
- -n MAX_MISSING
- Maximum number of missing nucleotides to consider for collapsing
sequences. A sequence will be considered undetermined if it contains too
many missing nucleotides. (default: 0)
- --uf UNIQ_FIELDS [UNIQ_FIELDS ...]
- Specifies a set of annotation fields that must match for sequences to be
considered duplicates. (default: None)
- --cf COPY_FIELDS [COPY_FIELDS ...]
- Specifies a set of annotation fields to copy into the unique sequence
output. (default: None)
- --act {min,max,sum,set} [{min,max,sum,set} ...]
- List of actions to take for each copy field, defining how each
annotation will be combined into a single value. The actions
"min", "max", and "sum" perform the corresponding mathematical
operation on numeric annotations. The action "set" collapses
annotations into a comma-delimited list of unique values.
(default: None)
- --inner
- If specified, exclude consecutive missing characters at either end of the
sequence. (default: False)
- --keepmiss
- If specified, sequences with more missing characters than the threshold
set by the -n parameter will be written to the unique sequence
output file with a DUPCOUNT=1 annotation. If not specified, such sequences
will be written to a separate file. (default: False)
- --maxf MAX_FIELD
- Specify the field whose maximum value determines the retained sequence;
mutually exclusive with --minf. (default: None)
- --minf MIN_FIELD
- Specify the field whose minimum value determines the retained sequence;
mutually exclusive with --maxf. (default: None)
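
The --delim, --cf and --act options above can be illustrated with a small Python sketch. This is not the CollapseSeq.py implementation: the helper names `parse_header` and `collapse_values`, the example field names PRIMER and COUNT, the storage of the first header block under the key "ID", and the ordering of values produced by "set" are all assumptions made for illustration. Only the default delimiters and the four action names are taken from the option descriptions.

```python
from typing import Dict, List, Tuple

# Default delimiters from the --delim description:
# annotation blocks, field name/value, values within a field.
DELIMS: Tuple[str, str, str] = ("|", "=", ",")


def parse_header(header: str, delims: Tuple[str, str, str] = DELIMS) -> Dict[str, str]:
    """Split 'ID|FIELD=value|...' into a dict (hypothetical helper).

    Assumption: the first block is the sequence identifier, stored as 'ID'.
    """
    block_sep, field_sep, _ = delims
    blocks = header.split(block_sep)
    fields = {"ID": blocks[0]}
    for block in blocks[1:]:
        name, _, value = block.partition(field_sep)
        fields[name] = value
    return fields


def collapse_values(values: List[str], action: str,
                    delims: Tuple[str, str, str] = DELIMS) -> str:
    """Combine one copy field across a set of duplicates (hypothetical helper)."""
    value_sep = delims[2]
    if action == "set":
        # Unique values joined by the value delimiter; keeping first-seen
        # order is an assumption of this sketch.
        seen: List[str] = []
        for value in values:
            for item in value.split(value_sep):
                if item not in seen:
                    seen.append(item)
        return value_sep.join(seen)
    numbers = [float(v) for v in values]
    result = {"min": min, "max": max, "sum": sum}[action](numbers)
    # Render whole numbers without a trailing '.0'.
    return str(int(result)) if result == int(result) else str(result)


headers = [
    "read1|PRIMER=IGHG|COUNT=3",
    "read2|PRIMER=IGHA|COUNT=5",
    "read3|PRIMER=IGHG|COUNT=1",
]
parsed = [parse_header(h) for h in headers]
print(collapse_values([p["PRIMER"] for p in parsed], "set"))  # IGHG,IGHA
print(collapse_values([p["COUNT"] for p in parsed], "sum"))   # 9
```

In this sketch, the "set" action would suit a categorical field such as the hypothetical PRIMER, while "min", "max" and "sum" only make sense for numeric fields such as COUNT.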
output files:
- collapse-unique
- unique sequences. Contains one representative from each set of duplicate
sequences. The retained representative is determined by user-defined
criteria.
- collapse-duplicate
- raw reads which are duplicates of the sequences retained in the
collapse-unique file.
- collapse-undetermined
- raw reads which were excluded from consideration due to having too many N
characters in the sequence.
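
How reads might be sorted into these three groups can be sketched in Python as follows. This is a simplified illustration, not the tool's algorithm: the function names, the set of characters treated as missing (`N`, `n`, `-`, `.`), the use of exact string matching to detect duplicates, and the choice of the first read seen as the representative are all assumptions. The `max_missing` and `inner` parameters are meant to mirror the -n and --inner options.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (header, sequence)

# Assumption: characters counted as "missing" in this sketch.
MISSING_CHARS = "Nn-."


def count_missing(seq: str, inner: bool = False) -> int:
    """Count missing characters, optionally ignoring runs at either end."""
    if inner:
        seq = seq.strip(MISSING_CHARS)
    return sum(1 for char in seq if char in MISSING_CHARS)


def partition(records: List[Record], max_missing: int = 0,
              inner: bool = False) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split records into (unique, duplicate, undetermined) lists."""
    first_seen: Dict[str, str] = {}  # sequence -> header of its representative
    duplicate: List[Record] = []
    undetermined: List[Record] = []
    for header, seq in records:
        if count_missing(seq, inner) > max_missing:
            undetermined.append((header, seq))
        elif seq in first_seen:
            duplicate.append((header, seq))
        else:
            first_seen[seq] = header
    unique = [(header, seq) for seq, header in first_seen.items()]
    return unique, duplicate, undetermined


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "ACNNT"), ("r4", "NNACGA")]
unique, duplicate, undetermined = partition(reads, max_missing=0)
print([h for h, _ in unique])        # ['r1']
print([h for h, _ in duplicate])     # ['r2']
print([h for h, _ in undetermined])  # ['r3', 'r4']
```

With `inner=True`, the leading `NN` of `r4` is ignored when counting, so in this sketch `r4` moves from the undetermined group to the unique group while `r3` stays undetermined.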
output annotation fields:
- DUPCOUNT
- total number of sequences within the set of duplicates for each retained
unique sequence; that is, the copy number of each unique sequence within
the data file.
- <user defined>
- annotation fields specified by the --cf parameter.
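
As a rough illustration of the DUPCOUNT annotation, the Python sketch below counts identical sequences and appends the count to the header of one representative per group. The function name `annotate_dupcount`, the choice of the first read as representative, and the placement of DUPCOUNT at the end of the header are assumptions; the delimiters are the --delim defaults.

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def annotate_dupcount(records: List[Record],
                      delims: Tuple[str, str, str] = ("|", "=", ",")) -> List[Record]:
    """Keep the first record of each sequence and append DUPCOUNT to its header."""
    block_sep, field_sep, _ = delims
    counts = Counter(seq for _, seq in records)
    seen = set()
    collapsed: List[Record] = []
    for header, seq in records:
        if seq in seen:
            continue  # a later copy of a sequence already represented
        seen.add(seq)
        new_header = f"{header}{block_sep}DUPCOUNT{field_sep}{counts[seq]}"
        collapsed.append((new_header, seq))
    return collapsed


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGA")]
for header, seq in annotate_dupcount(reads):
    print(header, seq)
# r1|DUPCOUNT=2 ACGT
# r3|DUPCOUNT=1 TTGA
```

A sequence seen only once still receives DUPCOUNT=1 here, which matches the value the --keepmiss description mentions for sequences written without collapsing.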
This manpage was written by Andreas Tille for the Debian
distribution and
can be used for any other usage of the program.