seqkit - cross-platform and ultrafast toolkit for FASTA/Q file
manipulation
SeqKit -- a cross-platform and ultrafast toolkit for
FASTA/Q file manipulation
Version: 2.1.0
Author: Wei Shen <shenwei356@gmail.com>
Documents: http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962
SeqKit utilizes the pgzip (https://github.com/klauspost/pgzip)
package to read and write gzip files, and the output gzip files are
slightly larger than files generated by GNU gzip.
SeqKit writes gzip files very fast, much faster than the
multi-threaded pigz, so there is no need to pipe the result to
gzip/pigz.
- amplicon
- extract amplicon (or specific region around it) via primer(s)
- bam
- monitoring and online histograms of BAM record features
- common
- find common sequences of multiple files by id/name/sequence
- concat
- concatenate sequences with same ID from multiple files
- convert
- convert FASTQ quality encoding between Sanger, Solexa and Illumina
- duplicate
- duplicate sequences N times
- faidx
- create FASTA index file and extract subsequence
- fish
- look for short sequences in larger sequences using local alignment
- fq2fa
- convert FASTQ to FASTA
- fx2tab
- convert FASTA/Q to tabular format (and length, GC content, average
quality...)
- genautocomplete
- generate shell autocompletion script (bash|zsh|fish|powershell)
- grep
- search sequences by ID/name/sequence/sequence motifs, mismatch allowed
- head
- print first N FASTA/Q records
- head-genome
- print sequences of the first genome with common prefixes in name
- locate
- locate subsequences/motifs, mismatch allowed
- mutate
- edit sequence (point mutation, insertion, deletion)
- pair
- match up paired-end reads from two fastq files
- range
- print FASTA/Q records in a range (start:end)
- rename
- rename duplicated IDs
- replace
- replace name/sequence by regular expression
- restart
- reset start position for circular genome
- rmdup
- remove duplicated sequences by ID/name/sequence
- sample
- sample sequences by number or proportion
- sana
- sanitize broken single line FASTQ files
- scat
- real time recursive concatenation and streaming of fastx files
- seq
- transform sequences (extract ID, filter by length, remove gaps...)
- shuffle
- shuffle sequences
- sliding
- extract subsequences in sliding windows
- sort
- sort sequences by id/name/sequence/length
- split
- split sequences into files by id/seq region/size/parts (mainly for FASTA)
- split2
- split sequences into files by size/parts (FASTA, PE/SE FASTQ)
- stats
- simple statistics of FASTA/Q files
- subseq
- get subsequences by region/gtf/bed, including flanking sequences
- tab2fx
- convert tabular format to FASTA/Q format
- translate
- translate DNA/RNA to protein sequence (supporting ambiguous bases)
- version
- print version information and check for update
- watch
- monitoring and online histograms of sequence features
- --alphabet-guess-seq-length
int
- length of sequence prefix of the first FASTA record based on which seqkit
guesses the sequence type (0 for whole seq) (default 10000)
- -h, --help
- help for seqkit
- --id-ncbi
- FASTA header is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
- --id-regexp
string
- regular expression for parsing ID (default "^(\\S+)\\s?")
- --infile-list
string
- file listing input files (one file per line); if given, they are appended
to the files from CLI arguments
- -w, --line-width
int
- line width when outputting FASTA format (0 for no wrap) (default 60)
- -o, --out-file
string
- out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
- --quiet
- be quiet and do not show extra information
- -t, --seq-type
string
- sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically
detects the type from the first sequence) (default "auto")
- -j, --threads
int
- number of CPUs (can also be set with the environment variable
SEQKIT_THREADS) (default 4)
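To make the --id-regexp default concrete, here is a minimal Python sketch of how a pattern such as "^(\S+)\s?" picks the ID out of a FASTA header. It is an illustration only, not seqkit's code: the helper parse_id and the NCBI-style pattern are assumptions introduced for this example, and the pattern seqkit applies for --id-ncbi may differ.

```python
import re

# Default pattern quoted in the --id-regexp help text: the ID is the first
# run of non-whitespace characters in the header.
DEFAULT_ID_REGEXP = re.compile(r"^(\S+)\s?")

# Assumption: an illustrative pattern for NCBI-style headers that captures
# the pipe-delimited field just before the description. Not taken from seqkit.
NCBI_ID_REGEXP = re.compile(r"\|([^|]+)\| ")


def parse_id(header, pattern=DEFAULT_ID_REGEXP):
    """Return the captured ID, or the whole header if the pattern does not match.

    parse_id is a hypothetical helper for this sketch.
    """
    header = header.lstrip(">")
    match = pattern.search(header)
    return match.group(1) if match else header


ncbi_header = ">gi|110645304|ref|NC_002516.2| Pseud..."

print(parse_id(">seq1 Homo sapiens chromosome 1"))  # seq1
print(parse_id(ncbi_header))                        # gi|110645304|ref|NC_002516.2|
print(parse_id(ncbi_header, NCBI_ID_REGEXP))        # NC_002516.2
```

With the default pattern, an NCBI-style header yields the whole pipe-delimited string as the ID, which is why a separate option for such headers is useful.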
Use "seqkit [command] --help" for more
information about a command.
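As an illustration of the -w, --line-width option described in the option list, the following Python sketch wraps a sequence into fixed-width lines, treating 0 as "no wrap". The function name wrap_sequence is hypothetical and the code only mimics the documented behaviour; it is not how seqkit is implemented.

```python
def wrap_sequence(seq, width=60):
    """Split seq into lines of at most `width` characters.

    A width of 0 (or less) returns the sequence on a single line,
    mirroring the "0 for no wrap" convention in the help text.
    """
    if width <= 0:
        return seq
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


sequence = "ACGT" * 40  # 160 bases
record = ">seq1\n" + wrap_sequence(sequence)
print(record)  # header line, then lines of 60, 60 and 40 bases
```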
This manpage was written by Nilesh Patra for the Debian
distribution and can be used for any other usage of the program.