MSA_SPLIT(1) | User Commands | MSA_SPLIT(1) |
msa_split - Partitions a multiple sequence alignment either at designated columns, or according to specified category labels
Partitions a multiple sequence alignment either at designated columns, or according to specified category labels, and outputs sub-alignments for the partitions. Optionally splits an associated annotations file.
(See below for details on options)
1. Read an alignment for a whole human chromosome from a MAF file and extract sub-alignments in 1Mb windows overlapping by 1kb. Use sufficient statistics (SS) format for output (can be used by phyloFit, phastCons, or exoniphy). Set window boundaries between alignment blocks, if possible.
(Windows will be defined using the coordinate system of the first sequence in the alignment, assumed to be the reference sequence; output will be to chr1.1-1000000.ss, chr1.999001-1999000.ss, ...)
2. As in (1), but report unordered sufficient statistics (much more compact and adequate for use with phyloFit).
3. Extract sub-alignments of sites in conserved elements and not in conserved elements, as defined by a BED file (coordinates assumed to be for 1st sequence). Read multiple alignment in FASTA format.
(Output will be to mydata.background-0.fa and mydata.bed_feature-1.fa [latter has sites of category number 1, defined by the BED file])
4. Extract sub-alignments of sites in each of the three codon positions, as defined by a GFF file (coordinates assumed to be for 1st sequence). Reverse complement genes on minus strand.
(Output will be to chr22.cds-1.ss, chr22.cds-2.ss, chr22.cds-3.ss)
5. Split an alignment into pieces corresponding to the genes in a GFF file. Assume genes are defined by the tag "transcript_id".
6. Obtain a sub-alignment for each of a set of regulatory regions, as defined in a BED file.
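The window arithmetic behind the first example can be sketched in Python. This is only an illustration, not msa_split's own code: the function names window_bounds and window_filenames are hypothetical, and the treatment of the final, shorter window is an assumption. Coordinates are 1-based and inclusive, matching the output file names listed for that example.

```python
def window_bounds(seq_len, win_size, win_overlap):
    """Return 1-based inclusive (start, end) pairs for overlapping windows.

    Assumption: the last window is simply truncated at seq_len.
    """
    if win_overlap >= win_size:
        raise ValueError("win_overlap must be smaller than win_size")
    step = win_size - win_overlap  # distance between consecutive window starts
    windows = []
    start = 1
    while start <= seq_len:
        end = min(start + win_size - 1, seq_len)
        windows.append((start, end))
        if end == seq_len:
            break
        start += step
    return windows


def window_filenames(root, windows, ext="ss"):
    """Build names of the form <root>.<start>-<end>.<ext>."""
    return [f"{root}.{start}-{end}.{ext}" for start, end in windows]


# A 1Mb window with 1kb overlap, as in the first example.
for name in window_filenames("chr1", window_bounds(2_500_000, 1_000_000, 1_000)):
    print(name)
```

With --between-blocks the real boundaries may be shifted by up to <radius> sites, so actual file names can differ from this idealized arithmetic.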
--windows, -w <win_size,win_overlap>
--by-category, -L
--by-group, -P <tag>
--for-features, -F (Requires --features) Extract section of alignment corresponding to every feature. There will be no output for regions not covered by features.
--by-index, -p <indices> List of explicit indices at which to split alignment (comma-separated). If the list of indices is "10,20", then sub-alignments will be output for sites 1-9, 10-19, and 20-<msa_len>. Note that the indices are relative to the input alignment, and not necessarily in genomic coordinates.
--npartitions, -n <number>
--between-blocks, -B <radius> (Not for use with --by-category or --for-features) Try to partition at sites between alignment blocks. Assumes a reference sequence alignment, with the first sequence as the reference seq (as created by multiz). Blocks of 30 sites with gaps in all sequences but the reference seq are assumed to indicate boundaries between alignment blocks. Partition indices will not be moved more than <radius> sites.
--features, -g <fname>
--catmap, -c <fname>|<string> (Optionally use with --by-category) Mapping of feature types to category numbers. Can either give a filename or an "inline" description of a simple category map, e.g., --catmap "NCATS = 3 ; CDS 1-3" or --catmap "NCATS = 1 ; UTR 1".
--refidx, -d <frame_index>
--in-format, -i FASTA|PHYLIP|MPM|MAF|SS Input alignment file format. Default is to guess format from
--refseq, -M <fname>
--out-format, -o FASTA|PHYLIP|MPM|SS Output alignment file format. Default is FASTA.
--out-root, -r <name> Filename root for output files (default "msa_split").
--sub-features, -f (For use with --features) Output subsets of features corresponding to subalignments. Features overlapping partition boundaries will be discarded. Not permitted with --by-category.
--reverse-compl, -s
--gap-strip, -G ALL|ANY|<seqno>
--seqs, -l <seq_list> Include only specified sequences in output. Indicate by
--exclude, -x Exclude rather than include specified sequences.
--order, -O <name_list>
--min-informative, -I <n>
--do-cats, -C <cat_list> (For use with --by-category) Output sub-alignments for only the specified categories (comma-delimited list).
--tuple-size, -T <tuple_size>
--unordered-ss, -z (For use with --out-format SS) Suppress the portion of the sufficient statistics concerned with the order in which columns appear.
--summary, -S
--quiet, -q Proceed quietly.
--help, -h
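The partitioning rules described for --by-index and --between-blocks can be sketched in Python. This is a rough illustration under stated assumptions, not the program's actual algorithm: all function names are hypothetical, gaps are assumed to be written as "-", the alignment is assumed to be a list of equal-length strings with the reference first, and the "nearest column wins" search is an assumption. The --by-index part uses 1-based inclusive coordinates; the --between-blocks part uses 0-based column indices.

```python
def partitions_from_indices(indices, msa_len):
    """1-based inclusive (start, end) ranges for split points, as in --by-index."""
    cuts = sorted({i for i in indices if 1 < i <= msa_len})
    starts = [1] + cuts
    ends = [c - 1 for c in cuts] + [msa_len]
    return list(zip(starts, ends))


def block_boundary_columns(aln, min_run=30):
    """0-based columns lying in runs of >= min_run sites where every
    sequence except the reference (aln[0]) has a gap."""
    ncols = len(aln[0])
    others = aln[1:]
    gap_only = [all(row[c] == "-" for row in others) for c in range(ncols)]
    cols = set()
    run_start = None
    for c in range(ncols + 1):
        if c < ncols and gap_only[c]:
            if run_start is None:
                run_start = c
        else:
            # A run just ended (or we passed the last column).
            if run_start is not None and c - run_start >= min_run:
                cols.update(range(run_start, c))
            run_start = None
    return cols


def nudge_to_boundary(idx, boundary_cols, radius):
    """Move idx to the nearest boundary column within radius, if any.

    Assumption: ties are resolved toward the left; idx is returned
    unchanged when no boundary column is close enough.
    """
    for d in range(radius + 1):
        for cand in (idx - d, idx + d):
            if cand in boundary_cols:
                return cand
    return idx


# The "10,20" case from the --by-index description, for a 100-column alignment.
print(partitions_from_indices([10, 20], 100))  # [(1, 9), (10, 19), (20, 100)]
```

Here the search radius plays the role of <radius>: a proposed split point is left alone unless a qualifying gap-only run lies within that many sites.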
May 2016 | msa_split 1.4 |