NCBI-SEG(1) | User Commands | NCBI-SEG(1) |
ncbi-seg - segment sequence(s) by local complexity
ncbi-seg sequence [ W ] [ K(1) ] [ K(2) ] [ -x ] [ options ]
ncbi-seg divides sequences into contrasting segments of low-complexity and high-complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally-biased regions".
Locally-optimized low-complexity segments are produced at defined levels of stringency, based on formal definitions of local compositional complexity (Wootton & Federhen, 1993). The segment lengths and the number of segments per sequence are determined automatically by the algorithm.
The input is a FASTA-formatted sequence file, or a database file containing many FASTA-formatted sequences. ncbi-seg is tuned for amino acid sequences. For nucleotide sequences, see EXAMPLES OF PARAMETER SETS below.
The stringency of the search for low-complexity segments is determined by three user-defined parameters, trigger window length [ W ], trigger complexity [ K(1) ] and extension complexity [ K(2)] (see below under PARAMETERS ). The defaults provided are suitable for low-complexity masking of database search query sequences [ -x option required, see below].
(1) Readable segmented sequence [Default]. Regions of contrasting complexity are displayed in "tree format". See EXAMPLES.
(2) Low-complexity masking (see Altschul et al, 1994). Produce a masked FASTA-formatted file, ready for input as a query sequence for database search programs such as BLAST or FASTA. The amino acids in low-complexity regions are replaced with "x" characters [-x option]. See EXAMPLES.
(3) Database construction. Produce FASTA-formatted files containing low-complexity segments [-l option], or high-complexity segments [-h option], or both [-a option]. Each segment is a separate sequence entry with an informative header line.
The SEG algorithm has two stages. First, identification of approximate raw segments of low- complexity; second local optimization.
At the first stage, the stringency and resolution of the search for low-complexity segments is determined by the W, K(1) and K(2) parameters. All trigger windows are defined, including overlapping windows, of length W and complexity less than or equal to K(1). "Complexity" here is defined by equation (3) of Wootton & Federhen (1993). Each trigger window is then extended into a contig in both directions by merging with extension windows, which are overlapping windows of length W and complexity less than or equal to K(2). Each contig is a raw segment.
At the second stage, each raw segment is reduced to a single optimal low-complexity segment, which may be the entire raw segment but is usually a subsequence. The optimal subsequence has the lowest value of the probability P(0) (equation (5) of Wootton & Federhen, 1993).
These three numeric parameters are in obligatory order after the sequence file name.
Trigger window length [ W ]. An integer greater than zero [ Default 12 ].
Trigger complexity. [ K1 ]. The maximum complexity of a trigger window in units of bits. K1 must be equal to or greater than zero. The maximum value is 4.322 (log[base 2]20) for amino acid sequences [ Default 2.2 ].
Extension complexity [ K2 ]. The maximum complexity of an extension window in units of bits. Only values greater than K1 are effective in extending triggered windows. Range of possible values is as for K1 [ Default 2.5 ].
The following options may be placed in any order in the command line after the W, K1 and K2 parameters:
Default parameters are given by 'ncbi-seg sequence' (equivalent to 'ncbi-seg sequence 12 2.2 2.5'). These parameters are appropriate for low- complexity masking of many amino acid sequences [with -x option ].
More stringent (lower) complexity parameters are suitable when masked sequences are compared with masked sequences. For example, for BLAST or FASTA searches that compare two amino acid sequence databases, the following masking may be applied to both databases:
ncbi-seg database 12 1.8 2.0 -x
To examine all homopolymeric subsequences of length (for example) 7 or greater:
ncbi-seg sequence 7 0 0
Many long non-globular domains may be diagnosed at longer window lengths, typically:
ncbi-seg sequence 45 3.4 3.75
For some shorter non-globular domains, the following set is appropriate:
ncbi-seg sequence 25 3.0 3.3
The maximum value of the complexity parameters is 2 (log[base 2]4). For masking, the following is approximately equivalent in effect to the default parameters for amino acid sequences:
ncbi-seg sequence.na 21 1.4 1.6
The following is a file named 'prion' in FASTA format:
>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP HGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGA VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV NITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV ILLISFLIFLIVG
The command line:
ncbi-seg /usr/share/doc/ncbi-seg/examples/prion.fa
gives the standard output below
>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR 1-49 MANLGCWMLVLFVATWSDLGLCKKRPKPGG WNTGGSRYPGQGSPGGNRY ppqggggwgqphgggwgqphgggwgqphgg 50-94 gwgqphgggwgqggg 95-112 THSQWNKPSKPKTNMKHM agaaaagavvgglggymlgsams 113-135 136-187 RPIIHFGSDYEDRYYRENMHRYPNQVYYRP MDEYSNQNNFVHDCVNITIKQH tvttttkgenftet 188-201 202-236 DVKMMERVVEQMCITQYERESQAYYQRGSS MVLFS sppvillisflifliv 237-252 253-253 G
The low-complexity sequences are on the left (lower case) and high-complexity sequences are on the right (upper case). All sequence segments read from left to right and their order in the sequence is from top to bottom, as shown by the central column of residue numbers.
The command line:
ncbi-seg /usr/share/doc/ncbi-seg/examples/prion.fa -x
gives the following FASTA-formatted file:-
>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxTHSQWNKPSKPKTNMKHMxxxxxxxx xxxxxxxxxxxxxxxRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV NITIKQHxxxxxxxxxxxxxxDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSxxxx xxxxxxxxxxxxG
John Wootton: wootton@ncbi.nlm.nih.gov
Scott Federhen: federhen@ncbi.nlm.nih.gov
National Center for Biotechnology Information Building 38A, Room 8N805 National Library of Medicine National Institutes of Health Bethesda, Maryland, MD 20894 U.S.A.
Wootton, J.C., Federhen, S. (1993) Statistics of local complexity in amino acid sequences and sequence databases. Computers & Chemistry 17: 149-163.
Wootton, J.C. (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Computers & Chemistry 18: (in press).
Altschul, S.F., Boguski, M., Gish, W., Wootton, J.C. (1994) Issues in searching molecular sequence databases. Nature Genetics 6: 119-129.
Wootton, J.C. (1994) Simple sequences of protein and DNA. In: Nucleic Acid and Protein Sequence Analysis: A Practical Approach. (Second Edition, Chapter 8, Bishop, M.J. and Rawlings, C.R. Eds. IRL Press, Oxford) (In press).
2020-07-21 | 0.0.20000620 |