SIM4(1) General Commands Manual SIM4(1)
sim4 - align an expressed DNA sequence with a genomic sequence
sim4 seqfile1 seqfile2 {[WXKCRDHAPNBS]=value}
sim4 is a similarity-based tool for aligning an expressed DNA sequence (EST, cDNA, mRNA) with a genomic sequence for the gene. It also detects end matches when the two input sequences overlap at one end (i.e., the start of one sequence overlaps the end of the other). If seqfile2 is a database of sequences, the sequence in seqfile1 will be aligned with each of the sequences in seqfile2.
sim4 employs a blast-based technique to first determine the basic matching blocks representing the "exon cores". In this first stage, it detects all possible exact matches of W-mers (i.e., DNA words of size W) between the two sequences and extends them to maximal scoring gap-free segments. In the second stage, the exon cores are extended into the adjacent as-yet-unmatched fragments using greedy alignment algorithms, and heuristics are used to favor configurations that conform to the splice-site recognition signals (GT-AG, CT-AC). If necessary, the process is repeated with less stringent parameters on the unmatched fragments.
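The first step of the first stage, finding exact W-mer matches, can be sketched as follows. This is a minimal illustration and not sim4's implementation: the function name find_wmer_hits and the default word size are assumptions made for this example.

```python
from collections import defaultdict


def find_wmer_hits(genomic, cdna, w=12):
    """Return (genomic_pos, cdna_pos) pairs where both sequences share an exact W-mer.

    Hypothetical helper for illustration; w=12 is an example value only.
    """
    # Index every W-mer of the genomic sequence by its start position.
    index = defaultdict(list)
    for i in range(len(genomic) - w + 1):
        index[genomic[i:i + w]].append(i)

    # Look up every W-mer of the cDNA sequence in that index.
    hits = []
    for j in range(len(cdna) - w + 1):
        for i in index.get(cdna[j:j + w], ()):
            hits.append((i, j))
    return hits
```

Each hit is only a seed; in the procedure described above, hits are then extended into gap-free segments before any exon boundaries are considered.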
By default, sim4 searches both strands and reports the best match, measured by the number of matching nucleotides found in the alignment. The R command line option can be used to restrict the search to one orientation (strand) only.
Currently, five major alignment display options are supported, controlled by the A option. By default (A=0), only the endpoints, overall similarity, and orientation of the introns are reported. An arrow sign (`->' or `<-') indicates the orientation of the intron (`+' or `-' strand), when the signals flanking the intron have three or more position matches with either the GT-AG or the CT-AC splice recognition signals. When the same number of matches is found for both orientations, the intron is reported as ambiguous, and represented by `--'. The sign `==' marks the absence from the alignment of a cDNA fragment starting at that position. Alternative formats (lav-block format, text, PipMaker-type `exons file', or certain combinations of these options) can be requested by specifying a different value for A.
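The orientation rule for the default display can be sketched as a small function. This is an illustration under stated assumptions, not sim4's code: the name intron_orientation is hypothetical, and returning `--' when neither signal reaches three matches is an assumption, since that case is not described above.

```python
def intron_orientation(left, right):
    """Classify an intron from its two flanking dinucleotides.

    left  : first two bases of the intron
    right : last two bases of the intron
    Returns '->', '<-' or '--'. Hypothetical helper for illustration.
    """
    signal = (left + right).upper()
    # Count position-by-position matches against each splice signal.
    forward = sum(a == b for a, b in zip(signal, "GTAG"))  # GT-AG
    reverse = sum(a == b for a, b in zip(signal, "CTAC"))  # CT-AC

    if forward == reverse:
        return "--"  # same number of matches: ambiguous
    if max(forward, reverse) >= 3:
        return "->" if forward > reverse else "<-"
    # Assumption: fewer than three matches on both signals is also
    # reported as ambiguous; the text above does not cover this case.
    return "--"
```

For example, an intron flanked by GT and AC matches three positions of each signal and is therefore reported as ambiguous.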
If the P option is specified with a non-zero value, sim4 will remove any 3'-end poly-A tails that it detects in the alignment.
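As a rough illustration of the idea, the sketch below trims a trailing run of A's from a raw sequence. It is a simplification: sim4 removes tails that it detects in the alignment, whereas this example works on the sequence alone, and both the function name trim_poly_a and the min_run threshold are assumptions.

```python
def trim_poly_a(seq, min_run=10):
    """Remove a trailing run of A's if it is at least min_run bases long.

    Hypothetical helper; min_run is an example threshold, not a sim4 value.
    """
    stripped = seq.rstrip("Aa")
    tail_length = len(seq) - len(stripped)
    # Short runs of A are kept, since they may be genuine sequence.
    return stripped if tail_length >= min_run else seq
```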
Occasionally, sim4 may miss an internal exon when it is surrounded by very large introns, typically longer than 100 Kb. When this is suspected, the H option can be used to reset the exons' weight to compensate for the intron gap penalty.
Ambiguity codes are allowed in sequence data by default, but sim4 does not distinguish among them. If desired, the B command line option can restrict the set of acceptable characters to A, C, G, T, N, and X only.
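Before running with the restricted character set, it can help to check which positions of a sequence would fall outside it. The sketch below is a hypothetical pre-check, not part of sim4; the name invalid_positions and the case-insensitive comparison are assumptions.

```python
RESTRICTED_ALPHABET = frozenset("ACGTNX")


def invalid_positions(seq, allowed=RESTRICTED_ALPHABET):
    """Return the 0-based positions whose character is outside the allowed set.

    Assumption: lower-case letters are treated like their upper-case forms.
    """
    return [i for i, ch in enumerate(seq.upper()) if ch not in allowed]
```

An empty result means the sequence contains only A, C, G, T, N, and X.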
sim4 compares the lengths of the input sequences to distinguish between the cDNA (`short') and the genomic (`long') components in the comparison. When seqfile2 contains a collection of sequences, the first entry in the file will be used to determine the type of this and all subsequent comparisons.
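The length-based role assignment can be sketched as below. The function names are hypothetical, and the handling of equal lengths is an assumption, since it is not specified above.

```python
def assign_roles(len1, len2):
    """Label two sequences as cDNA ('short') or genomic ('long') by length.

    Assumption: when lengths are equal, the first sequence is taken as cDNA.
    """
    if len1 <= len2:
        return ("cDNA", "genomic")
    return ("genomic", "cDNA")


def roles_for_database(query_length, database_lengths):
    """Decide the roles once, from the first database entry, for all comparisons."""
    return assign_roles(query_length, database_lengths[0])
```

Note the consequence for a mixed database: if its first entry is shorter than the query, every entry is treated as cDNA, including entries that are longer than the query.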
In the description below, the term MSP denotes a Maximal Segment Pair, that is, a pair of highly similar fragments in the two sequences, obtained during the blast-like procedure by extending a W-mer hit by matches and perhaps a few mismatches.
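A gap-free extension of a W-mer hit into an MSP can be sketched as follows. This uses a generic drop-off rule in the style of blast-like programs and is not taken from sim4's source: the function name extend_hit and the match, mismatch, and xdrop values are assumptions chosen for illustration.

```python
def extend_hit(seq1, seq2, i, j, w, match=1, mismatch=-5, xdrop=12):
    """Extend the exact hit seq1[i:i+w] == seq2[j:j+w] in both directions, without gaps.

    Returns (start1, start2, length, score) of the best-scoring segment.
    Scores and the drop-off limit are hypothetical example values.
    """
    best = score = w * match
    best_right = 0
    k = 0
    # Extend to the right, remembering the length that gave the best score.
    while i + w + k < len(seq1) and j + w + k < len(seq2):
        score += match if seq1[i + w + k] == seq2[j + w + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > xdrop:
            break  # score fell too far below the best seen so far

    score = best
    best_left = 0
    k = 0
    # Extend to the left from the start of the hit in the same way.
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        score += match if seq1[i - k - 1] == seq2[j - k - 1] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score > xdrop:
            break

    return (i - best_left, j - best_left, best_left + w + best_right, best)
```

With a mild mismatch penalty the extension can step over an isolated mismatch when the matches beyond it raise the score again, which is what "a few mismatches" allows for.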
The algorithm parameters (included in the first two sections below) have already been tuned and do not normally require adjustment by the user.
Parameters internal to the blast-like procedure:
Additional algorithm parameters:
Context parameters:
sim4 est genomic
sim4 genomic estdb
sim4 est genomic A=1 P=1
sim4 est1 est2 R=1
sim4 mRNA genomic A=5 S=123..1020
sim4 mouse_cDNA human_genomic K=15 C=11 A=3 W=10
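When sim4 is driven from a script, the command lines above can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of sim4: the name build_sim4_command is an assumption, and the accepted option letters are limited to those that appear on this page.

```python
# Option letters that appear on this manual page.
KNOWN_OPTIONS = frozenset("WXKCRDHAPNBS")


def build_sim4_command(seqfile1, seqfile2, **options):
    """Build an argument list of the form: sim4 seqfile1 seqfile2 LETTER=value ...

    Hypothetical helper; it only formats arguments and does not run sim4.
    """
    argv = ["sim4", seqfile1, seqfile2]
    for letter, value in options.items():
        if letter not in KNOWN_OPTIONS:
            raise ValueError(f"unknown sim4 option: {letter!r}")
        argv.append(f"{letter}={value}")
    return argv
```

For instance, build_sim4_command("est", "genomic", A=1, P=1) yields the arguments of the third example above. On a system where sim4 is installed, the resulting list can be passed to subprocess.run(argv, capture_output=True, text=True).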
sim4 was written by Liliana Florea <florea@gwu.edu> and Scott Schwartz.
This manual page was written by Nelson A. de Oliveira <naoliv@gmail.com>, based on the online documentation at http://globin.cse.psu.edu/html/docs/sim4.html, for the Debian project (but may be used by others).
Wed, 03 Aug 2005 18:40:58 -0300