NAME

segemehl - Heuristic mapping of short sequences

SYNOPSIS

segemehl [-besVOc] -d <file> [<file>] [-q <file>] [-p <file>] [-i <file>] [-j <file>] [-x <file>] [-y <file>] [-G <file>] [-g <string>] [-t <n>] [-o <string>] [-u <file>] [-B <string>] [-F <n>] [-S [<basename>]] [-A <n>] [-D <n>] [-E <double>] [-H <n>] [-m <n>] [-Z <n>] [-W <n>] [-U <n>] [-l <f>] [-w <double>] [-X <n>] [-J <n>] [-I <n>] [-M <n>] [-n <n>] [-r <n>] [--skipidcheck] [--showalign] [--nohead]

DESCRIPTION

Segemehl is a program for mapping short sequencer reads to reference genomes. It implements a matching strategy based on enhanced suffix arrays (ESA) and accepts fasta and fastq queries (also gzip'ed and bgzip'ed). In addition to aligning reads from standard DNA- and RNA-seq protocols, it supports the mapping of bisulfite-converted reads (Lister and Cokus protocols) and implements a split-read mapping strategy. The output of segemehl is a SAM- or BAM-formatted alignment file. For split-read mapping, additional BED files are written to disk. These BED files may be summarized with the postprocessing tool haarz. For alignments of bisulfite-converted reads, raw methylation rates may also be called with haarz.

In brief, for each suffix of a read, segemehl aims to find the best-scoring seed. Seeds may contain insertions, deletions, and mismatches (differences). The number of differences allowed within a single seed is user-controlled (see -D) and is crucial for the runtime of the program. Subsequently, seeds that undercut the user-defined E-value (see -E) are passed on to an exact semi-global alignment procedure. Finally, reads that reach the user-defined minimum accuracy (in percent, see -A) are reported to the user.
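The last two filtering steps described above can be illustrated with a minimal Python sketch. It is not segemehl's implementation: the function names are hypothetical, and the definition of accuracy as the share of matching alignment columns is an assumption made for illustration. The defaults mirror the documented defaults of -E (5.0) and -A (90).

```python
def seed_passes_evalue(seed_evalue, max_evalue=5.0):
    # "Undercut" is read here as strictly smaller than the threshold
    # (an assumption; the manual does not state the comparison).
    return seed_evalue < max_evalue


def alignment_accuracy(matches, mismatches, insertions, deletions):
    # Assumed definition: matching columns as a percentage of all
    # alignment columns of the semi-global alignment.
    columns = matches + mismatches + insertions + deletions
    if columns == 0:
        return 0.0
    return 100.0 * matches / columns


def read_is_reported(matches, mismatches, insertions, deletions,
                     min_accuracy=90):
    # A read is kept only if its alignment reaches the minimum accuracy.
    accuracy = alignment_accuracy(matches, mismatches, insertions, deletions)
    return accuracy >= min_accuracy
```

With these assumptions, an alignment with 95 matches, 3 mismatches, 1 insertion, and 1 deletion has an accuracy of 95 percent and would be reported under the default threshold.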

OPTIONS

INPUT

-d, --database <file> [<file>]: list of path/filename(s) of fasta database sequence(s)
-q, --query <file>: path/filename of query sequences (default:none)
-p, --mate <file>: path/filename of mate pair sequences (default:none)
-i, --index <file>: path/filename of db index (default:none)
-j, --index2 <file>: path/filename of second db index (default:none)
-x, --generate <file>: generate db index and store to disk (default:none)
-y, --generate2 <file>: generate second db index and store to disk (default:none)
-G, --readgroupfile <file>: filename to read @RG header (default:none)
-g, --readgroupid <string>: read group id (default:none)
-t, --threads <n>: start <n> threads (default:1)
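
A typical workflow first generates an index with -x and then maps reads against it with -i. The following Python sketch only assembles the two argument lists from the options documented above; the helper names and all file names are hypothetical, and nothing is executed.

```python
def index_command(genome, index):
    # Build the index for a fasta database and store it to disk (-x).
    return ["segemehl", "-x", index, "-d", genome]


def map_command(genome, index, reads, outfile, mates=None, threads=1):
    # Map reads (-q) against a previously generated index (-i).
    cmd = ["segemehl", "-i", index, "-d", genome, "-q", reads]
    if mates is not None:
        # Optional second file with the mate pair sequences (-p).
        cmd += ["-p", mates]
    cmd += ["-t", str(threads), "-o", outfile]
    return cmd


# Example with placeholder file names:
#   index_command("genome.fa", "genome.idx")
#   map_command("genome.fa", "genome.idx", "reads.fq", "mapped.sam", threads=4)
```

The returned lists can be handed to subprocess.run(cmd, check=True) once the placeholder file names are replaced with real paths.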

OUTPUT

-o, --outfile <string>: outputfile (default:none)
-b, --bamabafixoida: generate a bam output (-o <filename> required)
-u, --nomatchfilename <file>: filename for unmatched reads (default:none)
-e, --briefcigar: brief cigar string (M vs X and =)
-s, --progressbar: show a progress bar
-B, --filebins <string>: file bins with basename <string> for easier data handling (default:none)
-V, --MEOP: output MEOP field for easier variant calling in SAM (XE:Z:)
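
The difference between the two CIGAR styles mentioned for -e can be shown with a short Python sketch. It collapses an extended CIGAR string (using = and X) into the brief form (using M), as defined by the SAM format; the function name is hypothetical and the code is an illustration, not segemehl's own conversion.

```python
import re


def brief_cigar(cigar):
    # Split the CIGAR string into (length, operation) pairs.
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    merged = []
    for length, op in ops:
        # Matches (=) and mismatches (X) both become alignment matches (M).
        if op in "=X":
            op = "M"
        # Merge neighbouring runs of the same operation.
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)
        else:
            merged.append([int(length), op])
    return "".join(f"{n}{op}" for n, op in merged)
```

For example, the extended string 10=1X5=2I3= corresponds to the brief string 16M2I3M.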

ALIGNMENT

-F, --bisulfite <n>: bisulfite aln with methylC-seq/Lister et al. (=1) or bs-seq/Cokus et al. protocol (=2) (default:0)
-S, --splits [<basename>]: detect split/spliced reads. (default:none)
-A, --accuracy <n>: min percentage of matches per read in semi-global alignment (default:90)
-D, --differences <n>: search seeds initially with <n> differences (default:1)
-E, --evalue <double>: max evalue (default:5.000000)
-H, --hitstrategy <n>: report only best scoring hits (=1) or all (=0) (default:1)
-m, --minsize <n>: minimum length of queries (default:12)
-Z, --minfraglen <n>: min length of a spliced fragment (default:20)
-W, --minsplicecover <n>: min coverage for spliced transcripts (default:80)
-U, --minfragscore <n>: min score of a spliced fragment (default:18)
-l, --splicescorescale <f>: report spliced alignment with score s only if <f>*s is larger than next best spliced alignment (default:0.900000)
-w, --maxsplitevalue <double>: max evalue for splits (default:50.000000)
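
The rule stated for -l can be written as a one-line check. The Python sketch below uses a hypothetical function name and simply restates the condition from the option description with the default scale of 0.9.

```python
def report_spliced(best_score, next_best_score, scale=0.9):
    # Report the spliced alignment with score s only if scale * s
    # is larger than the score of the next best spliced alignment.
    return scale * best_score > next_best_score
```

With the default scale, a spliced alignment scoring 100 is reported when the next best alternative scores 85, but not when it scores 95.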

SPECIAL

-X, --dropoff <n>: dropoff parameter for extension (default:8)
-J, --jump <n>: search seeds with jump size <n> (0=automatic) (default:0)
-O, --order: sorts the output by chromosome and position (might take a while!)
-I, --maxpairinsertsize <n>: maximum size of the inserts (paired end) in case of multiple hits (default:200000)
-M, --maxinterval <n>: maximum width of a suffix array interval, i.e. a query seed will be omitted if it matches more than <n> times (default:100)
-c, --checkidx: check index
-n, --extensionpenalty <n>: penalty for a mismatch during extension (default:4)
-r, --maxout <n>: maximum number of alignments that will be reported. If set to zero, all alignments will be reported (default:0)
--skipidcheck: do not check whether the fastq ids of mates / paired ends match. Instead, the first mate (-q) will be used for output only.
--showalign: show alignments
--nohead: do not output header
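
The thresholds set by -M and -r can be pictured with the following Python sketch. The function names are hypothetical, and which alignments survive when -r truncates the output is an assumption here (the first ones in the given order); the manual does not specify it.

```python
def seed_is_used(occurrences, max_interval=100):
    # A query seed is omitted if it matches more than max_interval times.
    return occurrences <= max_interval


def cap_alignments(alignments, maxout=0):
    # maxout == 0 means that all alignments are reported.
    alignments = list(alignments)
    if maxout == 0:
        return alignments
    # Assumption: keep the first maxout alignments in the given order.
    return alignments[:maxout]
```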

BUGS

Please report bugs to steve@bioinf.uni-leipzig.de

REFERENCES

2008 Bioinformatik Leipzig
2018 Leibniz Institute on Aging (FLI)

AUTHOR

This manpage was written by Andreas Tille for the Debian distribution and can be used for any other usage of the program.

October 2018

segemehl 0.3