GSNAP(1) User Commands GSNAP(1)

gsnap - Genomic Short-read Nucleotide Alignment Program

gsnap [OPTIONS...] <FASTA file>, or cat <FASTA file> | gsnap [OPTIONS...]

Genome directory. Default (as specified by --with-gmapdb to the configure program) is /var/cache/gmap
Genome database
Whether to use the local hash tables, which help with finding extensions to the ends of alignments in the presence of splicing or indels (0=no, 1=yes if available (default))

Transcriptome-guided options (optional)

Transcriptome directory. Default is the value for --dir above
Transcriptome database
Use only the transcriptome index and not the genome index

Computation options

kmer size to use in genome database (allowed values: 16 or less). If not specified, the program will find the highest available kmer size in the genome database
Sampling to use in genome database. If not specified, the program will find the smallest available sampling value in the genome database within selected k-mer size
Process only the i-th out of every n sequences, e.g., 0/100 or 99/100 (useful for distributing jobs to a computer farm).
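As a rough illustration of the i/n partitioning described above, the sketch below keeps only the sequences whose zero-based position modulo n equals i. The function name `select_part` and the modulo rule are assumptions made for illustration; the text above only states that the i-th out of every n sequences is processed.

```python
def select_part(sequences, part):
    """Keep the i-th out of every n sequences, with part given as "i/n"."""
    i_str, n_str = part.split("/")
    i, n = int(i_str), int(n_str)
    if not 0 <= i < n:
        raise ValueError("part must satisfy 0 <= i < n")
    # Assumption: sequence number k (zero-based) belongs to part k mod n.
    return [seq for k, seq in enumerate(sequences) if k % n == i]
```

Running the same input through every part from 0/n to (n-1)/n covers each sequence exactly once, which is what makes the scheme suitable for splitting one job across several machines.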
Size of input buffer (program reads this many sequences at a time for efficiency) (default 10000)
Amount of barcode to remove from start of every read before alignment (default 0)
Amount of trim to remove from the end of every read before alignment (default 0)
Orientation of paired-end reads. Allowed values: FR (fwd-rev, or typical Illumina; default), RF (rev-fwd, for circularized inserts), FF (fwd-fwd, same strand), or 10X (single-cell where read 1 has barcode information; read 2 is rev)
--10x-whitelist=FILE
Whitelist of 10X Genomics GEM bead barcodes, needed to perform correction of cellular barcodes. This file can be obtained at cellranger-x.y.z/lib/python/cellranger/barcodes (for Cell Ranger version >= 4) or cellranger-x.y.z/lib/cellranger-cs/x.y.z/lib/python/cellranger/barcodes (for version <= 3)
--10x-well-position=INT
Position of well information in the accession, when separated by colons. If set to 0, then no well information will be printed in the CB field (default: 4)
Starting position of identifier in FASTQ header, space-delimited (>= 1)
Ending position of identifier in FASTQ header, space-delimited (>= 1)
@HWUSI-EAS100R:6:73:941:1973#0/1
start=1, end=1 (default) => identifier is HWUSI-EAS100R:6:73:941:1973#0
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
start=1, end=1 => identifier is SRR001666.1
start=2, end=2 => identifier is 071112_SLXA-EAS1_s_7:5:1:817:345
start=1, end=2 => identifier is SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
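The start/end selection shown in these examples can be sketched as follows. This is a minimal illustration, not GSNAP's own parser: the function name `fastq_identifier` is hypothetical, and dropping a trailing /1 or /2 is an assumption inferred from the first example.

```python
def fastq_identifier(header, start=1, end=1):
    """Return fields start..end (1-based, space-delimited) of a FASTQ header."""
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    fields = header.lstrip("@").split(" ")
    identifier = " ".join(fields[start - 1:end])
    # Assumption: a trailing /1 or /2 pair suffix is dropped, as in the
    # HWUSI-EAS100R example.
    if identifier.endswith(("/1", "/2")):
        identifier = identifier[:-2]
    return identifier
```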
When multiple FASTQ files are provided on the command line, GSNAP assumes they are matching paired-end files. This flag treats each file as single-end.
Skips reads marked by the Illumina chastity program. Expecting a string after the accession having a 'Y' after the first colon, like this:
@accession 1:Y:0:CTTGTA
where the 'Y' signifies filtering by chastity. Values: off (default), either, both. For 'either', a 'Y' on either end of a paired-end read will be filtered. For 'both', a 'Y' is required on both ends of a paired-end read (or on the only end of a single-end read).
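The off/either/both rule can be sketched like this. The helper names `is_chastity_flagged` and `skip_read` are hypothetical, and the header parsing assumes exactly the layout shown above (accession, whitespace, then colon-separated fields).

```python
def is_chastity_flagged(header):
    """True if the string after the accession has 'Y' after its first colon."""
    parts = header.split(None, 1)
    if len(parts) < 2:
        return False  # no string after the accession
    fields = parts[1].split(":")
    return len(fields) > 1 and fields[1] == "Y"


def skip_read(mode, header1, header2=None):
    """Apply the off/either/both rule to a single-end or paired-end read."""
    flags = [is_chastity_flagged(h) for h in (header1, header2) if h is not None]
    if mode == "off":
        return False
    if mode == "either":
        return any(flags)
    if mode == "both":
        # For a single-end read, the only end must be flagged.
        return all(flags)
    raise ValueError("mode must be off, either, or both")
```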
Allows accession names of reads to mismatch in paired-end files
Input is in interleaved format (one read per line, tab-delimited)
Uncompress gzipped input files
Uncompress bzip2-compressed input files

Computation options

Batch mode (default = 2)


           Mode  Hash offsets  Hash positions  Genome          Local hash offsets  Local hash positions
           0     allocate      mmap            mmap            allocate            mmap
           1     allocate      mmap & preload  mmap            allocate            mmap & preload
           2     allocate      mmap & preload  mmap & preload  allocate            mmap & preload
           3     allocate      allocate        mmap & preload  allocate            allocate
(default)  4     allocate      allocate        allocate        allocate            allocate
A batch level of 5 means the same as 4, and is kept only for backward compatibility
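For scripts that need to reason about memory use, the batch-mode table can be encoded as a small lookup. This is only a transcription of the table into a data structure; the names `STRUCTURES`, `BATCH_MODES`, and `loading_plan` are hypothetical and are not part of GSNAP.

```python
STRUCTURES = (
    "hash offsets", "hash positions", "genome",
    "local hash offsets", "local hash positions",
)

# One row per batch mode, in the column order of STRUCTURES.
BATCH_MODES = {
    0: ("allocate", "mmap", "mmap", "allocate", "mmap"),
    1: ("allocate", "mmap & preload", "mmap", "allocate", "mmap & preload"),
    2: ("allocate", "mmap & preload", "mmap & preload", "allocate", "mmap & preload"),
    3: ("allocate", "allocate", "mmap & preload", "allocate", "allocate"),
    4: ("allocate", "allocate", "allocate", "allocate", "allocate"),
}
BATCH_MODES[5] = BATCH_MODES[4]  # level 5 means the same as 4


def loading_plan(mode):
    """Map each data structure to how it is loaded under the given mode."""
    return dict(zip(STRUCTURES, BATCH_MODES[mode]))
```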
If 1, then allocated memory is shared among all processes on this node. If 0 (default), then each process has private allocated memory
Load files indicated by --batch mode into shared memory for use by other GMAP/GSNAP processes on this node, and then exit. Ignore any input files.
Unload files indicated by --batch mode from shared memory, or allow them to be unloaded when existing GMAP/GSNAP processes on this node are finished with them. Ignore any input files.
Maximum number of mismatches allowed (if not specified, then GSNAP tries to find the best possible match in the genome). If specified between 0.0 and 1.0, then treated as a fraction of each read length. Otherwise, treated as an integral number of mismatches (including indel and splicing penalties). Default is 0.3
If GSNAP is run under SNP-tolerant or masked genome mode, then the --max-mismatches parameter above applies to mismatches against the reference and SNPs, or against the unmasked genome, respectively. The --max-ref-mismatches parameter applies to mismatches against the reference genome or the masked genome, respectively.
Minimum coverage required for an alignment. If specified between 0.0 and 1.0, then treated as a fraction of each read length. Otherwise, treated as an integral number of base pairs. Default value is 0.5.
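The fraction-or-count convention shared by the mismatch and coverage thresholds above can be sketched as follows. The name `resolve_threshold` is hypothetical, and two details are assumptions not stated in the text: values strictly below 1.0 are taken as fractions, and fractional results are rounded down.

```python
def resolve_threshold(value, read_length):
    """Turn a fraction-or-count setting into an absolute count for one read."""
    value = float(value)
    if value < 0:
        raise ValueError("threshold must be non-negative")
    # Assumption: values strictly below 1.0 are fractions of the read length.
    if value < 1.0:
        # Assumption: fractional results are rounded down.
        return int(value * read_length)
    return int(value)
```

Under these assumptions, a setting of 0.5 on a 76-base read gives 38, while a setting of 3 gives 3 regardless of read length.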
Whether to count mismatches in trimmed part of alignment (1, yes) or mismatches to the ends of the read (0, no), when applying the --max-mismatches and --max-ref-mismatches parameters. Default for RNA-Seq is 1 (yes) so we can allow for reads that align past the ends of an exon. Default for DNA-Seq is 0 (no).
is performed at probable splice sites, to allow for reads that align past the ends of an exon.
Whether to count unknown (N) characters in the query as a mismatch (0=no (default), 1=yes)
Whether to count unknown (N) characters in the genome as a mismatch (0=no, 1=yes). If --use-mask is specified, default is no, otherwise yes.
Penalty for an indel (default 2). Counts against mismatches allowed. To find indels, make indel-penalty less than or equal to max-mismatches. A value < 2 can lead to false positives at read ends.
Minimum length at end required for indel alignments (default 4)
Maximum number of middle insertions allowed (default 0.2). If specified between 0.0 and 1.0, then treated as a fraction of each read length. Otherwise, treated as an integral number of base pairs
Maximum number of middle deletions allowed (default 0.2). If specified between 0.0 and 1.0, then treated as a fraction of each read length. Otherwise, treated as an integral number of base pairs
Maximum number of end insertions allowed (default 3)
Maximum number of end deletions allowed (default 3)
Report suboptimal hits beyond best hit (default 0). All hits with best score plus suboptimal-levels are reported
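A minimal sketch of the suboptimal-levels rule, assuming that a score is a penalty count where lower is better (this scoring direction is an assumption, as is the function name `hits_to_report`):

```python
def hits_to_report(hits, suboptimal_levels=0):
    """Keep hits whose score is within suboptimal_levels of the best score.

    hits is a list of (name, score) pairs.
    Assumption: score is a penalty count, so a lower score is better.
    """
    if not hits:
        return []
    best = min(score for _, score in hits)
    return [hit for hit in hits if hit[1] <= best + suboptimal_levels]
```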
Method for removing adapters from reads. Currently allowed values: off, paired. Default is "off". To turn on, specify "paired", which removes adapters from paired-end reads if they appear to be present.
Score to use for indels when trimming at ends. To turn off trimming, specify 0. Default is -2 for both RNA-Seq and DNA-Seq. Warning: Turning trimming off in RNA-Seq can give false positive indels at the ends of reads
Use genome containing masks (e.g. for non-exons) for scoring preference
Directory for SNPs index files (created using snpindex) (default is location of genome index files specified using -D and -d)
Use database containing known SNPs (in <STRING>.iit, built previously using snpindex) for tolerance to SNPs
Directory for methylcytosine index files (created using cmetindex) (default is location of genome index files specified using -D, -V, and -d)
Directory for A-to-I RNA editing index files (created using atoiindex) (default is location of genome index files specified using -D, -V, and -d)
Alignment mode: standard (default), cmet-stranded, cmet-nonstranded, atoi-stranded, atoi-nonstranded, ttoc-stranded, or ttoc-nonstranded. Non-standard modes require you to have previously run the cmetindex or atoiindex programs (which also cover the ttoc modes) on the genome
Number of worker threads
Controls number of candidate segments returned by the complete set algorithm. Default is 10. Can be increased to higher values to solve alignments with evenly spaced mismatches at close distances. However, higher values will cause GSNAP to run more slowly. A value of 1000, for example, slows down the program by a factor of 10 or so. Therefore, change this value only if absolutely necessary.

Splicing options for DNA-Seq

Look for distant splicing involving poor splice sites (0=no, 1=yes). If not specified, then default is to be on unless only known splicing is desired (--use-splicing is specified and --novelsplicing is off)

Splicing options for RNA-Seq

Look for novel splicing (0=no (default), 1=yes)
Directory for splicing involving known sites or known introns, as specified by the -s or --use-splicing flag (default is directory computed from -D and -d flags). Note: can just give full pathname to the -s flag instead.
Look for splicing involving known sites or known introns (in <STRING>.iit), at short or long distances. See README instructions for the distinction between known sites and known introns
For ambiguous known splicing at ends of the read, do not clip at the splice site, but extend instead into the intron. This flag makes sense only if you provide the --use-splicing flag, and you are trying to eliminate all soft clipping with --trim-mismatch-score=0
Definition of local novel splicing event (default 200000)
Distance to look for novel splices at the ends of reads (default 80000)
Penalty for a local splice (default 0). Counts against mismatches allowed
Sensitivity for finding fusions
Penalty for a distant splice (default 1). A distant splice is one where the intron length exceeds the value of -w, or --localsplicedist, or is an inversion, scramble, or translocation between two different chromosomes. Counts against mismatches allowed
Minimum length at end required for distant spliced alignments (default 20, min allowed is the value of -k, or kmer size)
Minimum length at end required for short-end spliced alignments (default 2, but unless known splice sites are provided with the -s flag, GSNAP may still need the end length to be the value of -k, or kmer size, to find a given splice)
Minimum identity at end required for distant spliced alignments (default 0.95)
(Not currently implemented, since it leads to poor results) Penalty for antistranded splicing when using stranded RNA-Seq protocols. A positive value, such as 1, expects antisense on the first read and sense on the second read. Default is 0, which treats sense and antisense equally well
Report distant splices on the same chromosome as a single splice, if possible. Will produce a single SAM line instead of two SAM lines, which is also done for translocations, inversions, and scramble events

Options for paired-end reads

Max total genomic length for DNA-Seq paired reads, or other reads without splicing (default 2000). Used if -N or -s is not specified. This value is also used for circular chromosomes when splicing in linear chromosomes is allowed
Max total genomic length for RNA-Seq paired reads, or other reads that could have a splice (default 200000). Used if -N or -s is specified. Should probably match the value for -w, --localsplicedist.
Expected paired-end length, used for calling splices in medial part of paired-end reads (default 500). Was turned off in previous versions, but reinstated.
Allowable deviation from expected paired-end length, used for calling splices in medial part of paired-end reads (default 100). Was turned off in previous versions, but reinstated.

Options for quality scores

Protocol for input quality scores. Allowed values:
illumina (ASCII 64-126) (equivalent to -J 64 -j -31)
sanger (ASCII 33-126) (equivalent to -J 33 -j 0)
SAM output files should have quality scores in sanger protocol
Or you can customize this behavior with these flags:
FASTQ quality scores are zero at this ASCII value (default is 33 for sanger protocol; for Illumina, select 64)
Shift FASTQ quality scores by this amount in output (default is 0 for sanger protocol; to change Illumina input to Sanger output, select -31)
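The arithmetic behind the zero point and the output shift can be illustrated with a short sketch. The function names are hypothetical; the numbers (zero at ASCII 33 or 64, shift of -31) come from the descriptions above.

```python
def decode_qualities(quality_string, zero_ascii=33):
    """Convert a FASTQ quality string to numeric scores."""
    return [ord(c) - zero_ascii for c in quality_string]


def shift_qualities(quality_string, shift=0):
    """Shift every quality character by a fixed ASCII offset."""
    return "".join(chr(ord(c) + shift) for c in quality_string)
```

For example, the Illumina-encoded string "@h" decodes to scores 0 and 40 with a zero point of 64. Shifting it by -31 yields "!I", which decodes to the same scores with the Sanger zero point of 33.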

Output options

Maximum number of paths to print (default 100).
If more than maximum number of paths are found, then nothing is printed.
Print output in same order as input (relevant only if there is more than one worker thread)
For GSNAP output in SNP-tolerant alignment, shows all differences relative to the reference genome as lower case (otherwise, it shows all differences relative to both the reference and alternate genome)
For paired-end reads whose alignments overlap, clip the overlapping region.
For paired-end reads whose alignments overlap, merge the two ends into a single end (beta implementation)
Print detailed information about SNPs in reads (works only if -v also selected) (not fully implemented yet)
Print only failed alignments, those with no results
Exclude printing of failed alignments
Print only concordant alignments (concordant_uniq, concordant_mult, concordant_circular)
Do not print any concordant_uniq alignments
Do not print any concordant_mult alignments
Do not allow any alignments with soft clips
Another format type, other than default. Currently implemented: sam, m8 (BLAST tabular format)
Basename for multiple-file output, separately for nomapping, halfmapping_uniq, halfmapping_mult, unpaired_uniq, unpaired_mult, paired_uniq, paired_mult, concordant_uniq, and concordant_mult results
File name for a single stream of output results.
Print completely failed alignments as input FASTA or FASTQ format, to the given file, appending .1 or .2, for paired-end data. If the --split-output flag is also given, this file is generated in addition to the output in the .nomapping file.
When --split-output or --failed-input is given, this flag will append output to the existing files. Otherwise, the default is to create new files.
Among alignments tied with the best score, order those alignments in this order. Allowed values: genomic, random (default)
Buffer size, in queries, for output thread (default 1000). When the number of results to be printed exceeds this size, worker threads wait until the backlog is cleared

Options for SAM output

Do not print headers beginning with '@'
Add nomapper lines as needed to make all paired-end results alternate between first end and second end
Whether the paired bit in the SAM flags means concordant only (1) or paired plus concordant (0, default)
Print headers only for this batch, as specified by -q
Use S instead of H for hardclips
Insert 0M in CIGAR between adjacent insertions, deletions, and introns. Picard disallows 0M, other tools may require it
Use extended CIGAR format (using = and X symbols instead of M, to indicate matches and mismatches, respectively)
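To show what the extended format looks like, the sketch below builds an =/X CIGAR string for a read aligned to the reference without gaps. It is a simplified illustration (no indels, introns, or clipping), and the function name `extended_cigar` is hypothetical.

```python
from itertools import groupby


def extended_cigar(read, reference):
    """Build an =/X CIGAR for an ungapped alignment of equal-length strings."""
    if len(read) != len(reference):
        raise ValueError("ungapped alignment needs equal lengths")
    # '=' marks a sequence match, 'X' a sequence mismatch.
    ops = ("=" if r == g else "X" for r, g in zip(read, reference))
    # Collapse runs of the same operation into <count><op> pairs.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))
```

A read ACGTA aligned against ACCTA gives 2=1X2=, where standard CIGAR would report 5M.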
Allows multiple alignments to be marked as primary if they have equally good mapping scores
For secondary alignments (in multiple mappings), uses '*' for SEQ and QUAL fields, to give smaller file sizes. However, the output will produce warnings in Picard and may not work with downstream tools
For RNA-Seq alignments, disallows XS:A:? when the sense direction is unclear, and replaces this value arbitrarily with XS:A:+. May be useful for some programs, such as Cufflinks, that cannot handle XS:A:?. However, if you use this flag, the reported value of XS:A:+ in these cases will not be meaningful.
In MD string, when known SNPs are given by the -v flag, prints difference nucleotides as lower-case when they differ from reference but match a known alternate allele
Extends alignments through soft clipped regions. CIGAR string and coordinates will be revised, but currently the MD string will reflect the clipped CIGAR
Action to take if there is a disagreement between CIGAR length and sequence length. Allowed values: ignore, warning (default), noprint, abort. Note that the noprint option does not print the CIGAR string at all if there is an error, so it may break a SAM parser
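The kind of disagreement this option refers to can be checked independently on SAM output. The sketch below sums the CIGAR operations that consume query bases (M, I, S, =, X, following the SAM format definition) and compares the total with the sequence length. The function names are hypothetical, and this is not GSNAP's internal check.

```python
import re

# CIGAR operations that consume bases of the query sequence.
QUERY_CONSUMING = set("MIS=X")


def cigar_query_length(cigar):
    """Number of query bases implied by a CIGAR string."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in pairs if op in QUERY_CONSUMING)


def lengths_agree(cigar, sequence):
    """True if the CIGAR length matches the sequence length."""
    return cigar_query_length(cigar) == len(sequence)
```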
Value to put into read-group id (RG-ID) field
Value to put into read-group name (RG-SM) field
Value to put into read-group library (RG-LB) field
Value to put into read-group platform (RG-PL) field

Help options

Check compiler assumptions
Show version
Show this help message

October 2022 gsnap 2021-12-17+ds-3