NAME

gmap_build - Tool for genome database creation for GMAP or GSNAP

SYNOPSIS

gmap_build [options...] -d <genome> [-c <transcriptome> -T <transcript_fasta>] <genome_fasta_files>

DESCRIPTION

gmap_build: Builds a gmap database for a genome to be used by GMAP or GSNAP. Part of GMAP package, version 2021-12-17.

You are free to name <genome> and <transcriptome> as you wish. You will use the same names when performing alignments subsequently using GMAP or GSNAP.

Note: If you are adding a transcriptome to an existing genome database, you do not need to specify the genome_fasta_files.
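As a rough illustration of the two invocation patterns described above, the following Python sketch assembles the argument lists without running anything. The helper name build_command, the database names hg38 and gencode, and the file names are hypothetical placeholders; only the -d, -c, and -T flags are taken from the synopsis.

```python
import shlex


def build_command(genome, fasta_files=(), transcriptome=None, transcripts=None):
    """Assemble a gmap_build argument list (hypothetical helper; nothing is executed)."""
    argv = ["gmap_build", "-d", genome]
    if transcriptome is not None:
        # The synopsis pairs -c with -T, so refuse to build one without the other.
        if transcripts is None:
            raise ValueError("-T is required when -c is given")
        argv += ["-c", transcriptome, "-T", transcripts]
    argv += list(fasta_files)
    return argv


# New genome database from two FASTA files (placeholder names).
new_genome = build_command("hg38", ["chr1.fa", "chr2.fa"])

# Adding a transcriptome to an existing genome: no genome FASTA files are listed.
add_transcriptome = build_command(
    "hg38", transcriptome="gencode", transcripts="transcripts.fa"
)

print(shlex.join(new_genome))
print(shlex.join(add_transcriptome))
```

The same genome name (hg38 in this sketch) would then be passed to GMAP or GSNAP when aligning.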

OPTIONS

-D, --dir=STRING: Destination directory for installation (defaults to gmapdb directory specified at configure time)
-d, --genomedb=STRING: Genome name (required)
-n, --names=STRING: Substitute names for contigs, provided in a file.

: The file can have two formats:

1.: A file with one column per line, with each line corresponding to a FASTA file, in the order given to gmap_build. The chromosome name for each FASTA file will be replaced with the desired chromosome name in the file. Every chromosome in the FASTA must have a corresponding line in the file. This is useful if you want to rename chromosomes with a systematic numbering pattern.
2.: A file with two columns per line, separated by white space. In each line, the original FASTA chromosome name should be in column 1 and the desired chromosome name will be in column 2.

: The meaning of file format 2 depends on whether --limit-to-names is specified. If so, the genome build will be limited to those chromosomes in this file. Otherwise, all chromosomes in the FASTA file will be included, but only those chromosomes in this file will be re-named, which provides an easy way to change just a few chromosome names.
: This file can be combined with the --sort=names option, in which case the order of chromosomes is the one given in the file. Every chromosome must then be listed in the file, and for chromosome names that should not be changed, column 2 can be blank (or the same as column 1). A blank column 2 is allowed only when specifying --sort=names, because otherwise the program cannot distinguish between a 1-column and a 2-column names file.
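The effect of a two-column names file, with and without --limit-to-names, can be modeled with a short Python sketch. This mirrors only the behavior documented here and is not gmap_build's own code; the functions parse_names and apply_names and the scaffold names are hypothetical.

```python
def parse_names(text):
    """Parse a two-column, whitespace-separated names file into {original: desired}."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            mapping[fields[0]] = fields[1]
    return mapping


def apply_names(fasta_chroms, mapping, limit_to_names=False):
    """Return the chromosome names a build would contain, per the documented rules."""
    if limit_to_names:
        # Only chromosomes listed in the names file are kept, under their new names.
        return [mapping[c] for c in fasta_chroms if c in mapping]
    # All chromosomes are kept; only those listed in the file are renamed.
    return [mapping.get(c, c) for c in fasta_chroms]


# Made-up names file content: original name in column 1, desired name in column 2.
names_text = "scaffold_1 chr1\nscaffold_3 chrM\n"
fasta_chroms = ["scaffold_1", "scaffold_2", "scaffold_3"]
mapping = parse_names(names_text)

print(apply_names(fasta_chroms, mapping))                       # ['chr1', 'scaffold_2', 'chrM']
print(apply_names(fasta_chroms, mapping, limit_to_names=True))  # ['chr1', 'chrM']
```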

-L, --limit-to-names: Limits the genome build to the chromosomes listed in the --names file. To restrict a build to certain chromosomes, combine this option with a --names file that either renames chromosomes or lists the same name in both columns for each desired chromosome.
-k, --kmer=INT: k-mer value for genomic index (allowed: 15 or less, default is 15)
-q INT: Sampling interval for genome (allowed: 1-3, default 3)
-s, --sort=STRING: Sort chromosomes using given method:

: none - use chromosomes as found in FASTA file(s) (default)
: alpha - sort chromosomes alphabetically (chr10 before chr2)
: numeric-alpha - chr1, chr1U, chr2, chrM, chrU, chrX, chrY
: chrom - chr1, chr2, chrM, chrX, chrY, chr1U, chrU
: names - sort chromosomes based on file provided to --names flag

-g, --gunzip: Files are gzipped, so need to gunzip each file first
-E, --fasta-pipe=STRING: Interpret argument as a command, instead of a list of FASTA files
-Q, --fastq: Files are in FASTQ format
-R, --revcomp: Reverse complement all contigs
-w INT: Wait (sleep) this many seconds after each step (default 2)
-o, --circular=STRING: Circular chromosomes (either a list of chromosomes separated by a comma, or a filename containing circular chromosomes, one per line). If you use the --names feature, then you should use the substitute name of the chromosome, not the original name, for this option. (NOTE: This behavior is different from previous versions, and starts with version 2020-10-20.)
-2, --altscaffold=STRING: File with alt scaffold info, listing alternate scaffolds, one per line, tab-delimited, with the following fields: (1) alt_scaf_acc, (2) parent_name, (3) orientation, (4) alt_scaf_start, (5) alt_scaf_stop, (6) parent_start, (7) parent_end.
-e, --nmessages=INT: Maximum number of messages (warnings, contig reports) to report (default 50)
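
The file arguments accepted by --circular and --altscaffold can be prepared with a few lines of Python. The sketch below is illustrative only: the helper names, the sample scaffold, its coordinates, and the "+" orientation value are made up, and the absence of a header row is an assumption based on the field list above.

```python
import csv
import io

# Field order as listed for --altscaffold.
ALT_FIELDS = ["alt_scaf_acc", "parent_name", "orientation",
              "alt_scaf_start", "alt_scaf_stop", "parent_start", "parent_end"]


def circular_lines(original_circular, mapping):
    """One circular chromosome per line, using the substitute name when --names renames it."""
    return "".join(mapping.get(name, name) + "\n" for name in original_circular)


def altscaffold_lines(rows):
    """Tab-delimited text, one alternate scaffold per line, with no header row (assumed)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[field] for field in ALT_FIELDS])
    return buf.getvalue()


# Hypothetical renaming taken from a --names file, and a made-up alternate scaffold.
mapping = {"scaffold_3": "chrM"}
circular_text = circular_lines(["scaffold_3"], mapping)
alt_text = altscaffold_lines([{
    "alt_scaf_acc": "alt_1", "parent_name": "chr1", "orientation": "+",
    "alt_scaf_start": 1, "alt_scaf_stop": 1000,
    "parent_start": 5001, "parent_end": 6000,
}])

print(circular_text, end="")
print(alt_text, end="")
```

Writing circular_text and alt_text to files yields inputs in the layouts described for -o and -2; the circular list uses chrM rather than scaffold_3 because the substitute name is the one expected when --names is in effect.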

Options for older genome formats:

-M, --mdflag=STRING: Use MD file from NCBI for mapping contigs to chromosomal coordinates
-C, --contigs-are-mapped: Find a chromosomal region in each FASTA header line. Useful for contigs that have been mapped to chromosomal coordinates. Ignored if the --mdflag is provided.

Options for transcriptome-guided alignment:

-c, --transcriptomedb=STRING: Transcriptome name
-T, --transcripts=FILE: FASTA file containing transcripts (required if specifying --transcriptomedb)
-t, --nthreads=INT: Number of threads for GMAP alignment of transcripts to genome (default 8)

Other tools of the GMAP suite are located in /usr/lib/gmap

October 2022

gmap_build 2021-12-17+ds-3