MAQ(1) | Bioinformatics Tools | MAQ(1) |
Maq - Mapping and Assembly with Qualities
maq command [options] arguments
maq.pl command [options] arguments
Maq is a software that builds mapping assemblies from short reads generated by the next-generation sequencing machines. It is particularly designed for Illumina-Solexa 1G Genetic Analyzer, and has a preliminary functionality to handle AB SOLiD data.
With Maq you can:
However, Maq can NOT:
Key Commands
Convert sequences in FASTA format to Maq's BFA (binary FASTA) format.
Convert reads in FASTQ format to Maq's BFQ (binary FASTQ) format.
OPTIONS:
Map reads to the reference sequences.
OPTIONS:
NOTE:
Merge a batch of read alignments together.
NOTE:
Remove pairs with identical outer coordinates. In principle, pairs with identical outer coordinates should happen rarely. However, due to the amplification in sample preparation, this occurs much more frequently than by chance. Practical analyses show that removing duplicates helps to improve the overall accuracy of SNP calling.
Call the consensus sequences from read mapping.
OPTIONS:
NOTE:
Calculate log-likelihood for all genotypes and store the results in GLF format (Genotyping Likelihood Format). Please check MAQ website for detailed descriptions of the file format and the related utilities.
Call consistent indels from paired end reads. The output is TAB delimited with each line consisting of chromosome, start position, type of the indel, number of reads across the indel, size of the indel and inserted/deleted nucleotides (separated by colon), number of indels on the reverse strand, number of indels on the forward strand, 5' sequence ahead of the indel, 3' sequence following the indel, number of reads aligned without indels and three additional columns for filters.
At the 3rd column, type of the indel, a star indicates the indel is confirmed by reads from both strands, a plus means the indel is hit by at least two reads but from the same strand, a minus shows the indel is only found on one read, and a dot means the indel is too close to another indel and is filtered out.
Users are recommended to run through `maq.pl indelpe' to correct the number of reads mapped without indels. For more details, see the `maq.pl indelpe' section.
Call potential homozygous indels and break points by detecting the abnormal alignment pattern around indels and break points. The output is also TAB delimited with each line consisting of chromosome, approximate coordinate, length of the abnormal region, number of reads mapped across the position, number of reads on the left-hand side of the position and number of reads on the right-hand side. The last column can be ignored.
The output contains many false positives. A recommended filter could be:
awk '$5+$6-$4 >= 3 && $4 <= 1' in.indelsoa
Note that this command does not aim to be an accurate indel
detector, but mainly helps to avoid some false positives in substitution
calling. In addition, it only works well given deep depth (~40X for
example); otherwise the false negative rate would be very high.
Format Converting
Convert Solexa FASTQ to standard/Sanger FASTQ format.
Convert Maq's BFQ format to standard FASTQ format.
Convert obsolete mapass2's map format to Maq's map format. The
old format does not contain read names.
Information Extracting
Display the read alignment in plain text. For reads aligned before the Smith-Waterman alignment, each line consists of read name, chromosome, position, strand, insert size from the outer coorniates of a pair, paired flag, mapping quality, single-end mapping quality, alternative mapping quality, number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, number of 1-mismatch hits of the first 24bp on the reference, length of the read, read sequence and its quality. Alternative mapping quality always equals to mapping quality if the reads are not paired. If reads are paired, it equals to the smaller mapping quality of the two ends. This alternative mapping quality is actually the mapping quality of an abnormal pair.
The fifth column, paired flag, is a bitwise flag. Its lower 4 bits give the orientation: 1 stands for FF, 2 for FR, 4 for RF, and 8 for RR, where FR means that the read with smaller coordinate is on the forward strand, and its mate is on the reverse strand. Only FR is allowed for a correct pair. The higher bits of this flag give further information. If the pair meets the paired end requirement, 16 will be set. If the two reads are mapped to different chromosomes, 32 will be set. If one of the two reads cannot be mapped at all, 64 will be set. The flag for a correct pair always equals to 18.
For reads aligned by the Smith-Waterman alignment afterwards, the flag is always 130. A line consists of read name, chromosome, position, strand, insert size, flag (always 130), position of the indel on the read (0 if no indel), length of the indels (positive for insertions and negative for deletions), mapping quality of its mate, number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, two zeros, length of the read, read sequence and its quality. The mate of a 130-flagged read always gets a flag 18.
Flag 192 indicates that the read is not mapped but its mate is mapped. For such a read pair, one read has flag 64 and the other has 192.
OPTIONS:
Read quality check. The mapcheck first reports the composition and the depth of the reference. After that there is a form. The first column indicates the position on a read. Following four columns which show the nucleotide composition, substitution rates between the reference and reads will be given. These rates and the numbers in the following columns are scaled to 999 and rounded to nearest integer. The next group of columns show the distribution of base qualities along the reads at a quality interval of 10. A decay in quality can usually be observed, which means bases at the end of read are less accurate. The last group of columns present the fraction of substitutions for read bases at a quality interval. This measures the accuracy of base quality estimation. Idealy, we expect to see 1 in the 3? column, 10 in the 2? column and 100 in the 1? column.
OPTIONS:
Display the alignment in a `pileup' text format. Each line consists of chromosome, position, reference base, depth and the bases on reads that cover this position. If -v is added on the command line, base qualities and mapping qualities will be presented in the sixth and seventh columns in order.
The fifth column always starts with `@'. In this column, read bases identical to the reference are showed in comma `,' or dot `.', and read bases different from the reference in letters. A comma or a upper case indicates that the base comes from a read aligned on the forward strand, while a dot or a lower case on the reverse strand.
This command is for users who want to develop their own SNP callers.
OPTIONS:
Extract the consensus sequences in FASTQ format. In the sequence lines, bases in lower case are essentially repeats or do not have sufficient coverage; bases in upper case indicate regions where SNPs can be reliably called. In the quality lines, ASCII of a character minus 33 gives the PHRED quality.
OPTIONS:
Extract SNP sites. Each line consists of chromosome, position, reference base, consensus base, Phred-like consensus quality, read depth, the average number of hits of reads covering this position, the highest mapping quality of the reads covering the position, the minimum consensus quality in the 3bp flanking regions at each side of the site (6bp in total), the second best call, log likelihood ratio of the second best and the third best call, and the third best call.
The 5th column is the key criterion when you judge the reliability of a SNP. However, as this quality is only calculated assuming site independency, you should also consider other columns to get more accurate SNP calls. Script command `maq.pl SNPfilter' is designed for this (see below).
The 7th column implies whether the site falls in a repetitive region. If no read covering the site can be mapped with high mapping quality, the flanking region is possibly repetitive or in the lack of good reads. A SNP at such site is usually not reliable.
The 8th column roughly gives the copy number of the flanking region in the reference genome. In most cases, this number approaches 1.00, which means the region is about unique. Sometimes you may see non-zero read depth but 0.00 at the 7th column. This indicates that all the reads covering the position have at least two mismatches. Maq only counts the number of 0- and 1-mismatch hits to the reference. This is due to a complex technical issue.
The 9th column gives the neighbouring quality. Filtering on this column is also required to get reliable SNPs. This idea is inspired by NQS, although NQS is initially designed for a single read instead of a consensus.
Show detailed information at all sites. The output format is identical to cns2snp report.
Extract the reference sequence.
Extract information averaged in a tilling window. The output is TAB delimited, which consists of reference name, coordinate divided by 1,000,000, SNP rate, het rate, raw read depth, read depth in approximately unique regions, the average number of hits of reads in the window and percent GC.
OPTIONS:
Simulation Related
Randomly introduce substitutions and indels to the reference. Substitutions and sinlge base-pair indels can be added.
OPTIONS:
Estimate/train parameters for read simulation.
Simulate paired end reads. File in.simupars.dat determines the read lengths and quality distribution. It is generated from simutrain, or can be downloaded from Maq website. In the output read files, a read name consists of the reference sequence name and the outer coordinates of the pair of simulated reads. By default, simulate assumes reads come from a diploid sequence which is generated by adding two different sets of mutations, including one base-pair indels, to in.ref.fasta.
OPTIONS:
NOTE:
Evaluate mapping qualities from simulated reads.
SOLiD Related
Convert nucleotide FASTA to colour-coded FASTA. Flag -c should be then applied to map command. In the output, letter `A' stands for color 0, `C' for 1, `G' for 2 and `T' for 3. Each sequence in the output is 1bp shorter than the input.
Convert color alignment to nucleotide alignment. The input
in.ref.nt.bfa is the nucleotide binary FASTA reference file. It
must correspond to the original file from which the color reference is
converted. Nucleotide consensus can be called from the resultant
alignment.
Miscellaneous/Advanced Commands
Filter bad alignments in in.map. Command-line options are described in the `assemble' command.
Convert eland alignment to maq's .map format. File in.list consists of the sequence names that appear at the seventh column of the eland alignment file in.eland and the name you expect to see in maq alignment. The following is an example:
cX.fa chrX c1.fa chr1 c2.fa chr2
If you are aligning reads in several batches using eland, it is important to use the same in.list for the conversion. In addition, maq will load all the alignments and sort them in the memory. If you have concatenate several eland outputs into one huge file, you should separate it into smaller files to prevent maq from eating all your machine memory.
This command actually aims to show Eland alignment in Maqview. As no quality information is available, the resultant maq alignment file should not be used to call consensus genotypes.
Convert Illumina's Export format to Maq's .map format. Export format is a new alignment format since SolexaPipeline-0.3.0 which also calculates mapping qualities like maq. The resultant file can be used to call consensus genotypes as most of necessary information is available for maq to do this accurately.
OPTIONS:
Demonstrate the use of maq and its companion scripts. This command will simulate reads from a FASTA file in.fasta. The sequence length and qualities are determined by in.simudat which is generated from maq simutrain or can be downloaded from Maq website. The simulated reads will then be mapped with maq.pl easyrun. The alignment accuracy is evaluated by maq simustat, the consensus accuracy by maq simucns, and the SNP accuracy by maq_eval.pl.
By default, paired end reads will be simulated and a diploid sequence will be generated from the input by adding mutations to either haploid type. The insert size and mutation rate are controlled by maq simulate.
OPTIONS:
NOTE:
Analyses pipeline for small genomes. Easyrun command will run most of analyses implemented in maq. By default, easyrun assumes all the input read sequences files are single-end and independent; when -p is specified, two read sequence files are required, one for each end.
Several files will be generated in out.dir, among which the following files are the key output:
OPTIONS:
NOTES:
Rule out SNPs that are covered by few reads (specified by -d), by too many reads (specified by -D), near (specified by -w) to a potential indel, falling in a possible repetitve region (characterized by -Q), or having low-quality neighbouring bases (specified by -n). If maxWinSNP or more SNPs appear in any densWinSize window, they will also be filtered out together.
OPTIONS:
Correct the number of reads mapped without indels for homopolymer tracts. This command modify the 4th, 10th and the last three columns of in.indelpe and output the result in out.indelpe. After the correction, the following awk command gives putative homozygous indels:
awk '($3=="*"⎪⎪$3=="+") && $6+$7>=3 && ($6+$7)/$4>=0.75'
and the following gives heterozygotes:
awk '($3=="*"⎪⎪$3=="+") && $6+$7>=3 && ($6+$7)/$4<0.75'
Please note that this indelpe command just implements several heuristic rules. It does not correct for impure homopolymer runs or di-nucleotide/triplet repeats. Consequently, the two awk commands only give approximate hom/het indels.
GNU General Public License, version 3 (GPLv3)
<http://maq.sourceforge.net>
Heng Li <lh3@sanger.ac.uk>
2008-09-22 | maq-0.7.1 |