NAME

bamfeaturecount - evaluate alignments produce by an RNA-seq aligner

SYNOPSIS

bamfeaturecount annotation.gtf mapped.bam [options]

DESCRIPTION

bamfeaturecount evaluates the alignments produced by an RNA-seq aligner (like e.g. STAR) and outputs the average coverage found for each transcript and gene. It requires an annotation file in the GTF format in addition to a BAM file containing RNA-seq alignments. The annotation file needs to have been processed by the filtergtf program to ensure a sorting which is compatible with bamfeaturecounts expectations.

The output contains two types of lines, one starting with [transcript] and another starting with [gene]. In both cases the information provided is given as a tab separated set of columns.

The following is an example of a transcript line:

[transcript] chr1:C1orf159 ENST00000472741.5 (0.0618557,5.38144,97) (0.0496454,11.305,564) (0.0514372,10.4357,661) [1116059,1116087) [1091990,1092103) [1091045,1091565)

transcript lines have always at least 6 columns. The second column provides the reference sequence name (as given in the BAM header and GTF file) and the name of the gen concerned separated by a colon (here chr1 and C1orf159). The third column contains the transcript identifier (transcript_id in the GTF file). Column 4, 5 and 6 each contain either a triplet of numbers (A,B,C) or the symbol *. Column 4 contains regions unique to this transcript (i.e. stretches on the genome not shared with any other transcript). Column 5 contains regions shared by at least one other transcript. Column 6 contains the information for all regions covered by the transcript (unique and non unique). If a transcript does not have any respective intervals (i.e. if every base is also covered by at least one other transcript or all bases are unique to this transcript) then the column contains the symbol *. Otherwise a triplet (A,B,C) is given where A denotes the fraction of bases not covered by any alignment (in the example 0.0618557 or 6% of the bases unique to this transcript are not covered), B contains the average coverage (in the example the average sequencing depth on the unique bases for this transcript is 5.38) and C the total number of bases (in the example 97 bases of this transcript are not shared with any other transcript). The rest of the columns (> 6) contain the zero bases intervals of exons for this transcript on the reference sequence.

The gene lines should be disregarded in the current version of the program.

The following key=value pairs can be given:

T=<filename>: set the prefix for temporary file names

verbose=<1>: print some progress report while processing

threads=<1>: number of threads used for processing. Set this to 0 to use all cores detected on the machine.

mapqmin=<255>: Minimum mapping quality allowed for alignments considered. By default this is 255 (the value used by STAR to mark uniquely mapped reads).

mapqmax=<255>: Maximum mapping quality allowed for alignments considered. By default this is 255 (the value used by STAR to mark uniquely mapped reads).

uncoveredthres=<0.1>: maximum fraction of bases allowed to be uncovered in a transcript so the transcript will be reported (i.e. minimum value allowed for the first number given in column 6 of the transcript output).

uniqueuncoveredthres=<0.1>: maximum fraction of bases allowed to be uncovered in the unique region of a transcript so the transcript will be reported (i.e. minimum value allowed for the first number given in column 4 of the transcript output).

exclude=<SECONDARY>: Do not include reads in the output that have any of the given flags set. The flags are given separated by commas. Valid flags are:

PAIRED:: read was paired in sequencing
PROPER_PAIR:: read has been mapped as part of a proper pair
UNMAP:: read was not mapped
MUNMAP:: mate of read was not mapped
REVERSE:: read was mapped to the reverse strand
MREVERSE:: mate of read was mapped to the reverse strand
READ1:: read was first read of a pair during sequencing
READ2:: read was second read of a pair during sequencing
SECONDARY:: alignment is secondary, i.e. an alternative mapping to the primary alignment in the same file
QCFAIL:: read as marked as having failed quality control
DUP:: read is marked as a duplicate of another read in the same file (see bammarkduplicates)
SUPPLEMENTARY:: read is marked as supplementary alignment

exportcdna=<0>: instead of feature counting generate a FastA file containing the CDNA as designated by the GTF annotation file. The second parameter (BAM file) needs to be specified, but will not be read. This option requires a reference FastA file suitable for the GTF file to be provided via the reference key.

reference=<>: name of a reference FastA file. This is required for the exportcdna option.

AUTHOR

Written by German Tischler-Höhle.

REPORTING BUGS

Report bugs to <germant@miltenyibiotec.de>

COPYRIGHT

Copyright © 2009-2019 German Tischler-Höhle, © 2011-2014 Genome Research Limited. License GPLv3+: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

August 2019

BIOBAMBAM