NAME

bamdownsamplerandom - downsample a SAM, BAM or CRAM file

SYNOPSIS

bamdownsamplerandom [options]

DESCRIPTION

bamdownsamplerandom reads a SAM, BAM or CRAM file from standard input, randomly discards reads and writes the remaining reads to standard output in BAM format. For a pair of reads either both ends are discarded or both ends are kept. The order of reads in the output file may be different from the order in the input if the reads in the input file are not collated by their read name.

The following key=value pairs can be given:

p=<1>: probability for a pair of reads or a single end read to be kept. By default all reads are kept.

seed=<>: seed used for the random number generator. By default the current time is used, i.e. each run of the program will select a different subset of reads from an input file. If the behaviour of the program needs to be reproducible a fixed number can be used as the random seed.
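
The combined effect of p and seed can be modelled with a short Python sketch. This is an illustration under stated assumptions, not the program's implementation: the function name downsample, the (name, payload) tuples and the use of Python's random.Random are all hypothetical stand-ins.

```python
import random

def downsample(reads, p, seed=None):
    """Keep each read name (single read or pair) with probability p.

    reads: iterable of (name, payload) tuples; both ends of a pair share a name.
    seed:  fixed integer for reproducible output; None lets Python pick a seed,
           so repeated runs select different subsets.
    """
    rng = random.Random(seed)
    decision = {}  # read name -> keep/discard, drawn once per name
    kept = []
    for name, payload in reads:
        if name not in decision:
            decision[name] = rng.random() < p
        if decision[name]:
            kept.append((name, payload))
    return kept

# Hypothetical input: two pairs (r1, r3) and one single end read (r2).
reads = [("r1", "end1"), ("r1", "end2"), ("r2", "single"),
         ("r3", "end1"), ("r3", "end2")]
first = downsample(reads, p=0.5, seed=42)
second = downsample(reads, p=0.5, seed=42)  # same seed, same subset
```

Because the decision is drawn once per read name, both ends of a pair always share the same fate, and a fixed seed yields the same subset on every run.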

I=<stdin>: input file name (data is read from standard input if this option is not given)

inputformat=<bam>: input file format. All versions of bamdownsamplerandom come with support for the BAM input format. If the program is additionally linked to the io_lib package, then the following options are valid:

bam:: BAM (see http://samtools.sourceforge.net/SAM1.pdf)
sam:: SAM (see http://samtools.sourceforge.net/SAM1.pdf)
cram:: CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)

level=<-1|0|1|9|11>: set compression level of the output BAM file. Valid values are

-1:: zlib/gzip default compression level
0:: uncompressed
1:: zlib/gzip level 1 (fast) compression
9:: zlib/gzip level 9 (best) compression

If libmaus has been compiled with support for igzip (see https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data) then an additional valid value is

11:: igzip compression

exclude=<SECONDARY,SUPPLEMENTARY>: Do not include reads in the output that have any of the given flags set. The flags are given separated by commas. Valid flags are:

PAIRED:: read was paired in sequencing
PROPER_PAIR:: read has been mapped as part of a proper pair
UNMAP:: read was not mapped
MUNMAP:: mate of read was not mapped
REVERSE:: read was mapped to the reverse strand
MREVERSE:: mate of read was mapped to the reverse strand
READ1:: read was first read of a pair during sequencing
READ2:: read was second read of a pair during sequencing
SECONDARY:: alignment is secondary, i.e. an alternative mapping to the primary alignment in the same file
QCFAIL:: read is marked as having failed quality control
DUP:: read is marked as a duplicate of another read in the same file (see bammarkduplicates)
SUPPLEMENTARY:: read is marked as supplementary alignment
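
As a sketch of how such a flag list maps onto the FLAG field of an alignment, the Python below combines the names into a bit mask. The bit values are those defined by the SAM specification; the helper names exclude_mask and is_excluded are hypothetical, and the way the program parses the option internally is an assumption.

```python
# FLAG bits as defined in the SAM specification.
SAM_FLAGS = {
    "PAIRED": 0x1,
    "PROPER_PAIR": 0x2,
    "UNMAP": 0x4,
    "MUNMAP": 0x8,
    "REVERSE": 0x10,
    "MREVERSE": 0x20,
    "READ1": 0x40,
    "READ2": 0x80,
    "SECONDARY": 0x100,
    "QCFAIL": 0x200,
    "DUP": 0x400,
    "SUPPLEMENTARY": 0x800,
}

def exclude_mask(spec):
    """Turn a comma separated flag list into a single bit mask."""
    mask = 0
    for name in spec.split(","):
        mask |= SAM_FLAGS[name.strip()]
    return mask

def is_excluded(flag, mask):
    """A read is dropped if any of the masked bits is set in its FLAG."""
    return (flag & mask) != 0

default_mask = exclude_mask("SECONDARY,SUPPLEMENTARY")  # 0x100 | 0x800
```

With the default setting, a read with FLAG 99 (paired, proper pair, mate reverse, first read) passes the filter, while any read carrying the secondary or supplementary bit is dropped.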

disablevalidation=<0>: Valid values are

0:: run input file validation on alignments (this is the default)
1:: do not check the validity of the input file (this may help for some broken input files, but it is a security risk as it can lead to the execution of arbitrary code through a forged input file).

colhlog=<18>: base two logarithm of the size of the hash table used for collation. The default value of 18 should work reasonably well for most input files. Please see the biobambam paper at arxiv.org/abs/1306.0836 for details.

colsbs=<128M>: size of hash table overflow list in bytes. The default of 128MB should work reasonably well for most input files. Please see the biobambam paper at arxiv.org/abs/1306.0836 for details.

T=<bamdownsamplerandom_hostname_pid_time>: file name of temporary file used for collation

ranges=<>: coordinate ranges selected from input. This option is only available for input files in BAM format which have a corresponding index (.bai file) and if input is via file (i.e. the I argument is set). Valid ranges consist either of

whole reference sequence:: a whole reference sequence (e.g. "chr1")
half open interval on reference sequence:: an interval on a reference sequence half open on the right (e.g. "chr1:50000" which means alignments overlapping chr1 from position 50000 to the end of chr1)
interval on reference sequence:: an interval on a reference sequence (e.g. "chr1:50000-60000" which means alignments overlapping positions 50000 to 60000 on chr1)

Multiple ranges are separated by space characters (e.g. ranges="chr1:10000-20000 chr1:30000-40000").
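
The three range forms can be told apart with a few lines of Python. This is a minimal sketch of the syntax described above, not the program's parser; the function names are hypothetical, and reference names that themselves contain ":" or "-" are not handled specially beyond splitting at the last colon.

```python
def parse_range(text):
    """Parse one range into (reference, start, end); None marks an open bound."""
    if ":" not in text:
        # Whole reference sequence, e.g. "chr1".
        return (text, None, None)
    ref, _, interval = text.rpartition(":")
    if "-" in interval:
        # Closed interval, e.g. "chr1:50000-60000".
        start, _, end = interval.partition("-")
        return (ref, int(start), int(end))
    # Open on the right, e.g. "chr1:50000".
    return (ref, int(interval), None)

def parse_ranges(value):
    """Split the value of the ranges key at whitespace and parse each part."""
    return [parse_range(part) for part in value.split()]

parsed = parse_ranges("chr1:10000-20000 chr1:30000-40000")
```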

reference=<>: file name of the reference for CRAM input files. If this key is unset, then the CRAM file header will be scanned for obtaining a reference file name.

tmpfile=<filename>: prefix for temporary files. By default the temporary files are created in the current directory.

outputformat=<bam>: output file format. All versions of bamdownsamplerandom come with support for the BAM output format. If the program is additionally linked to the io_lib package, then the following options are valid:

bam:: BAM (see http://samtools.sourceforge.net/SAM1.pdf)
sam:: SAM (see http://samtools.sourceforge.net/SAM1.pdf)
cram:: CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit). This format is not advisable for data sorted by query name.

O=<[stdout]>: output filename, standard output if unset.

outputthreads=<[1]>: output helper threads, only valid for outputformat=bam.

md5=<0|1>: md5 checksum creation for output file. This option can only be given if outputformat=bam. Then valid values are

0:: do not compute checksum. This is the default.
1:: compute checksum. If the md5filename key is set, then the checksum is written to the given file. If md5filename is unset, then no checksum will be computed.

md5filename: file name for md5 checksum if md5=1.

index=<0|1>: compute BAM index for output file. This option can only be given if outputformat=bam. Then valid values are

0:: do not compute BAM index. This is the default.
1:: compute BAM index. If the indexfilename key is set, then the BAM index is written to the given file. If indexfilename is unset, then no BAM index will be computed.

indexfilename: file name for output BAM index if index=1.
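
The dependencies between the output keys (md5 and index are only valid for outputformat=bam, and each needs its file name key to have any effect) can be captured in a small helper that assembles an argument list. The sketch below is hypothetical: build_args is not part of biobambam, the file names are placeholders, and the list is only constructed, not executed.

```python
def build_args(p, infile, outfile, seed=None, outputformat="bam",
               md5filename=None, indexfilename=None):
    """Assemble key=value arguments for bamdownsamplerandom."""
    args = ["bamdownsamplerandom", f"p={p}", f"I={infile}", f"O={outfile}",
            f"outputformat={outputformat}"]
    if seed is not None:
        args.append(f"seed={seed}")
    # md5 and index can only be given together with outputformat=bam.
    if (md5filename or indexfilename) and outputformat != "bam":
        raise ValueError("md5 and index require outputformat=bam")
    # Without a file name no checksum or index is produced, so the
    # switch and its file name key are always emitted together.
    if md5filename:
        args += ["md5=1", f"md5filename={md5filename}"]
    if indexfilename:
        args += ["index=1", f"indexfilename={indexfilename}"]
    return args

# Placeholder file names, chosen for illustration only.
args = build_args(0.25, "in.bam", "out.bam", seed=42,
                  md5filename="out.bam.md5", indexfilename="out.bam.bai")
```

The resulting list could be handed to subprocess.run on a system where the program is installed.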

hash=<0|1>: use hash of query name instead of a random number for selection. This makes the output depend on how random the hashes produced for the query names are, but it has the advantage of not requiring collation to keep pairs together. In contrast to selection by random number, the order of retained reads does not change for hash=1.
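
The idea behind hash based selection can be sketched in Python as follows. The hash function actually used by the program is not stated here, so MD5 via hashlib is purely an assumption for illustration, as are the function names.

```python
import hashlib

def keep_by_hash(name, p):
    """Decide from the query name alone whether a read is kept.

    The first 8 bytes of the digest are read as an integer in [0, 2**64)
    and compared against the threshold p * 2**64.
    """
    digest = hashlib.md5(name.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return value < int(p * 2**64)

def downsample_by_hash(reads, p):
    """Filter (name, payload) tuples in input order; no collation needed."""
    return [(name, payload) for name, payload in reads if keep_by_hash(name, p)]

# Hypothetical input in which the two ends of r1 are not adjacent.
reads = [("r1", "end1"), ("r2", "single"), ("r1", "end2")]
kept = downsample_by_hash(reads, 0.5)
```

Since the decision depends only on the query name, both ends of a pair receive the same verdict wherever they appear in the file, which is why the reads can be filtered in a single pass without reordering.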

AUTHOR

Written by German Tischler.

REPORTING BUGS

Report bugs to <germant@miltenyibiotec.de>

COPYRIGHT

Copyright © 2009-2014 German Tischler, © 2011-2014 Genome Research Limited. License GPLv3+: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

October 2014

BIOBAMBAM