bgzip(1) | Bioinformatics tools | bgzip(1) |
bgzip - Block compression/decompression utility
bgzip [-cdfhikrt] [-b virtualOffset] [-I index_name] [-l compression_level] [-s size] [-@ threads] [file]
Bgzip compresses files in a similar manner to, and compatible with, gzip(1). The file is compressed into a series of small (less than 64K) 'BGZF' blocks. This allows indexes to be built against the compressed file and used to retrieve portions of the data without having to decompress the entire file.
If no file is specified on the command line, bgzip compresses (or, with the -d option, decompresses) standard input to standard output. If a file is specified, it is compressed (or decompressed with -d). With the -c option the result is written to standard output. Otherwise, when compressing, bgzip writes to a new file with a .gz suffix and removes the original. When decompressing, the input file must have a .gz suffix, which is removed to form the output name; the input file is likewise removed once decompression completes.
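The naming rule above can be sketched in a few lines of Python. This is only an illustration of the behaviour as described in the paragraph; output_name is a hypothetical helper, not part of bgzip or htslib, and it models nothing beyond the .gz suffix rule stated here.

```python
def output_name(path: str, decompress: bool = False) -> str:
    """Return the output file name implied by the suffix rule described above.

    Hypothetical helper for illustration only; it is not bgzip's code.
    """
    if not decompress:
        # Compression: append a .gz suffix to the input name.
        return path + ".gz"
    if not path.endswith(".gz"):
        # Decompression: the text requires the input to carry a .gz suffix.
        raise ValueError(f"{path}: input name has no .gz suffix")
    # Decompression: strip the suffix to form the output name.
    return path[: -len(".gz")]
```

For example, output_name("words") gives "words.gz", and output_name("words.gz", decompress=True) gives "words".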
The BGZF format written by bgzip is described in the SAM format specification available from http://samtools.github.io/hts-specs/SAMv1.pdf.
BGZF makes use of a gzip feature which allows compressed files to be concatenated. The input data is divided into blocks which are no larger than 64 kilobytes both before and after compression (including compression headers). Each block is compressed into its own gzip file. The gzip header includes an extra sub-field with identifier 'BC' that records the length of the compressed block, including all headers (the specification stores this value as the total block size minus one).
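As a rough illustration of the block layout, the following Python sketch builds two BGZF-style blocks in memory and walks them using the 'BC' sub-field. It assumes the field layout given in the SAM specification (a 12-byte fixed gzip header, a 6-byte extra field, and the block length stored as total size minus one). The function names make_bgzf_block and read_block_size are hypothetical and are not htslib APIs.

```python
import gzip
import struct
import zlib


def make_bgzf_block(data: bytes) -> bytes:
    """Wrap data in one gzip member carrying a 'BC' extra sub-field."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib header
    cdata = comp.compress(data) + comp.flush()
    # fixed header (12) + extra field (6) + payload + CRC32 and ISIZE (8)
    block_size = 12 + 6 + len(cdata) + 8
    # ID1, ID2, CM=deflate, FLG=FEXTRA, MTIME, XFL, OS=unknown, XLEN
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    # SI1='B', SI2='C', SLEN=2, then the block size minus one (assumed per spec)
    extra = struct.pack("<BBHH", 66, 67, 2, block_size - 1)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + cdata + trailer


def read_block_size(buf: bytes, offset: int = 0) -> int:
    """Return the total length of the block that starts at offset."""
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
        "<BBBBIBBH", buf, offset
    )
    if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
        raise ValueError("not a gzip member with an extra field")
    pos = offset + 12
    end = pos + xlen
    while pos + 4 <= end:  # scan the extra sub-fields for 'BC'
        si1, si2, slen = struct.unpack_from("<BBH", buf, pos)
        if (si1, si2) == (66, 67) and slen == 2:
            (stored,) = struct.unpack_from("<H", buf, pos + 4)
            return stored + 1
        pos += 4 + slen
    raise ValueError("no BC sub-field found")


# Two concatenated blocks still form a valid gzip stream.
blocks = make_bgzf_block(b"hello ") + make_bgzf_block(b"world\n")
first_len = read_block_size(blocks, 0)
second_len = read_block_size(blocks, first_len)
```

Because each block records its own length, a reader can hop from block to block without inflating any data, while an ordinary gzip reader such as gzip.decompress(blocks) still sees a single concatenated stream.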
The index format is a binary file listing pairs of compressed and uncompressed offsets in a BGZF file. Each compressed offset points to the start of a BGZF block. The uncompressed offset is the corresponding location in the uncompressed data stream.
All values are stored as little-endian 64-bit unsigned integers.
The file contents are:
uint64_t number_entries
followed by number_entries pairs of:
uint64_t compressed_offset
uint64_t uncompressed_offset
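The layout above can be read with a few lines of Python. The sketch below is illustrative only: write_gzi, read_gzi and find_block are hypothetical helpers, the sample offsets are invented, and the fallback to compressed offset 0 when no entry precedes the target is an assumption rather than something the text states.

```python
import bisect
import io
import struct


def write_gzi(entries):
    """Serialise (compressed, uncompressed) offset pairs in the layout above."""
    out = struct.pack("<Q", len(entries))  # number_entries
    for compressed, uncompressed in entries:
        out += struct.pack("<QQ", compressed, uncompressed)
    return out


def read_gzi(stream):
    """Parse an index into a list of (compressed, uncompressed) offset pairs."""
    (count,) = struct.unpack("<Q", stream.read(8))
    return [struct.unpack("<QQ", stream.read(16)) for _ in range(count)]


def find_block(entries, target):
    """Locate the block holding uncompressed offset target.

    Returns (compressed offset of the block, offset of target within it).
    """
    uncompressed_starts = [u for _, u in entries]
    i = bisect.bisect_right(uncompressed_starts, target) - 1
    if i < 0:
        # Assumption: data before the first listed entry starts at offset 0.
        return 0, target
    compressed, uncompressed = entries[i]
    return compressed, target - uncompressed


# Invented sample offsets, round-tripped through the binary layout.
sample = [(18000, 65280), (36500, 130560)]
index = read_gzi(io.BytesIO(write_gzi(sample)))
```

With this index, a request for uncompressed offset 70000 resolves to the block at compressed offset 18000, so a reader can seek there and inflate only that block.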
# Compress stdin to stdout
bgzip < /usr/share/dict/words > /tmp/words.gz

# Make a .gzi index
bgzip -r /tmp/words.gz

# Extract part of the data using the index
bgzip -b 367635 -s 4 /tmp/words.gz

# Uncompress the whole file, removing the compressed copy
bgzip -d /tmp/words.gz
The BGZF library was originally implemented by Bob Handsaker and modified by Heng Li for remote file access and in-memory caching.
18 August 2022 | htslib-1.16 |