6.3.2.1. starch
With high-throughput sequencing generating large amounts of genomic data, archiving can be a critical part of an analysis toolkit. BEDOPS includes the starch utility to provide a method for efficient and lossless compression of UCSC BED-formatted data into the Starch v2 format.
Starch v2 archives can be extracted with unstarch to recover the original BED input, or processed as inputs to bedops and bedmap, where set operations and element calculations can be performed directly and without the need for intermediate file extraction.
The starch utility includes large file support on 64-bit operating systems, enabling compression of inputs larger than 2 GB (a common file-size limit on 32-bit systems).
Data can be stored with one of two open-source backend compression methods, either bzip2 or gzip, giving the end user a reasonable tradeoff between speed and storage performance. This can be useful when working with constrained storage or slower hardware.
6.3.2.1.1. Inputs and outputs
6.3.2.1.1.1. Input
As with other BEDOPS utilities, starch takes sorted BED data as input. You can use sort-bed to sort BED data, piping it into starch as standard input (see the Example section below).
Note
While more than three columns may be specified, most of the space savings in the Starch format are derived from a pre-processing step on the coordinates. Therefore, minimizing or removing unnecessary columnar data from the fourth column on (e.g., with cut -f1-3 or similar) can help improve compression efficiency considerably.
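The column trimming suggested in this note can be sketched as follows. The file names (sample.bed, sample.3col.bed, sample.starch) and the sample rows are hypothetical, and the compression step is guarded so that the snippet still runs on a system where the BEDOPS tools are not installed.

```shell
# Create a small five-column BED file (hypothetical sample data).
printf 'chr1\t100\t200\tid-1\t10\nchr1\t300\t400\tid-2\t20\n' > sample.bed

# Keep only the chromosome, start and stop columns.
cut -f1-3 sample.bed > sample.3col.bed

# Sort and compress the trimmed data if the BEDOPS tools are on the PATH.
if command -v sort-bed >/dev/null 2>&1 && command -v starch >/dev/null 2>&1; then
    sort-bed sample.3col.bed | starch - > sample.starch
fi
```

Trimming is only appropriate when the dropped columns are not needed later, since they cannot be recovered from the archive.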
6.3.2.1.1.2. Output
This utility outputs a Starch v2-formatted archive file.
6.3.2.1.2. Requirements
The starch tool requires data in a relaxed variation of the BED format as described by UCSC’s browser documentation. BED data should be sorted before compression, e.g. with BEDOPS sort-bed.
At a minimum, three columns are required to specify the chromosome name and start and stop positions. Additional columns may be specified, containing up to 128 kB of data per row (including tab delimiters).
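As a rough pre-flight check of these requirements, the following sketch flags rows with fewer than three tab-delimited columns or rows that exceed the per-row size limit. It is not part of BEDOPS. The file names check.bed and check.log are hypothetical, and the limit of 131072 bytes is an assumption about what "128 kB" means here.

```shell
# Hypothetical sample input: the second row has only two columns.
printf 'chr1\t100\t200\nchr1\t300\n' > check.bed

# Report rows with fewer than three tab-delimited columns, or rows longer
# than 128 kB (assumed to be 131072 bytes, tab delimiters included).
# LC_ALL=C makes length() count bytes rather than multibyte characters.
LC_ALL=C awk -F'\t' -v max=131072 '
    NF < 3           { printf "line %d: fewer than three columns\n", NR }
    length($0) > max { printf "line %d: row longer than %d bytes\n", NR, max }
' check.bed > check.log

cat check.log
```

The check does not account for header lines or verify sort order; it only covers the column count and row length stated above.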
6.3.2.1.3. Usage
Use the --help option to list all options:
starch
citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
binary version: 2.3.0 (creates archive version: 2.0.0)
authors: Alex Reynolds and Shane Neph
USAGE: starch [--note="foo bar..."] [--bzip2 | --gzip] [--header] [<unique-tag>] <bed-file>
* BED input must be sorted lexicographically (e.g., using BEDOPS sort-bed).
* Please use '-' to indicate reading BED data from standard input.
* Output must be directed to a regular file.
* The bzip2 compression type makes smaller archives, while gzip extracts faster.
Process Flags:
--note="foo bar..."   Append note to output archive metadata (optional)
--bzip2 | --gzip      Specify backend compression type (optional, default is bzip2)
--header              Support BED input with custom UCSC track, SAM or VCF headers, or generic comments (optional)
<unique-tag>          Specify unique identifier for transformed data (optional)
--help                Show this usage message
--version             Show binary version
6.3.2.1.4. Options
6.3.2.1.4.1. Backend compression type
Use the --bzip2 or --gzip option to compress transformed BED data with the bzip2 or gzip algorithm, respectively. By default, starch uses the bzip2 method.
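A minimal sketch of choosing the backend, using only the flags listed in the usage statement. The input and output file names and the sample rows are hypothetical, and the starch calls are guarded so the snippet runs even where BEDOPS is not installed.

```shell
# Hypothetical sample data, already sorted.
printf 'chr1\t100\t200\nchr1\t300\t400\n' > backend.bed

if command -v starch >/dev/null 2>&1; then
    # bzip2 backend (the default, so --bzip2 could be omitted).
    starch --bzip2 backend.bed > backend.bzip2.starch
    # gzip backend.
    starch --gzip backend.bed > backend.gzip.starch
    # Compare the resulting archive sizes in bytes.
    wc -c backend.bzip2.starch backend.gzip.starch
fi
```

On an input this small the size difference says little; the comparison is more informative on real datasets.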
6.3.2.1.4.2. Note
Use the --note="xyz..." option to add a custom string that describes the archive. This data can be retrieved with unstarch --note.
Tip
Examples of usage might include a description of the experiment associated with the data, a URL to a UCSC Genome Browser session, or a bar code or other unique identifier for internal lab or LIMS use.
Note
The only limitation on the length of a note is the command-line shell’s maximum argument length parameter (as found on most UNIX systems with the command getconf ARG_MAX) minus the length of the non- --note="..." command components. On most desktop systems, this value will be approximately 256 kB.
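The following sketch prints the argument-length limit on the current system and then stores and retrieves a note. The note text and file names are hypothetical, and passing the archive name after unstarch --note is an assumption about the unstarch invocation, which this section does not spell out.

```shell
# Upper bound, in bytes, on the combined length of command-line arguments.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX is ${arg_max} bytes"

# Hypothetical single-row input.
printf 'chr1\t100\t200\n' > noted.bed

if command -v starch >/dev/null 2>&1 && command -v unstarch >/dev/null 2>&1; then
    # Store a description with the archive, then read it back.
    starch --note="hypothetical experiment ABC-123" noted.bed > noted.starch
    unstarch --note noted.starch
fi
```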
6.3.2.1.4.3. Headers
Add the --header flag if the BED data being compressed contain extra header data, for example custom UCSC track, SAM or VCF headers, or generic comments, such as those exported from a UCSC Genome Browser session.
Note
If the BED data contain custom headers and --header is not specified, starch will be unable to read chromosome data correctly and will exit with an error state.
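As an illustration, the sketch below builds a small file with a UCSC-style track line ahead of the BED rows and compresses it with --header. The track line contents and file names are hypothetical, and the starch call is guarded so the snippet runs where BEDOPS is not installed.

```shell
# Hypothetical export with a track line ahead of the BED rows.
printf 'track name="example" description="hypothetical track"\nchr1\t100\t200\nchr1\t300\t400\n' > with-header.bed

if command -v starch >/dev/null 2>&1; then
    # Per the note above, omitting --header on this input is expected to
    # make starch exit with an error.
    starch --header with-header.bed > with-header.starch
fi
```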
6.3.2.1.4.4. Unique tag
Adding a <unique-tag> string replaces portions of the filename key in the archive’s stream metadata.
Note
This feature is largely obsolete and included for legacy support. It is better to use the --note="xyz..." option to add identifiers or other custom data.
6.3.2.1.5. Example
To compress unsorted BED data (or data of unknown sort order), we feed starch a sorted stream, using the hyphen (-) to specify standard input:
$ sort-bed unsorted.bed | starch - > sorted.starch
This creates the file sorted.starch, which uses the bzip2 algorithm to compress transformed BED data from a sorted permutation of data in unsorted.bed. No note or custom tag data is added.
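Because compression is described as lossless, one way to check an archive is to extract it and compare the result with the sorted input. The sketch below assumes that unstarch writes the extracted BED data to standard output when given an archive name, and it uses hypothetical sample rows. The BEDOPS steps are guarded so the snippet runs where the tools are not installed.

```shell
# Hypothetical unsorted sample data.
printf 'chr2\t50\t60\nchr1\t300\t400\nchr1\t100\t200\n' > unsorted.bed

if command -v sort-bed >/dev/null 2>&1 && command -v starch >/dev/null 2>&1 \
        && command -v unstarch >/dev/null 2>&1; then
    sort-bed unsorted.bed > sorted.bed
    starch sorted.bed > sorted.starch
    # cmp exits with status 0 when the extracted data match the sorted
    # input byte for byte.
    unstarch sorted.starch | cmp - sorted.bed && echo "round trip OK"
fi
```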
It is possible to speed up the compression of a BED file by using a cluster. Start by reviewing our starchcluster script.