6.3.2.1. starch

With high-throughput sequencing generating large amounts of genomic data, archiving can be a critical part of an analysis toolkit. BEDOPS includes the starch utility to provide a method for efficient and lossless compression of UCSC BED-formatted data into the Starch v2 format.

Starch v2 archives can be extracted with unstarch to recover the original BED input, or processed as inputs to bedops and bedmap, where set operations and element calculations can be performed directly and without the need for intermediate file extraction.

The starch utility includes large file support on 64-bit operating systems, enabling compression of more than 2 GB of data (a 2 GB ceiling is a common restriction on 32-bit systems).

Data can be stored with one of two open-source backend compression methods, bzip2 or gzip. The bzip2 method makes smaller archives, while gzip extracts faster, so the choice gives the end user a reasonable tradeoff between storage and speed when working with constrained storage or slower hardware.

6.3.2.1.1. Inputs and outputs

6.3.2.1.1.1. Input

As with other BEDOPS utilities, starch takes sorted BED data as input. You can use sort-bed to sort BED data, piping it into starch as standard input (see the Example section below).

Note

While more than three columns may be specified, most of the space savings in the Starch format are derived from a pre-processing step on the coordinates. Therefore, minimizing or removing unnecessary columnar data from the fourth column on (e.g., with cut -f1-3 or similar) can help improve compression efficiency considerably.
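As a rough illustration of the column trimming described in this note, the sketch below builds a small, made-up six-column BED file, keeps only the three coordinate columns, and compresses both versions if starch is found on PATH. The file names and sample rows are hypothetical, and on input this small the size difference may not be representative.

```shell
# Build a small, already-sorted six-column BED file (made-up sample data).
workdir=$(mktemp -d)
printf 'chr1\t10\t20\tid-1\t0\t+\nchr1\t30\t40\tid-2\t0\t-\n' > "$workdir/full.bed"

# Keep only chromosome, start and stop; everything from column 4 on is dropped.
cut -f1-3 "$workdir/full.bed" > "$workdir/coords.bed"

# Compress both versions and list archive sizes, if BEDOPS is installed.
if command -v starch >/dev/null 2>&1; then
    starch "$workdir/full.bed" > "$workdir/full.starch"
    starch "$workdir/coords.bed" > "$workdir/coords.starch"
    wc -c "$workdir/full.starch" "$workdir/coords.starch"
else
    echo "starch not found on PATH; skipping compression" >&2
fi
```

Trimming is only appropriate when the dropped columns are not needed downstream, since they cannot be recovered from the archive afterwards.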

6.3.2.1.1.2. Output

This utility outputs a Starch v2-formatted archive file.

6.3.2.1.2. Requirements

The starch tool requires data in a relaxed variation of the BED format as described by UCSC’s browser documentation. BED data should be sorted before compression, e.g. with BEDOPS sort-bed.

At a minimum, three columns are required to specify the chromosome name and start and stop positions. Additional columns may be specified, containing up to 128 kB of data per row (including tab delimiters).
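The requirements above can be pre-checked with standard tools before calling starch. The following is a minimal sketch, not part of BEDOPS: check_bed is a hypothetical helper, the 128 kB limit is assumed to mean 131072 characters for the whole row, and the sort check only approximates sort-bed ordering (chromosome compared bytewise, then start and stop numerically).

```shell
# check_bed FILE: return 0 if FILE looks acceptable to starch, non-zero otherwise.
# Hypothetical helper; it only approximates the checks starch itself performs.
check_bed() {
    # Every row needs at least three tab-delimited columns and must stay
    # within the assumed row limit of 131072 characters.
    awk -F'\t' 'NF < 3 || length($0) > 131072 { exit 1 }' "$1" || return 1
    # Approximate sort-bed order; -s disables the whole-line tie-break so that
    # rows with identical coordinates are not reported as out of order.
    LC_ALL=C sort -c -s -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n "$1" 2>/dev/null
}

# Usage with made-up data: the second row starts before the first.
workdir=$(mktemp -d)
printf 'chr1\t10\t20\nchr1\t5\t8\n' > "$workdir/unsorted.bed"
check_bed "$workdir/unsorted.bed" || echo "input needs sort-bed before starch"
```

A passing check does not guarantee that starch accepts the file; running sort-bed remains the reliable way to prepare input.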

6.3.2.1.3. Usage

Use the --help option to list all options:

starch
 citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
 binary version: 2.3.0 (creates archive version: 2.0.0)
 authors:  Alex Reynolds and Shane Neph

USAGE: starch [--note="foo bar..."] [--bzip2 | --gzip] [--header] [<unique-tag>] <bed-file>

    * BED input must be sorted lexicographically (e.g., using BEDOPS sort-bed).
    * Please use '-' to indicate reading BED data from standard input.
    * Output must be directed to a regular file.
    * The bzip2 compression type makes smaller archives, while gzip extracts faster.

    Process Flags:

    --note="foo bar..."   Append note to output archive metadata (optional)
    --bzip2 | --gzip      Specify backend compression type (optional, default is bzip2)
    --header              Support BED input with custom UCSC track, SAM or VCF headers, or generic comments (optional)
    <unique-tag>          Specify unique identifier for transformed data (optional)
    --help                Show this usage message
    --version             Show binary version

6.3.2.1.4. Options

6.3.2.1.4.1. Backend compression type

Use the --bzip2 or --gzip option to compress the transformed BED data with the bzip2 or gzip algorithm, respectively. By default, starch uses the bzip2 method.
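To compare the two backends on your own data, a sketch along these lines may help. It compresses the same sorted input once with each option and lists the resulting files; the sample rows and file names are made up, and the commands run only if starch is on PATH.

```shell
# Compress the same sorted input with each backend (sample data is made up).
workdir=$(mktemp -d)
printf 'chr1\t10\t20\nchr1\t30\t40\nchr2\t5\t15\n' > "$workdir/sorted.bed"

if command -v starch >/dev/null 2>&1; then
    # --bzip2 is the default; it is spelled out here only for symmetry.
    starch --bzip2 "$workdir/sorted.bed" > "$workdir/sorted.bzip2.starch"
    starch --gzip "$workdir/sorted.bed" > "$workdir/sorted.gzip.starch"
    ls -l "$workdir"/*.starch
else
    echo "starch not found on PATH; skipping" >&2
fi
```

Archive sizes on a three-row file say little about real workloads, so it is worth repeating the comparison on a representative dataset.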

6.3.2.1.4.2. Note

Use the --note="xyz..." option to add a custom string that describes the archive. This data can be retrieved with unstarch --note.

Tip

Examples of usage might include a description of the experiment associated with the data, a URL to a UCSC Genome Browser session, or a bar code or other unique identifier for internal lab or LIMS use.

Note

The only limitation on the length of a note is the command-line shell’s maximum argument length (as found on most UNIX systems with the command getconf ARG_MAX) minus the length of the rest of the command, i.e., everything other than the --note="..." argument. On most desktop systems, this value will be approximately 256 kB.
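The sketch below ties the note option to the length limit just described: it compares the note length against getconf ARG_MAX, then writes the note into an archive and reads it back with unstarch --note. The note text and file names are hypothetical, the length check is deliberately rough (it ignores the rest of the command line), and the BEDOPS steps run only if both tools are on PATH.

```shell
# Hypothetical note text and made-up sample data, for illustration only.
note="DNase-seq replicate 2, LIMS id 0042"
workdir=$(mktemp -d)
printf 'chr1\t10\t20\nchr1\t30\t40\n' > "$workdir/sorted.bed"

# Rough sanity check: the note alone must be shorter than the argument limit.
arg_max=$(getconf ARG_MAX)
note_len=${#note}
echo "note length: $note_len (argument limit: $arg_max)"

if [ "$note_len" -lt "$arg_max" ] \
    && command -v starch >/dev/null 2>&1 \
    && command -v unstarch >/dev/null 2>&1; then
    starch --note="$note" "$workdir/sorted.bed" > "$workdir/annotated.starch"
    # Read the note back from the archive metadata.
    unstarch --note "$workdir/annotated.starch"
fi
```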

6.3.2.1.4.3. Headers

Add the --header flag if the BED data being compressed contain extra header data, such as custom UCSC track lines exported from a Genome Browser session, SAM or VCF headers, or generic comments.

Note

If the BED data contain custom headers and --header is not specified, starch will be unable to read chromosome data correctly and exit with an error state.
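One way to decide whether --header is needed is to look for header-like lines before compressing. This is a sketch under assumptions: the pattern below (lines starting with track, browser, # or @) is a guess at what such headers commonly look like, not the rule starch applies internally, and the file contents are made up.

```shell
# Made-up BED file with a UCSC-style track line ahead of the sorted elements.
workdir=$(mktemp -d)
printf 'track name="example" description="made-up track"\nchr1\t10\t20\nchr1\t30\t40\n' \
    > "$workdir/with_header.bed"

# Count lines that look like headers (assumed pattern, see the text above).
header_lines=$(grep -c -E '^(track|browser|#|@)' "$workdir/with_header.bed")
echo "header-like lines: $header_lines"

if command -v starch >/dev/null 2>&1; then
    if [ "$header_lines" -gt 0 ]; then
        starch --header "$workdir/with_header.bed" > "$workdir/with_header.starch"
    else
        starch "$workdir/with_header.bed" > "$workdir/with_header.starch"
    fi
else
    echo "starch not found on PATH; skipping compression" >&2
fi
```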

6.3.2.1.4.4. Unique tag

Adding a <unique-tag> string replaces portions of the filename key in the archive’s stream metadata.

Note

This feature is largely obsolete and included for legacy support. It is better to use the --note="xyz..." option to add identifiers or other custom data.

6.3.2.1.5. Example

To compress unsorted BED data (or data of unknown sort order), we feed starch a sorted stream, using the hyphen (-) to specify standard input:

$ sort-bed unsorted.bed | starch - > sorted.starch

This creates the file sorted.starch, which uses the bzip2 algorithm to compress transformed BED data from a sorted permutation of data in unsorted.bed. No note or custom tag data is added.
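To confirm that an archive is lossless, the extracted data can be compared against the sorted input. The following sketch does this with made-up sample rows and hypothetical file names; it runs the BEDOPS steps only if sort-bed, starch and unstarch are all found on PATH, and otherwise reports that the check was skipped.

```shell
# Made-up unsorted input.
workdir=$(mktemp -d)
printf 'chr2\t5\t15\nchr1\t30\t40\nchr1\t10\t20\n' > "$workdir/unsorted.bed"

roundtrip="skipped"
if command -v sort-bed >/dev/null 2>&1 \
    && command -v starch >/dev/null 2>&1 \
    && command -v unstarch >/dev/null 2>&1; then
    # Keep a sorted copy to compare against, then compress it.
    sort-bed "$workdir/unsorted.bed" > "$workdir/sorted.bed"
    starch "$workdir/sorted.bed" > "$workdir/sorted.starch"
    # Extract the archive and compare it byte for byte with the sorted input.
    if unstarch "$workdir/sorted.starch" | cmp -s - "$workdir/sorted.bed"; then
        roundtrip="ok"
    else
        roundtrip="mismatch"
    fi
fi
echo "round trip: $roundtrip"
```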

It is possible to speed up the compression of a BED file by using a cluster. Start by reviewing our starchcluster script.
