.. _starchcat:

`starchcat`
===========

The ``starchcat`` utility efficiently merges per-chromosome records contained within one or more BEDOPS :ref:`Starch-formatted <starch>` archives. This is an operation equivalent to ``bedops --everything`` or ``bedops -u`` (a :ref:`multiset union <bedops>`), except that the inputs are :ref:`starch` archives rather than uncompressed BED files.

As a further advantage of using this utility over :ref:`bedops`, in the case where a :ref:`starch` input contains BED elements exclusive to one chromosome, this utility will directly and quickly copy the compressed elements over to a new archive, avoiding the need for costly and wasteful extraction and re-compression.

In the general case, where two or more :ref:`starch` inputs contain BED elements from the same chromosome, a :ref:`sorted <sort-bed>` merge is performed and the stream is reprocessed into a :ref:`Starch-formatted <starch>` archive.

===============
Parallelization
===============

Those with access to a computational cluster, such as an Oracle/Sun Grid Engine or a group of hosts running SSH services, should find :ref:`starchcat` highly useful, as it facilitates:

* Much faster compression of an entire genome of BED data, using nodes of a computational cluster to compress separate chromosomes, followed by a collation step with :ref:`starchcat` (see the :ref:`starchcluster` documentation).
* Extraction, manipulation and reintegration of a :ref:`starch` -ed chromosome into a larger :ref:`starch` archive.
* Refreshing the metadata or re-compressing the data within a lone :ref:`starch` archive.

To demonstrate the first application of this utility, we have packaged a helper script with the BEDOPS suite called :ref:`starchcluster`, which archives data much faster than :ref:`starch` alone. By distributing work across the nodes of a computational cluster, the upper bound on compression time is almost entirely determined by the largest chromosome, reducing overall compression time by an order of magnitude. A simplified sketch of this strategy follows.
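The snippet below is a rough, serial sketch of the per-chromosome strategy, not the :ref:`starchcluster` script itself. It assumes a :ref:`sort-bed` -sorted, whole-genome BED file with the placeholder name ``input.bed``; the output name ``wholeGenome.starch`` is likewise a placeholder. On an actual cluster, each pass through the loop would be submitted as a separate job, which is what :ref:`starchcluster` automates:

::

  #!/bin/bash
  # Placeholder input: a sort-bed-sorted, whole-genome BED file named input.bed.
  # On a cluster, each iteration would run as its own job; here the chromosomes
  # are simply processed one after another.
  for chr in $(bedextract --list-chr input.bed); do
      bedextract ${chr} input.bed > ${chr}.bed    # pull out one chromosome
      starch ${chr}.bed > ${chr}.starch           # compress it on its own
  done

  # Collate the per-chromosome archives into a single, whole-genome archive
  # (assumes chromosome names begin with "chr").
  starchcat chr*.starch > wholeGenome.starch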
==================
Inputs and outputs
==================

-----
Input
-----

The input to :ref:`starchcat` consists of one or more BEDOPS :ref:`Starch-formatted <starch>` archive files.

.. note:: If a single archive is provided as input, it may be reprocessed with
   specified options. When two or more archives are specified, the output will
   be the equivalent of a multiset union of the inputs.

.. note:: This utility does not accept standard input.

------
Output
------

The :ref:`starchcat` tool outputs a :ref:`starch` -formatted archive to standard output, which is usually redirected to a file. Additionally, an optional compression flag specifies whether the final :ref:`starch` output should be compressed with either the ``bzip2`` or ``gzip`` method (the default being ``bzip2``).

.. note:: If a :ref:`starch` input uses a different backend compression method,
   the input stream is re-compressed before being integrated into the larger
   archive. This will incur extra processing overhead.

=====
Usage
=====

Use the ``--help`` option to list all options:

::

  starchcat
   citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
   version:  2.3.0
   authors:  Alex Reynolds and Shane Neph

  USAGE: starchcat [ --note="..." ]
                   [ --bzip2 | --gzip ]
                   <starch-file-1> [ <starch-file-2> ... ]

      * At least one lexicographically-sorted, headerless starch archive is
        required. While two or more inputs make sense for a multiset union
        operation, you can starchcat one file in order to update its metadata,
        recompress it with a different backend method, or add a note annotation.

      * Compressed data are sent to standard output. Use the '>' operator to
        redirect to a file.

      Process Flags:

      --note="foo bar..."   Append note to output archive metadata (optional)
      --bzip2 | --gzip      Specify backend compression type (optional, default
                            is bzip2)
      --version             Show binary version
      --help                Show this usage message

-------
Example
-------

Let's say we have a set of 23 :ref:`starch` archives, one for each chromosome of the human genome: ``chr1.starch``, ``chr2.starch``, and so on, to ``chrY.starch``. (To simplify this example, we leave out mitochondrial, random, pseudo- and other chromosomes.) We would like to build a new :ref:`starch` archive from these 23 separate files:

::

  $ starchcat chr1.starch chr2.starch ... chrY.starch > humanGenome.starch

The :ref:`starchcat` utility parses the metadata from each of the 23 inputs, determines which data can simply be copied and which must be reprocessed, and then performs the merge. Cleanup is performed afterwards, as necessary, and the output is a brand-new :ref:`starch` file, written to ``humanGenome.starch``.

.. note:: No filtering or processing is performed on extracted BED elements before
   they are written to the final output. Thus, *it is possible for duplicate BED
   elements to occur*. However, the final archive is sorted per :ref:`sort-bed`
   ordering, so that data extracted from this archive will be ready for use with
   BEDOPS utilities.

.. note:: When input archives contain data on disjoint chromosomes, use of
   :ref:`starchcat` is very efficient, as data are simply copied instead of being
   extracted and re-compressed.

.. |--| unicode:: U+2013   .. en dash
.. |---| unicode:: U+2014  .. em dash, trimming surrounding whitespace
   :trim:
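Finally, as the usage statement above notes, :ref:`starchcat` also accepts a single archive as input, which makes it possible to refresh an archive's metadata, switch its backend compression method, or attach a note annotation. Here is a minimal sketch of that use, reusing the ``humanGenome.starch`` archive from the example above; the output file name and note text are placeholders:

::

  # Re-compress the archive with the gzip backend and record a note in its metadata.
  $ starchcat --gzip --note="gzip-backed copy" humanGenome.starch > humanGenome.gzip.starch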