6.3.1.1. sort-bed
The sort-bed utility sorts BED files of any size, even those larger than system memory. BED files in lexicographic chromosome order allow BEDOPS utilities to work efficiently with data from any species without software modifications. Further, sorted files can be traversed very quickly.
Sorted BED order is defined first by lexicographic chromosome order, then by ascending integer start coordinate, and finally by ascending integer end coordinate. To make the sort order unambiguous, a lexicographic sort is applied on the fourth and subsequent columns, where present in the input BED dataset.
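The ordering rules above can be approximated with the standard POSIX sort utility. This is a minimal sketch for illustration, not the sort-bed implementation. The helper name bed_order_sort is hypothetical, and the input is assumed to be tab-delimited with integer coordinates in columns 2 and 3.

```shell
# Approximate sorted BED order with POSIX sort (a stand-in, not sort-bed itself).
# bed_order_sort is a hypothetical helper name.
bed_order_sort() {
    tab=$(printf '\t')
    # LC_ALL=C : compare chromosome names byte by byte (lexicographic order)
    # -k1,1    : column 1, chromosome name, compared as text
    # -k2,2n   : column 2, start coordinate, compared as a number
    # -k3,3n   : column 3, end coordinate, compared as a number
    # When all three keys tie, sort falls back to comparing the whole line,
    # which orders records by their remaining columns.
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n -k3,3n "$@"
}

# Usage: printf 'chr2\t5\t10\nchr10\t1\t2\nchr1\t100\t200\nchr1\t20\t30\n' | bed_order_sort
```

With that sample input, chr1 records come first (start 20 before start 100), followed by chr10 and then chr2. The chromosome comparison is lexicographic rather than numeric, which is why chr10 precedes chr2.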
Other utilities in the BEDOPS suite require data in sorted order as described. You only need to sort once: BEDOPS utilities all read and write data in sorted order.
6.3.1.1.1. Migrating older BED and Starch files
The update-sort-bed-migrate-candidates utility recursively locates BED and pre-v2.4.20 Starch files in the specified parent directory and tests whether they need re-sorting to conform to the updated, post-v2.4.20 sort-bed order. It can either log the candidate files or immediately re-sort them, locally or via a SLURM-managed cluster.
The convenience utilities update-sort-bed-slurm and update-sort-bed-starch-slurm update the sort order of BED or Starch files sorted with pre-v2.4.20 sort-bed via a SLURM-based cluster. See update-sort-bed-slurm --help or update-sort-bed-starch-slurm --help for more details. These utilities can be used standalone or in conjunction with the update-sort-bed-migrate-candidates utility.
6.3.1.1.2. Inputs and outputs
6.3.1.1.2.1. Input
The sort-bed utility requires one or more BED files, each with at least three columns. Common headers (such as UCSC BED track headers) are supported on input, but they are stripped from the output.
6.3.1.1.2.2. Output
The sort-bed utility sends sorted BED data to standard output, which can be redirected to a file or piped to other utilities, including core BEDOPS utilities like bedops and bedmap. Sort order is defined by a lexicographical sort on chromosome name, a numerical sort on start coordinates, a numerical sort on stop coordinates where there are start matches, and finally a lexicographical sort on the remainder of the BED element (if additional columns are present). Additional options may be specified to print only unique or duplicate elements, or check the sort order of input.
6.3.1.1.3. Usage
The --help option is fairly basic, but describes the usage:
sort-bed
citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
version: 2.4.41 (typical)
authors: Scott Kuehn
USAGE: sort-bed [--help] [--version] [--check-sort] [--max-mem <val>] [--tmpdir <path>] [--unique] [--duplicates] <file1.bed> <file2.bed> <...>
Sort BED file(s).
May use '-' to indicate stdin.
Results are sent to stdout.
<val> for --max-mem may be 8G, 8000M, or 8000000000 to specify 8 GB of memory.
--tmpdir is useful only with --max-mem.
--unique can be used to print only unique BED elements (similar to "sort -u").
--duplicates can be used to print only duplicated or repeated elements (similar to "uniq -d").
A simple example of using sort-bed would be:
$ sort-bed unsortedData.bed > sortedData.bed
The sort-bed program efficiently sorts BED inputs. By default, all input records are read into system memory and sorted. If your BED dataset is larger than available system memory, use the --max-mem option to limit the amount of memory sort-bed uses to do its work:
$ sort-bed --max-mem 2G reallyHugeUnsortedData.bed > reallyHugeSortedData.bed
This option allows sort-bed to scale to input of any size.
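How sort-bed works within a memory limit is not described here. As a generic illustration of why bounded memory plus temporary disk space is enough to sort arbitrarily large input, the following sketch performs an external merge sort with standard tools. The helper name chunked_bed_sort and its arguments are hypothetical, and the approach is an assumption about the general technique, not a description of sort-bed internals.

```shell
# Hypothetical helper: sort a large tab-delimited BED file in bounded-size pieces.
# usage: chunked_bed_sort <lines-per-chunk> <tmpdir> <input.bed>
chunked_bed_sort() {
    lines=$1
    tmpdir=$2
    input=$3
    tab=$(printf '\t')
    work=$(mktemp -d "$tmpdir/chunked-sort.XXXXXX") || return 1

    # 1. Split the input into chunks small enough to sort in memory.
    split -l "$lines" "$input" "$work/chunk." || return 1

    # 2. Sort each chunk on its own and keep only the sorted copy on disk.
    for chunk in "$work"/chunk.*; do
        [ -e "$chunk" ] || continue   # empty input produces no chunks
        LC_ALL=C sort -t "$tab" -k1,1 -k2,2n -k3,3n "$chunk" > "$chunk.sorted"
        rm "$chunk"
    done

    # 3. Merge the sorted chunks; -m merges files that are already sorted.
    set -- "$work"/chunk.*.sorted
    if [ -e "$1" ]; then
        LC_ALL=C sort -m -t "$tab" -k1,1 -k2,2n -k3,3n "$@"
    fi
    rm -r "$work"
}

# Usage: chunked_bed_sort 1000000 "$PWD" reallyHugeUnsortedData.bed > reallyHugeSortedData.bed
```

The intermediate chunk files are the reason a scheme like this needs a temporary directory with enough free space to hold a full copy of the input.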
The --tmpdir option specifies an alternative temporary directory when used in conjunction with the --max-mem option. This is useful if the host operating system’s standard temporary directory (e.g., /tmp on Linux or OS X) does not have sufficient space to hold intermediate results.
For example, to use the current working directory to store temporary data, one could use the $PWD environment variable:
$ sort-bed --max-mem 2G --tmpdir $PWD reallyHugeUnsortedData.bed > reallyHugeSortedData.bed
The --check-sort option reports whether or not the input is sorted.
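For scripts that need a comparable order check using only standard tools, POSIX sort offers a -c option that tests order without producing sorted output. The sketch below is an analogue built on the approximate sort keys for sorted BED order, not a wrapper around sort-bed --check-sort, whose message text and exit status are not documented here. The helper name bed_is_sorted is hypothetical, and tab-delimited input is assumed.

```shell
# Hypothetical helper: exit status 0 if the file is in (approximate) sorted
# BED order, non-zero otherwise. Uses POSIX sort, not sort-bed.
bed_is_sorted() {
    tab=$(printf '\t')
    # -c checks the order of a single file and writes nothing to stdout;
    # the diagnostic that sort prints on disorder is discarded here.
    LC_ALL=C sort -c -t "$tab" -k1,1 -k2,2n -k3,3n "$1" 2>/dev/null
}

# Usage:
# if bed_is_sorted data.bed; then echo "sorted"; else echo "not sorted"; fi
```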
The --unique and --duplicates options print only unique or duplicated elements in sorted output, respectively. These options mimic the sort -u and uniq -d commands, respectively.
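To make the comparison concrete, the following sketch shows what the standard commands do with already-sorted BED lines. The helper names are hypothetical. On sorted input, uniq keeps the same lines as sort -u while preserving the input order, so it is used here in place of sort -u.

```shell
# Both helpers (hypothetical names) read already-sorted BED lines on stdin.

# One copy of every distinct line (same lines as "sort -u" on sorted input).
print_unique() { LC_ALL=C uniq; }

# One copy of each line that occurs more than once ("uniq -d").
print_duplicates() { LC_ALL=C uniq -d; }

# Usage:
# printf 'chr1\t10\t20\nchr1\t30\t40\nchr1\t30\t40\nchr2\t5\t15\n' | print_duplicates
```

For the sample input in the usage comment, print_unique emits three lines (the repeated chr1 30 40 record appears once), and print_duplicates emits only the chr1 30 40 record.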