6.3.3.2. gff2bed

The gff2bed script converts 1-based, closed [start, end] General Feature Format v3 (GFF3) data to sorted, 0-based, half-open [start-1, end) extended BED-formatted data.

For convenience, we also offer gff2starch, which performs the extra step of creating a Starch-formatted archive.

6.3.3.2.1. Dependencies

The gff2bed script requires Python, version 2.5 or greater.

This script also depends on input that follows the GFF3 specification. A GFF3-format validator is available here to ensure that your input follows the specification.

Tip

Converting data that are GFF-like but do not follow the specification can cause IOError and other runtime exceptions. If you run into problems, please check that your input follows the GFF3 specification.
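As a quick pre-flight check before conversion, a few structural rules of GFF3 can be tested line by line. The sketch below is not part of BEDOPS and does not replace a full validator; the function name check_gff3_line is hypothetical, and it only tests the column count, the coordinates, the strand symbol, and the trailing semicolon described in the Example section of this page.

```python
def check_gff3_line(line):
    """Return a list of problems found in one GFF3 line (empty if none)."""
    line = line.rstrip("\n")
    # Blank lines, directives (##...) and comments (#...) carry no feature.
    if not line or line.startswith("#"):
        return []
    fields = line.split("\t")
    if len(fields) != 9:
        return ["expected 9 tab-separated columns, found %d" % len(fields)]
    problems = []
    start, end = fields[3], fields[4]
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("need 1 <= start <= end")
    if fields[6] not in ("+", "-", ".", "?"):
        problems.append("strand must be one of + - . ?")
    # Not a specification check: flagged because this page reports that
    # a trailing semicolon on the attributes breaks the conversion.
    if fields[8].endswith(";"):
        problems.append("trailing semicolon in attributes")
    return problems
```

Running this over every line of an input file and printing any non-empty result, together with the line number, should point to the records most likely to raise exceptions during conversion.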

6.3.3.2.2. Source

The gff2bed and gff2starch conversion scripts are part of the binary and source downloads of BEDOPS. See the Installation documentation for more details.

6.3.3.2.3. Usage

The gff2bed script parses GFF3 from standard input and prints sorted BED to standard output. The gff2starch script uses an extra step to parse GFF to a compressed BEDOPS Starch-formatted archive, which is also directed to standard output.

Tip

By default, all conversion scripts now output sorted BED data ready for use with BEDOPS utilities. If you do not want to sort converted output, use the --do-not-sort option. Run the script with the --help option for more details.

Tip

If you are sorting data larger than system memory, use the --max-mem option to limit sort memory usage to a reasonable fraction of available memory, e.g., --max-mem 2G or similar. See --help for more details.
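When the conversion is driven from a Python pipeline rather than a shell, the options above can be assembled programmatically. The following is a minimal sketch: the helper names gff2bed_command and convert are hypothetical, it assumes gff2bed is on the PATH, and it uses only the --do-not-sort and --max-mem options mentioned in the tips.

```python
import subprocess


def gff2bed_command(sort=True, max_mem=None):
    """Build the argument list for a gff2bed call."""
    command = ["gff2bed"]
    if not sort:
        command.append("--do-not-sort")
    elif max_mem is not None:
        # Assumption: --max-mem is only relevant when sorting is enabled.
        command.extend(["--max-mem", max_mem])
    return command


def convert(gff_path, bed_path, **options):
    """Run gff2bed with gff_path on standard input and bed_path as output."""
    with open(gff_path, "rb") as source, open(bed_path, "wb") as target:
        subprocess.check_call(
            gff2bed_command(**options), stdin=source, stdout=target
        )
```

For example, convert("foo.gff", "foo.bed", max_mem="2G") would run the equivalent of gff2bed --max-mem 2G < foo.gff > foo.bed.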

6.3.3.2.4. Example

To demonstrate these scripts, we use a sample GFF input called foo.gff (see the Downloads section to grab this file).

##gff-version 3
chr1    Canada  exon    1300    1300    .       +       .       ID=exon00001;score=1
chr1    USA     exon    1050    1500    .       -       0       ID=exon00002;Ontology_term="GO:0046703";Ontology_term="GO:0046704"
chr1    Canada  exon    3000    3902    .       ?       2       ID=exon00003;score=4;Name=foo
chr1    .       exon    5000    5500    .       .       .       ID=exon00004;Gap=M8 D3 M6 I1 M6
chr1    .       exon    7000    9000    10      +       1       ID=exon00005;Dbxref="NCBI_gi:10727410"

We can convert it to sorted BED data in the following manner:

$ gff2bed < foo.gff
chr1    1049    1500    exon00002       .       -       USA     exon    0       ID=exon00002;Ontology_term="GO:0046703";Ontology_term="GO:0046704"
chr1    1299    1300    exon00001       .       +       Canada  exon    .       ID=exon00001;score=1;zeroLengthInsertion=True
chr1    2999    3902    exon00003       .       ?       Canada  exon    2       ID=exon00003;score=4;Name=foo
chr1    4999    5500    exon00004       .       .       .       exon    .       ID=exon00004;Gap=M8 D3 M6 I1 M6
chr1    6999    9000    exon00005       10      +       .       exon    1       ID=exon00005;Dbxref="NCBI_gi:10727410"
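The column mapping visible in this output can be written down as a short Python sketch. This is not the gff2bed source; it is an illustration inferred from the example above, and the function name gff3_to_bed_fields and the "." fallback for a missing ID are assumptions. It converts one record and does not sort.

```python
def gff3_to_bed_fields(line):
    """Map one tab-delimited GFF3 record to the extended BED columns."""
    fields = line.rstrip("\n").split("\t")
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    start, end = int(start), int(end)

    # The BED name column is taken from the ID attribute.
    # Assumption: fall back to "." when no ID is present.
    feature_id = "."
    for pair in attributes.split(";"):
        key, _, value = pair.partition("=")
        if key == "ID":
            feature_id = value
            break

    # As in the example output, elements with start == end are tagged.
    if start == end:
        attributes += ";zeroLengthInsertion=True"

    # 1-based closed [start, end] becomes 0-based half-open [start-1, end).
    return [seqid, str(start - 1), str(end), feature_id, score, strand,
            source, ftype, phase, attributes]
```

Joining the returned list with tabs reproduces the rows shown above for the sample records, for instance the exon00001 row with its zeroLengthInsertion attribute.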

Note

GFF3 data that have trailing semi-colons on attributes, e.g.:

Parent=ATMG00060.1,ATMG00060.1-Protein;

will cause IndexError: list index out of range errors when used with this conversion script.

The easiest fix is to use awk to strip the trailing delimiter and pipe the fixed results to the conversion script, i.e.:

$ awk '{gsub(/;$/,"");print}' badFoo.gff | gff2bed - > goodFoo.bed

This issue is also discussed on the BEDOPS User Forum.
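Where awk is not available, the same cleanup can be done in Python. The sketch below mirrors the awk expression by removing a single semicolon at the very end of each line; the function name strip_trailing_semicolon is hypothetical.

```python
import re


def strip_trailing_semicolon(line):
    """Remove one semicolon at the end of a line, keeping its newline."""
    newline = "\n" if line.endswith("\n") else ""
    return re.sub(r";$", "", line.rstrip("\n")) + newline
```

Applying this function to every line of badFoo.gff and feeding the result to the conversion script should have the same effect as the awk one-liner.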

Note

Zero-length insertion elements are given an extra attribute called zeroLengthInsertion which lets a BED-to-GFF or other parser know that the element will require conversion back to a right-closed element [a, b], where a and b are equal.
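To illustrate how a downstream parser might use this attribute, here is a hypothetical reverse mapping from a BED interval back to GFF3 coordinates. It is a sketch under stated assumptions, not a BEDOPS utility: the name bed_interval_to_gff3 is invented, and the flag is treated purely as a consistency check that the restored start and end are equal.

```python
ZERO_LENGTH_FLAG = "zeroLengthInsertion=True"


def bed_interval_to_gff3(start, end, attributes):
    """Map a 0-based half-open BED interval back to 1-based closed GFF3."""
    pairs = attributes.split(";")
    flagged = ZERO_LENGTH_FLAG in pairs
    # Drop the helper attribute so the GFF3 attributes match the original.
    kept = ";".join(pair for pair in pairs if pair != ZERO_LENGTH_FLAG)
    gff_start, gff_end = start + 1, end
    if flagged and gff_start != gff_end:
        raise ValueError("flagged element should have equal start and end")
    return gff_start, gff_end, kept
```

With the exon00001 row from the example, bed_interval_to_gff3(1299, 1300, "ID=exon00001;score=1;zeroLengthInsertion=True") gives back the closed element [1300, 1300] and the original attribute string.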

Note

Note the conversion from 1-based to 0-based coordinate indexing in the transition from GFF3 to BED. BEDOPS supports operations on input with any coordinate indexing, but the coordinate change made here is believed to be convenient for most end users.

6.3.3.2.5. Downloads