6.3.3.8. psl2bed¶
The psl2bed
script converts 0-based, half-open [start-1, end)
Pattern Space Layout (PSL) to sorted, 0-based, half-open [start-1, end)
extended BED-formatted data.
For convenience, we also offer psl2starch
, which performs the extra step of creating a Starch-formatted archive.
6.3.3.8.1. Dependencies¶
The psl2bed
script requires convert2bed. The psl2starch
script requires starch. Both dependencies are part of a typical BEDOPS installation.
This script is also dependent on input that follows the PSL specification.
Tip
Conversion of data which are PSL-like, but which do not follow the specification can cause parsing issues. If you run into problems, please check that your input follows the PSL specification.
6.3.3.8.2. Source¶
The psl2bed
and psl2starch
conversion scripts are part of the binary and source downloads of BEDOPS. See the Installation documentation for more details.
6.3.3.8.3. Usage¶
The psl2bed
script parses PSL from standard input and prints sorted BED to standard output. The psl2starch
script uses an extra step to parse GFF to a compressed BEDOPS Starch-formatted archive, which is also directed to standard output.
The header data of a headered PSL file is usually discarded, unless you add the --keep-header
option. In this case, BED elements are created from these data, using the chromosome name _header
to denote content. Line numbers are specified in the start and stop coordinates, and unmodified header data are placed in the fourth column (ID field).
If your data contains a record with multiple blocks (block count
is greater than one, and the tStarts
field has multiple target start positions), you can use the --split
option to print that record to separate BED elements, each with a start position defined by tStarts
and a length defined by the associated value in the blockSizes
string.
Tip
By default, all conversion scripts now output sorted BED data ready for use with BEDOPS utilities. If you do not want to sort converted output, use the --do-not-sort
option. Run the script with the --help
option for more details.
Tip
If you are sorting data larger than system memory, use the --max-mem
option to limit sort memory usage to a reasonable fraction of available memory, e.g., --max-mem 2G
or similar. See --help
for more details.
6.3.3.8.4. Example¶
To demonstrate these scripts, we use a sample PSL input called foo.psl
(see the Downloads section to grab this file).
psLayout version 3
match mis- rep. N's Q gap Q gap T gap T gap strand Q Q Q Q T T T T block blockSizes qStarts tStarts
match match count bases count bases name size start end name size start end count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
35 0 0 0 0 0 0 0 + foo 50 15 50 chrX 155270560 40535836 40535871 1 35, 15, 40535836,
34 2 0 0 0 0 0 0 + foo 50 14 50 chrX 155270560 68019028 68019064 1 36, 14, 68019028,
33 2 0 0 0 0 0 0 + foo 50 14 49 chrX 155270560 43068135 43068170 1 35, 14, 43068135,
35 2 0 0 0 0 0 0 + foo 50 13 50 chr8 146364022 131572122 131572159 1 37, 13, 131572122,
30 0 0 0 0 0 0 0 + foo 50 14 44 chr6 171115067 127685756 127685786 1 30, 14, 127685756,
30 0 0 0 0 0 0 0 + foo 50 14 44 chr6 171115067 93161871 93161901 1 30, 14, 93161871,
31 0 0 0 0 0 0 0 + foo 50 13 44 chr5 180915260 119897315 119897346 1 31, 13, 119897315,
30 0 0 0 0 0 0 0 + foo 50 14 44 chr5 180915260 1232.4.41 1232.4.415 1 30, 14, 1232.4.41,
...
We can convert it to sorted BED data in the following manner:
$ psl2bed < foo.psl
chr1 30571100 30571135 foo 50 - 35 0 0 0 0 0 0 0 15 50 249250621 1 35, 0, 30571100,
chr1 69592160 69592195 foo 50 - 34 1 0 0 0 0 0 0 15 50 249250621 1 35, 0, 69592160,
chr1 107200050 107200100 foo 50 + 50 0 0 0 0 0 0 0 0 50 249250621 1 50, 0, 107200050,
chr11 12618347 12618389 foo 50 + 39 3 0 0 0 0 0 0 8 50 135006516 1 42, 8, 12618347,
chr11 32933028 32933063 foo 50 + 35 0 0 0 1 1 0 0 8 44 135006516 2 4,31, 8,13, 32933028,32933032,
chr11 80116421 80116457 foo 50 + 35 1 0 0 0 0 0 0 14 50 135006516 1 36, 14, 80116421,
chr11 133952291 133952327 foo 50 + 34 2 0 0 0 0 0 0 14 50 135006516 1 36, 14, 133952291,
chr13 99729482 99729523 foo 50 + 39 2 0 0 0 0 0 0 8 49 115169878 1 41, 8, 99729482,
chr13 111391852 111391888 foo 50 + 34 2 0 0 0 0 0 0 14 50 115169878 1 36, 14, 111391852,
chr16 8149657 8149694 foo 50 + 36 1 0 0 0 0 0 0 13 50 90354753 1 37, 13, 8149657,
...
As you see here, the header data of a headered PSL file is discarded, unless you add the --keep-header
option. In this case, BED elements are created from these data, using the chromosome name _header
to denote content. Line numbers are specified in the start and stop coordinates, and unmodified header data are placed in the fourth column (ID field).
Here is a demonstration of conversion of the same headered input, adding the --keep-header
option:
$ psl2bed --keep-header < foo.psl
_header 0 1 psLayout version 3
_header 1 2
_header 2 3 match mis- rep. N's Q gap Q gap T gap T gap strand Q Q Q Q T T T T block blockSizes qStarts tStarts
_header 3 4 match match count bases count bases name size start end name size start end count
_header 4 5 ---------------------------------------------------------------------------------------------------------------------------------------------------------------
chr1 30571100 30571135 foo 50 - 35 0 0 0 0 0 0 0 15 50 249250621 1 35, 0, 30571100,
chr1 69592160 69592195 foo 50 - 34 1 0 0 0 0 0 0 15 50 249250621 1 35, 0, 69592160,
chr1 107200050 107200100 foo 50 + 50 0 0 0 0 0 0 0 0 50 249250621 1 50, 0, 107200050,
chr11 12618347 12618389 foo 50 + 39 3 0 0 0 0 0 0 8 50 135006516 1 42, 8, 12618347,
chr11 32933028 32933063 foo 50 + 35 0 0 0 1 1 0 0 8 44 135006516 2 4,31, 8,13, 32933028,32933032,
chr11 80116421 80116457 foo 50 + 35 1 0 0 0 0 0 0 14 50 135006516 1 36, 14, 80116421,
chr11 133952291 133952327 foo 50 + 34 2 0 0 0 0 0 0 14 50 135006516 1 36, 14, 133952291,
chr13 99729482 99729523 foo 50 + 39 2 0 0 0 0 0 0 8 49 115169878 1 41, 8, 99729482,
chr13 111391852 111391888 foo 50 + 34 2 0 0 0 0 0 0 14 50 115169878 1 36, 14, 111391852,
chr16 8149657 8149694 foo 50 + 36 1 0 0 0 0 0 0 13 50 90354753 1 37, 13, 8149657,
...
With this option, the psl2bed
and psl2starch
scripts are completely “non-lossy”. Use of awk
or other scripting tools can munge these data back into a PSL-formatted file.
This example PSL file contains one record with a block count of 2. If we were to add the --split
option, this record would be split into two separate BED elements that have start positions 32933028
and 32933032
, with lengths 4
and 31
, respectively. These elements fall within the genomic range already defined by the tStart
and tEnd
fields (32933028
and 32933063
).
Note
The psl2bed
and psl2starch
scripts work with headered or headerless PSL data.
Note
By default, the psl2bed
and psl2starch
scripts assume that PSL data do not need splitting. If you expect your data to contain multiple blocks, add the --split
option.
6.3.3.8.5. Column mapping¶
In this section, we describe how PSL columns are mapped to BED columns. We start with the first six UCSC BED columns as follows:
PSL field |
BED column index |
BED field |
---|---|---|
tName |
1 |
chromosome |
tStart(*) |
2 |
start |
tEnd(*) |
3 |
stop |
qName |
4 |
id |
matches |
5 |
score |
strand |
6 |
strand |
The remaining PSL columns are mapped, in order, to the remaining columns of the BED output:
PSL field |
BED column index |
BED field |
---|---|---|
qSize |
7 |
|
misMatches |
8 |
|
repMatches |
9 |
|
nCount |
10 |
|
qNumInsert |
11 |
|
qBaseInsert |
12 |
|
tNumInsert |
13 |
|
tBaseInsert |
14 |
|
qStart |
15 |
|
qEnd |
16 |
|
tSize |
17 |
|
blockCount |
18 |
|
blockSizes |
19 |
|
qStarts |
20 |
|
tStarts |
21 |
This is a lossless mapping. Because we have mapped all columns, we can translate converted BED data back to headerless PSL with a simple awk
statement that permutes columns to PSL-based ordering:
$ awk 'BEGIN { OFS = "\t" } \
{ \
print $5" "$8" "$9" "$10" "$11" "$12" "$13" "$14" "$6" "$4" "$7" "$15" "$16" "$1" "$17" "$2" "$3" "$18" "$19" "$20" "$21 }' converted.bed > original.psl
In the case where the --split
option is added, the tStart
and tEnd
fields are replaced with each of the values in the larger tStarts
string, added to the respective values in the larger blockSizes
string. This is still a lossless conversion, but modifications to the awk
script printed above would be required to rebuild the original PSL.
6.3.3.8.6. Downloads¶
Sample PSL dataset:
foo.psl