PAIRTOOLS(1) | pairtools | PAIRTOOLS(1) |
pairtools - pairtools Documentation
pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment. pairtools perform various operations on Hi-C pairs and occupy the middle position in a typical Hi-C data processing pipeline:
pairtools aim to be an all-in-one tool for processing Hi-C pairs, and can perform following operations:
pairtools produce .pairs files compliant with the 4DN standard.
The full list of available pairtools:
Pairtool | Description |
dedup | Find and remove PCR/optical duplicates. |
filterbycov | Remove pairs from regions of high coverage. |
flip | Flip pairs to get an upper-triangular matrix. |
markasdup | Tag pairs as duplicates. |
merge | Merge sorted .pairs/.pairsam files. |
parse | Find ligation junctions in .sam, make .pairs. |
phase | Phase pairs mapped to a diploid genome. |
restrict | Assign restriction fragments to pairs. |
select | Select pairs according to some condition. |
sort | Sort a .pairs/.pairsam file. |
split | Split a .pairsam file into .pairs and .sam. |
stats | Calculate pairs statistics. |
Contents:
Install pairtools and all of its dependencies using the conda package manager and the bioconda channel for bioinformatics software.
$ conda install -c conda-forge -c bioconda pairtools
Setup a new test folder and download a small Hi-C dataset mapped to sacCer3 genome:
$ mkdir /tmp/test-pairtools $ cd /tmp/test-pairtools $ wget https://github.com/mirnylab/distiller-test-data/raw/master/bam/MATalpha_R1.bam
Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:
$ wget https://raw.githubusercontent.com/mirnylab/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes
With pairtools parse, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:
$ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam
Inspect the resulting table:
$ less MATalpha_R1.pairs.gz
We highly recommend using the conda package manager to install pre-compiled pairtools together with all its dependencies. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.
With conda, you can install pre-compiled pairtools and all of its dependencies from the bioconda channel:
$ conda install -c conda-forge -c bioconda pairtools
Alternatively, compile and install pairtools and its Python dependencies from PyPI using pip:
$ pip install pairtools
Finally, you can install the latest development version of pairtools from github. First, make a local clone of the github repository:
$ git clone https://github.com/mirnylab/pairtools
Then, you can compile and install pairtools in the development mode, which installs the package without moving it to a system folder and thus allows immediate live-testing any changes in the python code. Please, make sure that you have cython installed!
$ cd pairtools $ pip install -e ./
Hi-C experiments aim to measure the frequencies of contacts between all pairs of loci in the genome. In these experiments, the spacial structure of chromosomes if first fixed with formaldehyde crosslinks, after which DNA is partially digested with restriction enzymes and then re-ligated back. Then, DNA is shredded into smaller pieces, released from nucleus, sequenced and aligned to the reference genome. The resulting sequence alignments reveal if DNA molecules were formed through ligations between DNA from different locations in the genome. These ligation events imply that ligated loci were close to each other when the ligation enzyme was active, i.e. they formed "a contact".
pairtools parse detects ligation events in the aligned sequences of DNA molecules formed in Hi-C experiments and reports them in the .pairs/.pairsam format.
Throughout this document we will be using the same visual language to describe how DNA sequences (in the .fastq format) are transformed into sequence alignments (.sam/.bam) and into ligation events (.pairs).
Short-read sequencing determines the sequences of the both ends (or, sides) of DNA molecules (typically 50-300 bp), producing read pairs in .fastq format (shown in the first row on the figure above). In such reads, base pairs are reported from the tips inwards, which is also defined as the 5'->3' direction (in accordance of the 5'->3' direction of the DNA strand that sequence of the corresponding side of the read).
Alignment software maps both reads of a pair to the reference genome, producing alignments, i.e. segments of the reference genome with matching sequences. Typically, there will be only two alignments per read pair, one on each side. But, sometimes, the parts of one or both sides may map to different locations on the genome, producing more than two alignments per DNA molecule (see Multiple ligations (walks)).
pairtools parse converts alignments into ligation events (aka Hi-C pairs aka pairs). In the simplest case, when each side has only one unique alignment (i.e. the whole side maps to a single unique segment of the genome), for each side, we report the chromosome, the genomic position of the outer-most (5') aligned base pair and the strand of the reference genome that the read aligns to. pairtools parse assigns to such pairs the type UU (unique-unique).
Sometimes, one side or both sides of a read pair may not align to the reference genome:
In this case, pairtools parse fills in the chromosome of the corresponding side of Hi-C pair with !, the position with 0 and the strand with -. Such pairs are reported as type NU (null-unique, when the other side has a unique alignment) or NN (null-null, when both sides lack any alignment).
Similarly, when one or both sides map to many genome locations equally well (i.e. have non-unique, or, multi-mapping alignments), pairtools parse reports the corresponding sides as (chromosome= !, position= 0, strand= -) and type MU (multi-unique) or MM (multi-multi) or NM (null-multi), depending on the type of the alignment on the other side.
pairtools parse calls an alignment to be multi-mapping when its MAPQ score (which depends on the scoring gap between the two best candidate alignments for a segment) is equal or greater than the value specified with the --min-mapq flag (by default, 1).
Finally, a read pair may contain more than two alignments:
Molecules like these typically form via multiple ligation events and we call them walks [1]. Currently, pairtools parse does not process such molecules and tags them as type WW.
Reads that are only partially aligned to the genome can be interpreted in two different ways. One possibility is to assume that this molecule was formed via at least two ligations (i.e. it's a walk) but the non-aligned part (a gap) was missing from the reference genome for one reason or another. Another possibility is to simply ignore this gap (for example, because it could be an insertion or a technical artifact), thus assuming that our molecule was formed via a single ligation and has to be reported:
Both options have their merits, depending on a dataset, quality of the reference genome and sequencing. pairtools parse ignores shorter gaps and keeps longer ones as "null" alignments. The maximal size of ignored gaps is set by the --max-inter-align-gap flag (by default, 20bp).
Importantly, some of DNA molecules containing only one ligation junction may still end up with three alignments:
A molecule formed via a single ligation gets three alignments when one of the two ligated DNA pieces is shorter than the read length, such that that read on the corresponding side sequences through the ligation junction and into the other piece [2]. The amount of such molecules depends on the type of the restriction enzyme, the typical size of DNA molecules in the Hi-C library and the read length, and sometimes can be considerable.
pairtools parse detects such molecules and rescues them (i.e. changes their type from a walk to a single-ligation molecule). It tests walks with three aligments using three criteria:
Sometimes, the "inner" alignment on the chimeric side can be non-unique or "null" (i.e. when the unmapped segment is longer than --max-inter-align-gap, as described in Interpreting gaps between alignments). pairtools parse ignores such alignments altogether and thus rescues such walks as well.
In order to enable efficient random access to Hi-C pairs, we flip and sort pairs. After sorting, interactions become arranged in the order of their genomic position, such that, for any given pair of regions, we easily find and extract all of their interactions. And, after flipping, all artificially duplicated molecules (either during PCR or in optical sequencing) end up in adjacent rows in sorted lists of interactions, such that we can easily identify and remove them.
pairtools sort arrange pairs in the order of (chrom1, chrom2, pos1, pos2). This order is also known as block sorting, because all pairs between any given pair of chromosomes become grouped into one continuous block. Additionally, pairtools sort also sorts pairs with identical positions by pair_type. This does not really do much for mapped reads, but it nicely splits unmapped reads into blocks of null-mapped and multi-mapped reads.
We note that there is an alternative to block sorting, called row sorting, where pairs are sorted by (chrom1, pos1, chrom2, pos2). In pairtools sort, we prefer block-sorting since it cleanly separates cis interactions from trans ones and thus is a more optimal solution for typical use cases.
In a typical paired-end experiment, side1 and side2 of a DNA molecule are defined by the order in which they got sequenced. Since this order is essentially random, any given Hi-C pair, e.g. (chr1, 1.1Mb; chr2, 2.1Mb), may appear in a reversed orientation, i.e. (chr2, 2.1Mb; chr1, 1.1Mb). If we were to preserve this order of sides, interactions between same loci would appear in two different locations of the sorted pair list, which would complicate finding PCR/optical duplicates.
To ensure that Hi-C pairs with similar coordinates end up in the same location of the sorted list, we flip pairs, i.e. we choose side1 as the side with the lowest genomic coordinate. Thus, after flipping, for trans pairs (chrom1!=chrom2), order(chrom1)<order(chrom2); and for cis pairs (chrom1==chrom2), pos1<=pos2. In a matrix representation, flipping is equal to reflecting the lower triangle of the Hi-C matrix onto its upper triangle, such that the resulting matrix is upper-triangular.
In pairtools, flipping is done during parsing - that's why pairtools parse requires a .chromsizes file that specifies the order of chromosomes for flipping. Importantly, pairtools parse also flips one-sided pairs such that side1 is always unmapped; and unmapped pairs such that side1 always has a "poorer" mapping type (i.e. null-mapping<multi-mapping).
Importantly, the order of chromosomes for sorting and flipping can be different. Specifically, pairtools sort uses the lexicographic order for chromosomes (chr1, chr10, chr11, ..., chr2, chr21,...) instead of the "natural" order (chr1, chr2, chr3, ...); at the same time, flipping is done in pairsamtools parse using the chromosomal order specified by the user.
pairtools sort uses the lexicographic order for sorting chromosomes. This order is used universally to sorting strings in all languages and tools [1], which makes it easy to design tools for indexing and searching in sorted pair lists.
At the same time, pairtools parse uses a custom user-provided chromosomal order to flip pairs. For performance considerations, for flipping, we recommend ordering chromosomes in a way that will be used in plotting contact maps.
.pairs is a simple tabular format for storing DNA contacts detected in a Hi-C experiment. The detailed .pairs specification is defined by the 4DN Consortium.
The body of a .pairs contains a table with a variable number of fields separated by a "\t" character (a horizontal tab). The .pairs specification fixes the content and the order of the first seven columns:
index | name | description |
1 | read_id | the ID of the read as defined in fastq files |
2 | chrom1 | the chromosome of the alignment on side 1 |
3 | pos1 | the 1-based genomic position of the outer-most (5') mapped bp on side 1 |
4 | chrom2 | the chromosome of the alignment on side 2 |
5 | pos2 | the 1-based genomic position of the outer-most (5') mapped bp on side 2 |
6 | strand1 | the strand of the alignment on side 1 |
7 | strand2 | the strand of the alignment on side 2 |
A .pairs file starts with a header, an arbitrary number of lines starting with a "#" character. By convention, the header lines have a format of "#field_name: field_value". The .pairs specification mandates a few standard header lines (e.g., column names, chromosome order, sorting order, etc), all of which are automatically filled in by pairtools.
The entries of a .pairs file can be flipped and sorted. "Flipping" means that the sides 1 and 2 do not correspond to side1 and side2 in sequencing data. Instead, side1 is defined as the side with the alignment with a lower sorting index (using the lexographic order for chromosome names, followed by the numeric order for positions and the lexicographic order for pair types). This particular order of "flipping" is defined as "upper-triangular flipping", or "triu-flipping". Finally, pairs are typically block-sorted: i.e. first lexicographically by chrom1 and chrom2, then numerically by pos1 and pos2.
.pairs files produced by pairtools extend .pairs format in a few ways.
index | name | description |
8 | pair_type | the type of a Hi-C pair |
pairtools use a simple two-character notation to define all possible pair types, according to the quality of alignment of the two sides. The type of a pair can be defined unambiguously using the table below. To use this table, identify which side has an alignment of a "poorer" quality (unmapped < multimapped < unique alignment) and which side has a "better" alignment and find the corresponding row in the table.
. | Less informative alignment | More informative alignment | . | . | . | ||
>2 alignments | Mapped | Unique | Mapped | Unique | Pair type | Code | Sidedness |
✔ | ❌ | ❌ | ❌ | ❌ | walk-walk | WW | 0 [1] |
❌ | ❌ | ❌ | null | NN | 0 | ||
❌ | ❌ | ❌ | corrupt | XX | 0 [2]_ |
❌ | ❌ | ✔ | ❌ | null-multi | NM | 0 | |
✔ | ❌ | ✔ | ✔ | null-rescued | NR | 1 [3] | |
❌ | ❌ | ✔ | ✔ | null-unique | NU | 1 | |
❌ | ✔ | ❌ | ✔ | ❌ | multi-multi | MM | 0 |
✔ | ✔ | ❌ | ✔ | ✔ | multi-rescued | MR | 1 [3] |
❌ | ✔ | ❌ | ✔ | ✔ | multi-unique | MU | 1 |
✔ | ✔ | ✔ | ✔ | ✔ | rescued-unique | RU | 2 [3] |
✔ | ✔ | ✔ | ✔ | ✔ | unique-rescued | UR | 2 [3] |
❌ | ✔ | ✔ | ✔ | ✔ | unique-unique | UU | 2 |
❌ | ✔ | ✔ | ✔ | ✔ | duplicate | DD | 2 [4]_ |
pairtools also define .pairsam, a valid extension of the .pairs format. On top of the pairtools' flavor of .pairs, .pairsam format adds two extra columns containing the alignments from which the Hi-C pair was extracted:
index | name | description |
9 | sam1 | the sam alignment(s) on side 1; separate supplemental alignments by NEXT_SAM |
10 | sam2 | the sam alignment(s) on side 2; separate supplemental alignments by NEXT_SAM |
Note that, normally, the fields of a sam alignment are separated by a horizontal tab character (\t), which we already use to separate .pairs columns. To avoid confusion, we replace the tab character in sam entries stored in sam1 and sam2 columns with a UNIT SEPARATOR character (\031).
Finally, sam1 and sam2 can store multiple .sam alignments, separated by a string '\031NEXT_SAM\031'
Designing scientific software and formats requires making a multitude of tantalizing technical decisions and compromises. Often, the reasons behind a certain decision are non-trivial and convoluted, involving many factors. Here, we collect the notes and observations made during the desing stage of pairtools and provide a justification for most non-trivial decisions. We hope that this document will elucidate the design of pairtools and may prove useful to developers in their future projects.
The motivation behind some of the technical decisions in the pairtools' flavor of .pairs/.pairsam:
Having said that, text formats have many downsides - they are bulky when not compressed, compression and parsing requires extra computational resources, they cannot be modified in place and random access requires extra tools. In the future, we plan to develop a binary format based on existing container formats, which would mitigate these downsides.
Mirny Lab
2017-2020, Mirny Lab
December 7, 2020 | 0.3.0 |