1 Introduction to PyRanges

PyRanges are collections of intervals that support comparison operations (such as overlap and intersection) and other methods useful for genomic analyses. Each range can carry an arbitrary number of metadata fields, i.e. columns associated with it.

The data in a PyRanges object is stored in a pandas dataframe, so the Python ecosystem for high-performance scientific computing is available for manipulating it.

import pyranges as pr
from pyranges import PyRanges
import pandas as pd
from io import StringIO
f1 = """Chromosome Start End Score Strand
chr1 4 7 23.8 +
chr1 6 11 0.13 -
chr2 0 14 42.42 +"""
df1 = pd.read_table(StringIO(f1), sep=r"\s+")
gr1 = PyRanges(df1)

Now we can print the PyRanges object and subset it in various ways:

print(gr1)
## +--------------+-----------+-----------+-------------+--------------+
## | Chromosome   |     Start |       End |       Score | Strand       |
## | (category)   |   (int64) |   (int64) |   (float64) | (category)   |
## |--------------+-----------+-----------+-------------+--------------|
## | chr1         |         4 |         7 |       23.8  | +            |
## | chr1         |         6 |        11 |        0.13 | -            |
## | chr2         |         0 |        14 |       42.42 | +            |
## +--------------+-----------+-----------+-------------+--------------+
## PyRanges object has 3 sequences from 2 chromosomes.
print(gr1["chr1", 0:5])
## +--------------+-----------+-----------+-------------+--------------+
## | Chromosome   |     Start |       End |       Score | Strand       |
## | (category)   |   (int64) |   (int64) |   (float64) | (category)   |
## |--------------+-----------+-----------+-------------+--------------|
## | chr1         |         4 |         7 |        23.8 | +            |
## +--------------+-----------+-----------+-------------+--------------+
## PyRanges object has 1 sequences from 1 chromosomes.
print(gr1["chr1", "-", 6:100])
## +--------------+-----------+-----------+-------------+--------------+
## | Chromosome   |     Start |       End |       Score | Strand       |
## | (category)   |   (int64) |   (int64) |   (float64) | (category)   |
## |--------------+-----------+-----------+-------------+--------------|
## | chr1         |         6 |        11 |        0.13 | -            |
## +--------------+-----------+-----------+-------------+--------------+
## PyRanges object has 1 sequences from 1 chromosomes.
print(gr1.Score)
## 0    23.80
## 1     0.13
## 2    42.42
## Name: Score, dtype: float64
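
The subsetting results above can be reproduced by hand. The following is a minimal plain-Python sketch, not PyRanges code: the `rows` list and the `subset` helper are hypothetical stand-ins, and it assumes that intervals are half-open (`Start` included, `End` excluded) and that a slice such as `0:5` selects every interval overlapping that region. Both assumptions are consistent with the output shown, but they are not stated in the text.

```python
# Hypothetical plain-Python stand-in for the rows of gr1 (not PyRanges code).
rows = [
    ("chr1", 4, 7, 23.8, "+"),
    ("chr1", 6, 11, 0.13, "-"),
    ("chr2", 0, 14, 42.42, "+"),
]


def subset(rows, chromosome, start, end, strand=None):
    """Keep rows on `chromosome` (and `strand`, if given) overlapping [start, end)."""
    kept = []
    for chrom, s, e, score, st in rows:
        if chrom != chromosome:
            continue
        if strand is not None and st != strand:
            continue
        # Assumption: half-open intervals overlap when each starts before the other ends.
        if s < end and start < e:
            kept.append((chrom, s, e, score, st))
    return kept


first = subset(rows, "chr1", 0, 5)                 # mirrors gr1["chr1", 0:5]
second = subset(rows, "chr1", 6, 100, strand="-")  # mirrors gr1["chr1", "-", 6:100]
print(first)
print(second)
```

Under these assumptions the interval 6-11 is excluded from the first query because it starts after position 5, which matches the single row printed above.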

We can also perform comparison operations between two PyRanges:

f2 = """Chromosome Start End Score Strand
chr1 5 6 -0.01 -
chr1 9 12 200 +
chr3 0 14 21.21 -"""
df2 = pd.read_table(StringIO(f2), sep=r"\s+")
gr2 = PyRanges(df2)
print(gr2)
## +--------------+-----------+-----------+-------------+--------------+
## | Chromosome   |     Start |       End |       Score | Strand       |
## | (category)   |   (int64) |   (int64) |   (float64) | (category)   |
## |--------------+-----------+-----------+-------------+--------------|
## | chr1         |         5 |         6 |       -0.01 | -            |
## | chr1         |         9 |        12 |      200    | +            |
## | chr3         |         0 |        14 |       21.21 | -            |
## +--------------+-----------+-----------+-------------+--------------+
## PyRanges object has 3 sequences from 2 chromosomes.
print(gr1.intersection(gr2, strandedness="opposite"))
## +--------------+-----------+-----------+-------------+--------------+
## | Chromosome   |     Start |       End |       Score | Strand       |
## | (category)   |   (int64) |   (int64) |   (float64) | (category)   |
## |--------------+-----------+-----------+-------------+--------------|
## | chr1         |         5 |         6 |       23.8  | +            |
## | chr1         |         9 |        11 |        0.13 | -            |
## +--------------+-----------+-----------+-------------+--------------+
## PyRanges object has 2 sequences from 1 chromosomes.
print(gr1.intersection(gr2, strandedness=False))
## +--------------+-----------+-----------+-------------+--------------+
## | Chromosome   |     Start |       End |       Score | Strand       |
## | (category)   |   (int64) |   (int64) |   (float64) | (category)   |
## |--------------+-----------+-----------+-------------+--------------|
## | chr1         |         9 |        11 |        0.13 | -            |
## | chr1         |         5 |         6 |       23.8  | +            |
## +--------------+-----------+-----------+-------------+--------------+
## PyRanges object has 2 sequences from 1 chromosomes.
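
To see where the intersection coordinates come from, here is a minimal plain-Python sketch of the arithmetic. It is not how PyRanges implements `intersection`; the helpers `intersect` and `intersect_all` are hypothetical, only the chr1 rows are modelled, and half-open intervals are assumed. The metadata of the left-hand range is kept, as in the output above, and row order is not modelled.

```python
def intersect(a, b):
    """Return the overlap of two (start, end) intervals, or None if there is none."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    # Assumption: half-open intervals, so touching intervals do not overlap.
    return (start, end) if start < end else None


def intersect_all(left, right, opposite_strand):
    """Intersect every left interval with every right interval (quadratic sketch)."""
    out = []
    for l_start, l_end, l_strand in left:
        for r_start, r_end, r_strand in right:
            if opposite_strand and l_strand == r_strand:
                continue  # only pair ranges on opposite strands
            hit = intersect((l_start, l_end), (r_start, r_end))
            if hit is not None:
                out.append((hit[0], hit[1], l_strand))  # keep the left strand
    return out


gr1_chr1 = [(4, 7, "+"), (6, 11, "-")]  # chr1 rows of gr1
gr2_chr1 = [(5, 6, "-"), (9, 12, "+")]  # chr1 rows of gr2

opposite = intersect_all(gr1_chr1, gr2_chr1, opposite_strand=True)
unstranded = intersect_all(gr1_chr1, gr2_chr1, opposite_strand=False)
print(opposite)
print(unstranded)
```

Both calls give the same two intervals for this data: the only same-strand pairs either do not overlap (4-7 and 9-12) or merely touch (6-11 and 5-6), so ignoring the strand adds no rows.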

There are also convenience methods for single PyRanges:

# cluster() merges the overlapping intervals on each chromosome into one.
print(gr1.cluster())
## +--------------+-----------+-----------+
## | Chromosome   |     Start |       End |
## | (category)   |   (int64) |   (int64) |
## |--------------+-----------+-----------|
## | chr1         |         4 |        11 |
## | chr2         |         0 |        14 |
## +--------------+-----------+-----------+
## PyRanges object has 2 sequences from 2 chromosomes.
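
The merging shown in the output above can be sketched in plain Python. This is an illustration under stated assumptions, not the PyRanges implementation: `merge_overlapping` is a hypothetical helper, strand is ignored (as the output suggests), and only strictly overlapping intervals are merged, since the text does not say how merely adjacent intervals are treated.

```python
from collections import defaultdict


def merge_overlapping(intervals):
    """Merge overlapping (chromosome, start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        # Sorting by start lets a single pass grow the current cluster.
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start < current_end:  # assumption: strict overlap is required
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


clusters = merge_overlapping([("chr1", 4, 7), ("chr1", 6, 11), ("chr2", 0, 14)])
print(clusters)
```

The two chr1 intervals overlap between positions 6 and 7, so they collapse into 4-11, while the single chr2 interval is returned unchanged.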

The underlying dataframe can always be accessed:

print(gr1.df)
##   Chromosome  Start  End  Score Strand
## 0       chr1      4    7  23.80      +
## 1       chr1      6   11   0.13      -
## 2       chr2      0   14  42.42      +