4 Subsetting PyRanges

There are many ways to subset a PyRanges object. Each returns a new PyRanges object and does not change the old one.

For data exploration, the methods head, tail and sample (random choice without replacement) are convenient. Each takes an argument n giving the number of entries to return.

import pyranges as pr
gr = pr.data.chipseq()
print(gr.sample())
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   |     Start |       End | Name       |     Score | Strand       |
## | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr2         | 223891559 | 223891584 | U0         |         0 | +            |
## | chr2         | 205599166 | 205599191 | U0         |         0 | +            |
## | chr4         | 141494539 | 141494564 | U0         |         0 | +            |
## | chr6         |  40587449 |  40587474 | U0         |         0 | -            |
## | chr12        |  61116050 |  61116075 | U0         |         0 | +            |
## | chr16        |  77874004 |  77874029 | U0         |         0 | -            |
## | chr18        |  17324817 |  17324842 | U0         |         0 | +            |
## | chr19        |  41397102 |  41397127 | U0         |         0 | +            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 8 rows and 6 columns from 7 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
print(gr.tail(4))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   |     Start |       End | Name       |     Score | Strand       |
## | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chrY         |  15224235 |  15224260 | U0         |         0 | -            |
## | chrY         |  13517892 |  13517917 | U0         |         0 | -            |
## | chrY         |   8010951 |   8010976 | U0         |         0 | -            |
## | chrY         |   7405376 |   7405401 | U0         |         0 | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 4 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.

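By subsetting with a list of column names you can select one or more columns. As the output shows, the Chromosome, Start, End and Strand columns are kept alongside the columns you ask for: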

import pyranges as pr
gr = pr.data.chipseq()
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | 0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | 0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | U0         | 0         | -            |
## | chrY         | 13517892  | 13517917  | U0         | 0         | -            |
## | chrY         | 8010951   | 8010976   | U0         | 0         | -            |
## | chrY         | 7405376   | 7405401   | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
print(gr[["Name"]])
## +--------------+-----------+-----------+------------+--------------+
## | Chromosome   | Start     | End       | Name       | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (category)   |
## |--------------+-----------+-----------+------------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | +            |
## | ...          | ...       | ...       | ...        | ...          |
## | chrY         | 15224235  | 15224260  | U0         | -            |
## | chrY         | 13517892  | 13517917  | U0         | -            |
## | chrY         | 8010951   | 8010976   | U0         | -            |
## | chrY         | 7405376   | 7405401   | U0         | -            |
## +--------------+-----------+-----------+------------+--------------+
## Stranded PyRanges object has 10,000 rows and 5 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.

By subsetting using a boolean vector, you can get specific rows:

import pyranges as pr
cpg = pr.data.cpg()
print(cpg[cpg.CpG > 50])
## +--------------+-----------+-----------+-----------+
## | Chromosome   | Start     | End       | CpG       |
## | (category)   | (int32)   | (int32)   | (int64)   |
## |--------------+-----------+-----------+-----------|
## | chrX         | 64181     | 64793     | 62        |
## | chrX         | 69133     | 70029     | 100       |
## | chrX         | 148685    | 149461    | 85        |
## | chrX         | 166504    | 167721    | 96        |
## | ...          | ...       | ...       | ...       |
## | chrY         | 21154603  | 21155040  | 61        |
## | chrY         | 21238448  | 21240005  | 133       |
## | chrY         | 26351343  | 26352316  | 76        |
## | chrY         | 27610115  | 27611088  | 76        |
## +--------------+-----------+-----------+-----------+
## Unstranded PyRanges object has 530 rows and 4 columns from 2 chromosomes.
## For printing, the PyRanges was sorted on Chromosome.

By using strings, tuples and slices, you can subset the PyRanges based on position:

Chromosome only

print(gr["chrX"])
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chrX         | 13843759  | 13843784  | U0         | 0         | +            |
## | chrX         | 114673546 | 114673571 | U0         | 0         | +            |
## | chrX         | 131816774 | 131816799 | U0         | 0         | +            |
## | chrX         | 45504745  | 45504770  | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrX         | 146694149 | 146694174 | U0         | 0         | -            |
## | chrX         | 5044527   | 5044552   | U0         | 0         | -            |
## | chrX         | 15281263  | 15281288  | U0         | 0         | -            |
## | chrX         | 120273723 | 120273748 | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 282 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.

Chromosome and Strand

print(gr["chrX", "-"])
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chrX         | 41852946  | 41852971  | U0         | 0         | -            |
## | chrX         | 69979838  | 69979863  | U0         | 0         | -            |
## | chrX         | 34824145  | 34824170  | U0         | 0         | -            |
## | chrX         | 132354117 | 132354142 | U0         | 0         | -            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrX         | 146694149 | 146694174 | U0         | 0         | -            |
## | chrX         | 5044527   | 5044552   | U0         | 0         | -            |
## | chrX         | 15281263  | 15281288  | U0         | 0         | -            |
## | chrX         | 120273723 | 120273748 | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 151 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.

Chromosome and Slice

print(gr["chrX", 150000000:160000000])
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   |     Start |       End | Name       |     Score | Strand       |
## | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chrX         | 151324943 | 151324968 | U0         |         0 | +            |
## | chrX         | 152902449 | 152902474 | U0         |         0 | +            |
## | chrX         | 153632850 | 153632875 | U0         |         0 | +            |
## | chrX         | 153874106 | 153874131 | U0         |         0 | +            |
## | chrX         | 150277236 | 150277261 | U0         |         0 | -            |
## | chrX         | 151277790 | 151277815 | U0         |         0 | -            |
## | chrX         | 153037423 | 153037448 | U0         |         0 | -            |
## | chrX         | 153255924 | 153255949 | U0         |         0 | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 8 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
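
The coordinate test behind a slice query can be pictured in plain Python. This is a minimal sketch, assuming that a slice start:stop selects every interval that overlaps the half-open window [start, stop), so an interval straddling an edge of the window would also be returned. The helper overlaps and the straddling interval are hypothetical and are not part of PyRanges; the other coordinates are taken from the chrX output above.

```python
def overlaps(start, end, query_start, query_end):
    """Return True if half-open [start, end) overlaps half-open [query_start, query_end)."""
    return start < query_end and end > query_start


# (start, end) pairs from the chrX output above, plus one hypothetical
# interval that straddles the left edge of the query window.
intervals = [
    (151324943, 151324968),  # inside the window
    (150277236, 150277261),  # inside the window
    (149999990, 150000015),  # hypothetical: starts before the window, ends inside it
    (146694149, 146694174),  # from the chrX listing: entirely outside the window
]

# Keep only the intervals that touch the window 150000000:160000000.
hits = [iv for iv in intervals if overlaps(*iv, 150000000, 160000000)]
print(hits)
```

If the assumption holds, the first three intervals are kept and the last one is dropped.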

Chromosome, Strand and Slice

print(gr["chrX", "-", 150000000:160000000])
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   |     Start |       End | Name       |     Score | Strand       |
## | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chrX         | 150277236 | 150277261 | U0         |         0 | -            |
## | chrX         | 151277790 | 151277815 | U0         |         0 | -            |
## | chrX         | 153037423 | 153037448 | U0         |         0 | -            |
## | chrX         | 153255924 | 153255949 | U0         |         0 | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 4 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.

Slice

Using only a slice returns the ranges within those coordinates from all chromosomes and strands.

print(gr[0:100000])
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   |     Start |       End | Name       |     Score | Strand       |
## | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr2         |     33241 |     33266 | U0         |         0 | +            |
## | chr2         |     13611 |     13636 | U0         |         0 | -            |
## | chr2         |     32620 |     32645 | U0         |         0 | -            |
## | chr3         |     87179 |     87204 | U0         |         0 | +            |
## | chr4         |     45413 |     45438 | U0         |         0 | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 5 rows and 6 columns from 3 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.

Note that while a single slice query returns seemingly immediately, the slice operator is inefficient for repeated use: it builds the interval overlap data structure anew on every query. So if you have multiple queries, you should put them in another PyRanges and do an intersect operation instead.
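
The cost argument can be illustrated with a toy index in plain Python. This is only a sketch of the "build once, query many times" idea: ToyIntervalIndex is a hypothetical stand-in and is not the data structure PyRanges uses internally. The coordinates are borrowed from the output above and treated as if they all lay on one chromosome.

```python
from bisect import bisect_left, bisect_right


class ToyIntervalIndex:
    """Hypothetical overlap index over half-open (start, end) intervals."""

    def __init__(self, intervals):
        # The expensive step: sort once and remember the longest interval.
        self.intervals = sorted(intervals)
        self.starts = [s for s, _ in self.intervals]
        self.max_len = max((e - s for s, e in self.intervals), default=0)

    def query(self, q_start, q_end):
        # Only intervals with q_start - max_len < start < q_end can overlap.
        lo = bisect_right(self.starts, q_start - self.max_len)
        hi = bisect_left(self.starts, q_end)
        return [(s, e) for s, e in self.intervals[lo:hi] if e > q_start]


intervals = [(33241, 33266), (13611, 13636), (32620, 32645),
             (87179, 87204), (45413, 45438)]
queries = [(0, 20000), (30000, 40000), (80000, 100000)]

# Build the index a single time, then reuse it for every query window.
index = ToyIntervalIndex(intervals)
results = {q: index.query(*q) for q in queries}
for q, found in results.items():
    print(q, found)
```

Rebuilding the index inside the loop would repeat the sorting step for every window, which is the overhead the note above warns about. Collecting the windows into a second PyRanges and intersecting pays that cost once.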

Strand

print(gr["+"])
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | 0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | 0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 21559181  | 21559206  | U0         | 0         | +            |
## | chrY         | 11942770  | 11942795  | U0         | 0         | +            |
## | chrY         | 8316773   | 8316798   | U0         | 0         | +            |
## | chrY         | 7463444   | 7463469   | U0         | 0         | +            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 5,050 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.

Strand and Slice

print(gr["+", 0:100000])
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   |     Start |       End | Name       |     Score | Strand       |
## | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr2         |     33241 |     33266 | U0         |         0 | +            |
## | chr3         |     87179 |     87204 | U0         |         0 | +            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 2 rows and 6 columns from 2 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.