12 Methods for manipulating single PyRanges
There are several methods for manipulating the contents of a PyRanges.
merge creates a union of all the intervals in the ranges. It also takes the flag count, which reports how many intervals were merged into each resulting interval:
import pyranges as pr
f1 = pr.data.f1()
print(f1.merge(count=True))
## +--------------+-----------+-----------+--------------+-----------+
## | Chromosome | Start | End | Strand | Count |
## | (category) | (int32) | (int32) | (category) | (int32) |
## |--------------+-----------+-----------+--------------+-----------|
## | chr1 | 3 | 6 | + | 1 |
## | chr1 | 8 | 9 | + | 1 |
## | chr1 | 5 | 7 | - | 1 |
## +--------------+-----------+-----------+--------------+-----------+
## Stranded PyRanges object has 3 rows and 5 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
Cluster finds which intervals overlap, but gives each interval a cluster ID instead of merging them:
import pyranges as pr
f1 = pr.data.f1()
print(f1.cluster())
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome | Start | End | Name | Score | +2 |
## | (category) | (int32) | (int32) | (object) | (int64) | ... |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1 | 3 | 6 | interval1 | 0 | ... |
## | chr1 | 8 | 9 | interval3 | 0 | ... |
## | chr1 | 5 | 7 | interval2 | 0 | ... |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 3 rows and 7 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 2 hidden columns: Strand, Cluster
print(f1.cluster(strand=True))
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome | Start | End | Name | Score | +2 |
## | (category) | (int32) | (int32) | (object) | (int64) | ... |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1 | 3 | 6 | interval1 | 0 | ... |
## | chr1 | 8 | 9 | interval3 | 0 | ... |
## | chr1 | 5 | 7 | interval2 | 0 | ... |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 3 rows and 7 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 2 hidden columns: Strand, Cluster
print(f1.cluster(slack=2, strand=True, count=True))
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome | Start | End | Name | Score | +3 |
## | (category) | (int32) | (int32) | (object) | (int64) | ... |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1 | 3 | 6 | interval1 | 0 | ... |
## | chr1 | 8 | 9 | interval3 | 0 | ... |
## | chr1 | 5 | 7 | interval2 | 0 | ... |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 3 rows and 8 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 3 hidden columns: Strand, Cluster, Count
cluster also takes the flag count, which adds a Count column (hidden in the last printout above).
Both cluster and merge take the argument slack, so that you can merge features which are not directly overlapping. If you set slack to -1 you avoid merging bookended features. To only merge features overlapping by at least X, set slack to -X.
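To make the slack rule concrete, here is a minimal plain-Python sketch. It is not PyRanges's implementation; it assumes half-open intervals and the rule that a feature joins the previous one when its start is at most the previous end plus slack, which is consistent with the description above. The function name merge_intervals is hypothetical.

```python
def merge_intervals(intervals, slack=0):
    """Merge half-open (start, end) intervals.

    Returns (start, end, count) tuples, where count is the number of
    input intervals folded into each merged interval.
    Assumed rule: an interval joins the previous one when
    start <= previous_end + slack.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + slack:
            prev_start, prev_end, count = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), count + 1)
        else:
            merged.append((start, end, 1))
    return merged


intervals = [(3, 6), (5, 7), (7, 9), (12, 14)]

# slack=0: the bookended pair (5, 7) and (7, 9) is merged.
print(merge_intervals(intervals))            # [(3, 9, 3), (12, 14, 1)]
# slack=-1: bookended features stay separate; true overlaps still merge.
print(merge_intervals(intervals, slack=-1))  # [(3, 7, 2), (7, 9, 1), (12, 14, 1)]
# slack=3: the gap between 9 and 12 is bridged.
print(merge_intervals(intervals, slack=3))   # [(3, 14, 4)]
```

With a negative slack of -X, two features are only joined when they share at least X positions, which is the behaviour the paragraph above describes.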
To cluster or merge only those rows that share the same value in a certain column, pass the argument by to cluster or merge.
import pyranges as pr
gr = pr.data.ensembl_gtf()
print(gr)
## +--------------+------------+--------------+-----------+-----------+-------+
## | Chromosome | Source | Feature | Start | End | +23 |
## | (category) | (object) | (category) | (int32) | (int32) | ... |
## |--------------+------------+--------------+-----------+-----------+-------|
## | 1 | havana | gene | 11868 | 14409 | ... |
## | 1 | havana | transcript | 11868 | 14409 | ... |
## | 1 | havana | exon | 11868 | 12227 | ... |
## | 1 | havana | exon | 12612 | 12721 | ... |
## | ... | ... | ... | ... | ... | ... |
## | 1 | havana | gene | 1173055 | 1179555 | ... |
## | 1 | havana | transcript | 1173055 | 1179555 | ... |
## | 1 | havana | exon | 1179364 | 1179555 | ... |
## | 1 | havana | exon | 1173055 | 1176396 | ... |
## +--------------+------------+--------------+-----------+-----------+-------+
## Stranded PyRanges object has 2,446 rows and 28 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 23 hidden columns: Score, Strand, Frame, gene_biotype, gene_id, gene_name, gene_source, ... (+ 16 more.)
print(gr.cluster(by="gene_id"))
## +--------------+----------------+----------------+-----------+-------+
## | Chromosome | Source | Feature | Start | +25 |
## | (category) | (object) | (category) | (int32) | ... |
## |--------------+----------------+----------------+-----------+-------|
## | 1 | ensembl_havana | gene | 1173883 | ... |
## | 1 | havana | transcript | 1173883 | ... |
## | 1 | havana | exon | 1173883 | ... |
## | 1 | havana | transcript | 1173902 | ... |
## | ... | ... | ... | ... | ... |
## | 1 | ensembl_havana | stop_codon | 450739 | ... |
## | 1 | ensembl_havana | CDS | 450742 | ... |
## | 1 | ensembl_havana | start_codon | 451675 | ... |
## | 1 | ensembl_havana | five_prime_utr | 451678 | ... |
## +--------------+----------------+----------------+-----------+-------+
## Stranded PyRanges object has 2,446 rows and 29 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 25 hidden columns: End, Score, Strand, Frame, gene_biotype, gene_id, gene_name, ... (+ 18 more.)
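The effect of by can be illustrated with a small plain-Python sketch. This is only an illustration under stated assumptions, not how PyRanges computes clusters: rows are plain dicts, the function name cluster_ids and the gene_id values "A" and "B" are hypothetical, and the actual cluster numbering in PyRanges may differ.

```python
from collections import defaultdict


def cluster_ids(rows, by=None):
    """Return one cluster ID per row, in input order.

    rows is a list of dicts with "Start" and "End" keys. When by is
    given, only rows with the same value in that column can share a
    cluster. Bookended rows are treated as belonging together.
    """
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = row[by] if by is not None else None
        groups[key].append(index)

    ids = [0] * len(rows)
    next_id = 0
    for indices in groups.values():
        indices.sort(key=lambda i: (rows[i]["Start"], rows[i]["End"]))
        cluster_end = None
        for i in indices:
            if cluster_end is None or rows[i]["Start"] > cluster_end:
                next_id += 1                      # start a new cluster
                cluster_end = rows[i]["End"]
            else:
                cluster_end = max(cluster_end, rows[i]["End"])
            ids[i] = next_id
    return ids


rows = [
    {"Start": 10, "End": 20, "gene_id": "A"},
    {"Start": 15, "End": 25, "gene_id": "B"},
    {"Start": 18, "End": 30, "gene_id": "A"},
    {"Start": 40, "End": 50, "gene_id": "A"},
]

print(cluster_ids(rows))                # [1, 1, 1, 2]
print(cluster_ids(rows, by="gene_id"))  # [1, 3, 1, 2]
```

Without by, the first three rows overlap and share a cluster. With by="gene_id", the row from gene "B" gets a cluster of its own even though it overlaps rows from gene "A".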
If you want to split a PyRanges at the points where its intervals overlap, you can use split:
import pyranges as pr
f1 = pr.data.f1()
print(f1)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 3 | 6 | interval1 | 0 | + |
## | chr1 | 8 | 9 | interval3 | 0 | + |
## | chr1 | 5 | 7 | interval2 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
print(f1.split())
## +--------------+-----------+-----------+------------+
## | Chromosome | Start | End | Strand |
## | (object) | (int32) | (int32) | (object) |
## |--------------+-----------+-----------+------------|
## | chr1 | 3 | 6 | + |
## | chr1 | 8 | 9 | + |
## | chr1 | 5 | 7 | - |
## +--------------+-----------+-----------+------------+
## Stranded PyRanges object has 3 rows and 4 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
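In the printout above the intervals come back unchanged, because the only two that overlap lie on different strands. The following plain-Python sketch shows what splitting does when overlapping intervals are treated together. It assumes that splitting cuts at every start and end position and keeps only the covered pieces; the function name split_intervals is hypothetical, strand is ignored, and this is not PyRanges's own code.

```python
def split_intervals(intervals):
    """Cut half-open (start, end) intervals at every start and end.

    Returns the non-overlapping pieces that are covered by at least
    one input interval; gaps between intervals are dropped.
    """
    boundaries = sorted({pos for interval in intervals for pos in interval})
    pieces = []
    for left, right in zip(boundaries, boundaries[1:]):
        if any(start <= left and right <= end for start, end in intervals):
            pieces.append((left, right))
    return pieces


# The three f1 intervals, with strand ignored.
print(split_intervals([(3, 6), (5, 7), (8, 9)]))
# [(3, 5), (5, 6), (6, 7), (8, 9)]
```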
five_end finds the starts of the regions (taking direction of transcription into account).
print(f1.five_end())
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 3 | 4 | interval1 | 0 | + |
## | chr1 | 8 | 9 | interval3 | 0 | + |
## | chr1 | 6 | 7 | interval2 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
three_end finds the ends of the regions (taking direction of transcription into account).
print(f1.three_end())
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 5 | 6 | interval1 | 0 | + |
## | chr1 | 8 | 9 | interval3 | 0 | + |
## | chr1 | 5 | 6 | interval2 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
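The strand-aware logic behind five_end and three_end can be sketched in a few lines of plain Python. The two helper functions below are hypothetical stand-ins that work on single half-open intervals, not the PyRanges methods themselves; they reproduce the coordinates printed above for the f1 data.

```python
def five_end(start, end, strand):
    """Return the 1-bp interval at the 5' end of a half-open interval."""
    return (start, start + 1) if strand == "+" else (end - 1, end)


def three_end(start, end, strand):
    """Return the 1-bp interval at the 3' end of a half-open interval."""
    return (end - 1, end) if strand == "+" else (start, start + 1)


# (start, end, strand) for interval1, interval3 and interval2 of f1.
f1_rows = [(3, 6, "+"), (8, 9, "+"), (5, 7, "-")]

print([five_end(*row) for row in f1_rows])   # [(3, 4), (8, 9), (6, 7)]
print([three_end(*row) for row in f1_rows])  # [(5, 6), (8, 9), (5, 6)]
```

On the minus strand the roles of Start and End are swapped, which is why interval2 (5 to 7) has its 5' end at 6 to 7 and its 3' end at 5 to 6.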
extend (also aliased as slack) extends the starts and ends of your intervals:
print(f1.slack(5))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 0 | 11 | interval1 | 0 | + |
## | chr1 | 3 | 14 | interval3 | 0 | + |
## | chr1 | 0 | 12 | interval2 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
print(f1.slack({"5": 2, "3": -1}))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 1 | 5 | interval1 | 0 | + |
## | chr1 | 6 | 8 | interval3 | 0 | + |
## | chr1 | 6 | 9 | interval2 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
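The two call styles shown above (a single number, or a dict keyed by "5" and "3") can be mimicked with a short plain-Python sketch. It is an illustration only: the helper name extend_interval is hypothetical, and clamping the start at 0 is an assumption inferred from the first printout, where starts never go negative.

```python
def extend_interval(start, end, strand, ext):
    """Extend one half-open interval.

    ext is either an int (applied to both sides) or a dict with the
    keys "5" and/or "3" (applied to the 5' and 3' end, respecting strand).
    Negative values shrink the interval. Starts are clamped at 0.
    """
    if isinstance(ext, int):
        new_start, new_end = start - ext, end + ext
    else:
        five, three = ext.get("5", 0), ext.get("3", 0)
        if strand == "+":
            new_start, new_end = start - five, end + three
        else:
            # On the minus strand the 5' end is the End coordinate.
            new_start, new_end = start - three, end + five
    return max(new_start, 0), new_end


f1_rows = [(3, 6, "+"), (8, 9, "+"), (5, 7, "-")]

print([extend_interval(*row, 5) for row in f1_rows])
# [(0, 11), (3, 14), (0, 12)]
print([extend_interval(*row, {"5": 2, "3": -1}) for row in f1_rows])
# [(1, 5), (6, 8), (6, 9)]
```

Both results match the Start and End columns in the printouts above.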
window splits your data into windows:
exons = pr.data.exons()
print(exons)
## +--------------+-----------+-----------+-------+
## | Chromosome | Start | End | +3 |
## | (category) | (int32) | (int32) | ... |
## |--------------+-----------+-----------+-------|
## | chrX | 135721701 | 135721963 | ... |
## | chrX | 135574120 | 135574598 | ... |
## | chrX | 47868945 | 47869126 | ... |
## | chrX | 77294333 | 77294480 | ... |
## | ... | ... | ... | ... |
## | chrY | 15409586 | 15409728 | ... |
## | chrY | 15478146 | 15478273 | ... |
## | chrY | 15360258 | 15361762 | ... |
## | chrY | 15467254 | 15467278 | ... |
## +--------------+-----------+-----------+-------+
## Stranded PyRanges object has 1,000 rows and 6 columns from 2 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 3 hidden columns: Name, Score, Strand
print(exons.window(5))
## +--------------+-----------+-----------+-------+
## | Chromosome | Start | End | +3 |
## | (category) | (int32) | (int32) | ... |
## |--------------+-----------+-----------+-------|
## | chrX | 135721701 | 135721706 | ... |
## | chrX | 135721706 | 135721711 | ... |
## | chrX | 135721711 | 135721716 | ... |
## | chrX | 135721716 | 135721721 | ... |
## | ... | ... | ... | ... |
## | chrY | 15467259 | 15467264 | ... |
## | chrY | 15467264 | 15467269 | ... |
## | chrY | 15467269 | 15467274 | ... |
## | chrY | 15467274 | 15467278 | ... |
## +--------------+-----------+-----------+-------+
## Stranded PyRanges object has 61,268 rows and 6 columns from 2 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 3 hidden columns: Name, Score, Strand
If you want all windows to start at a multiple of window_size, use the tile function. You can use the overlap flag to see how many basepairs each tile overlaps with the underlying interval:
print(exons.tile(5))
## +--------------+-----------+-----------+-------+
## | Chromosome | Start | End | +3 |
## | (category) | (int32) | (int32) | ... |
## |--------------+-----------+-----------+-------|
## | chrX | 135721700 | 135721705 | ... |
## | chrX | 135721705 | 135721710 | ... |
## | chrX | 135721710 | 135721715 | ... |
## | chrX | 135721715 | 135721720 | ... |
## | ... | ... | ... | ... |
## | chrY | 15467260 | 15467265 | ... |
## | chrY | 15467265 | 15467270 | ... |
## | chrY | 15467270 | 15467275 | ... |
## | chrY | 15467275 | 15467280 | ... |
## +--------------+-----------+-----------+-------+
## Stranded PyRanges object has 61,643 rows and 6 columns from 2 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 3 hidden columns: Name, Score, Strand
print(exons.tile(5, overlap=True))
## +--------------+-----------+-----------+-------+
## | Chromosome | Start | End | +4 |
## | (category) | (int32) | (int32) | ... |
## |--------------+-----------+-----------+-------|
## | chrX | 135721700 | 135721705 | ... |
## | chrX | 135721705 | 135721710 | ... |
## | chrX | 135721710 | 135721715 | ... |
## | chrX | 135721715 | 135721720 | ... |
## | ... | ... | ... | ... |
## | chrY | 15467260 | 15467265 | ... |
## | chrY | 15467265 | 15467270 | ... |
## | chrY | 15467270 | 15467275 | ... |
## | chrY | 15467275 | 15467280 | ... |
## +--------------+-----------+-----------+-------+
## Stranded PyRanges object has 61,643 rows and 7 columns from 2 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 4 hidden columns: Name, Score, Strand, TileOverlap
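The difference between window and tile is easiest to see on a single interval. The plain-Python sketch below uses two hypothetical helpers, windows and tiles, and is not PyRanges's implementation. It uses the last chrY exon from the printouts above (15467254 to 15467278) and reproduces the last rows shown for window(5) and tile(5). The third value in each tile is an assumed reading of the TileOverlap column: the number of basepairs the tile shares with the interval.

```python
def windows(start, end, size):
    """Cut a half-open interval into windows, starting at its own start."""
    return [(s, min(s + size, end)) for s in range(start, end, size)]


def tiles(start, end, size):
    """Cover a half-open interval with tiles that start at multiples of size.

    Each tile is (tile_start, tile_end, overlap), where overlap is the
    number of basepairs the tile shares with the interval.
    """
    first = start - start % size
    return [
        (s, s + size, min(s + size, end) - max(s, start))
        for s in range(first, end, size)
    ]


exon_start, exon_end = 15467254, 15467278

print(windows(exon_start, exon_end, 5)[-1])  # (15467274, 15467278)
print(tiles(exon_start, exon_end, 5)[0])     # (15467250, 15467255, 1)
print(tiles(exon_start, exon_end, 5)[-1])    # (15467275, 15467280, 3)
```

Windows never extend past the interval, so the last one can be shorter than the window size. Tiles always have full size, so the first and last tile can reach outside the interval, and the overlap value tells you by how much they actually cover it.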