31 GenomicFeatures: methods using genomic context

PyRanges provides a few methods that operate on genomic context. The standalone functions live in the pyranges.gf namespace, while the methods that are called on a PyRanges object live in its features accessor (gr.features).

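introns computes the introns of a PyRanges whose rows are annotated as genes (or transcripts) and exons, as in the example below, where the Feature column labels each row and gene_id and transcript_id identify the group it belongs to.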

import pyranges as pr
gr = pr.data.ucsc_bed()
print(gr)
## +--------------+-----------+-----------+------------+------------+-------+
## | Chromosome   | Start     | End       | Feature    | gene_id    | +4    |
## | (category)   | (int32)   | (int32)   | (object)   | (object)   | ...   |
## |--------------+-----------+-----------+------------+------------+-------|
## | chr1         | 12776117  | 12788726  | gene       | AADACL3    | ...   |
## | chr1         | 169075927 | 169101957 | gene       | ATP1B1     | ...   |
## | chr1         | 6845383   | 7829766   | gene       | CAMTA1     | ...   |
## | chr1         | 20915589  | 20945396  | gene       | CDA        | ...   |
## | ...          | ...       | ...       | ...        | ...        | ...   |
## | chrX         | 152661096 | 152663330 | exon       | PNMA6E     | ...   |
## | chrX         | 152661096 | 152666808 | transcript | PNMA6E     | ...   |
## | chrX         | 152664164 | 152664378 | exon       | PNMA6E     | ...   |
## | chrX         | 152666701 | 152666808 | exon       | PNMA6E     | ...   |
## +--------------+-----------+-----------+------------+------------+-------+
## Stranded PyRanges object has 5,519 rows and 9 columns from 30 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 4 hidden columns: transcript_id, Strand, exon_number, transcript_name
print(gr.features.introns(by="transcript")) # default is by gene
## +--------------+-----------+-----------+------------+--------------+-------+
## | Chromosome   | Start     | End       | Feature    | gene_id      | +4    |
## | (object)     | (int32)   | (int32)   | (object)   | (object)     | ...   |
## |--------------+-----------+-----------+------------+--------------+-------|
## | chr1         | 12227     | 12612     | intron     | LOC102725121 | ...   |
## | chr1         | 12721     | 13220     | intron     | LOC102725121 | ...   |
## | chr1         | 12227     | 12612     | intron     | DDX11L1      | ...   |
## | chr1         | 12721     | 13220     | intron     | DDX11L1      | ...   |
## | ...          | ...       | ...       | ...        | ...          | ...   |
## | chrX         | 9714193   | 9716613   | intron     | GPR143       | ...   |
## | chrX         | 9716706   | 9727371   | intron     | GPR143       | ...   |
## | chrX         | 9727466   | 9728756   | intron     | GPR143       | ...   |
## | chrX         | 9728866   | 9733607   | intron     | GPR143       | ...   |
## +--------------+-----------+-----------+------------+--------------+-------+
## Stranded PyRanges object has 4,128 rows and 9 columns from 30 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 4 hidden columns: transcript_id, Strand, exon_number, transcript_name
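
To make the idea concrete, the sketch below derives introns from plain (start, end) tuples without PyRanges: within a single transcript, the introns are the gaps between consecutive exons once the exons are sorted. The helper name introns_from_exons is hypothetical and not part of the PyRanges API, and the exon coordinates are inferred from the DDX11L1 transcript and introns printed above rather than read from the data.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons as half-open (start, end) tuples.

    Assumes all exons belong to a single transcript and do not overlap.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:  # adjacent exons leave no gap
            introns.append((prev_end, next_start))
    return introns


# Exons inferred (an assumption) from the DDX11L1 rows shown above
ddx11l1_exons = [(13220, 14409), (11873, 12227), (12612, 12721)]
ddx11l1_introns = introns_from_exons(ddx11l1_exons)
print(ddx11l1_introns)
# [(12227, 12612), (12721, 13220)]
```

Grouping exons by gene_id instead of transcript_id before calling such a helper would correspond to the default by="gene" behaviour mentioned in the comment above.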

tss and tes find the start and end sites of transcripts, respectively:

import pyranges as pr
gr = pr.data.ucsc_bed()
print(gr[gr.Feature == "transcript"])
## +--------------+-----------+-----------+------------+--------------+-------+
## | Chromosome   | Start     | End       | Feature    | gene_id      | +4    |
## | (category)   | (int32)   | (int32)   | (object)   | (object)     | ...   |
## |--------------+-----------+-----------+------------+--------------+-------|
## | chr1         | 11868     | 14362     | transcript | LOC102725121 | ...   |
## | chr1         | 11873     | 14409     | transcript | DDX11L1      | ...   |
## | chr1         | 30365     | 30503     | transcript | MIR1302-2    | ...   |
## | chr1         | 30365     | 30503     | transcript | MIR1302-9    | ...   |
## | ...          | ...       | ...       | ...        | ...          | ...   |
## | chrX         | 131337052 | 131352061 | transcript | RAP2C        | ...   |
## | chrX         | 134021661 | 134049287 | transcript | MOSPD1       | ...   |
## | chrX         | 152157367 | 152160757 | transcript | PNMA5        | ...   |
## | chrX         | 152661096 | 152666808 | transcript | PNMA6E       | ...   |
## +--------------+-----------+-----------+------------+--------------+-------+
## Stranded PyRanges object has 500 rows and 9 columns from 30 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 4 hidden columns: transcript_id, Strand, exon_number, transcript_name
print(gr.features.tes()) # one row per transcript
## +--------------+-----------+-----------+------------+--------------+-------+
## | Chromosome   | Start     | End       | Feature    | gene_id      | +4    |
## | (category)   | (int32)   | (int32)   | (object)   | (object)     | ...   |
## |--------------+-----------+-----------+------------+--------------+-------|
## | chr1         | 14361     | 14362     | tes        | LOC102725121 | ...   |
## | chr1         | 14408     | 14409     | tes        | DDX11L1      | ...   |
## | chr1         | 30502     | 30503     | tes        | MIR1302-2    | ...   |
## | chr1         | 30502     | 30503     | tes        | MIR1302-9    | ...   |
## | ...          | ...       | ...       | ...        | ...          | ...   |
## | chrX         | 131337052 | 131337053 | tes        | RAP2C        | ...   |
## | chrX         | 134021661 | 134021662 | tes        | MOSPD1       | ...   |
## | chrX         | 152157367 | 152157368 | tes        | PNMA5        | ...   |
## | chrX         | 152661096 | 152661097 | tes        | PNMA6E       | ...   |
## +--------------+-----------+-----------+------------+--------------+-------+
## Stranded PyRanges object has 500 rows and 9 columns from 30 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 4 hidden columns: transcript_id, Strand, exon_number, transcript_name
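
The output above suggests that both sites are strand-aware 1-bp intervals: on the forward strand the end site is the last base of the transcript, on the reverse strand it is the first. The sketch below reproduces that rule in plain Python. The function names tss_site and tes_site are hypothetical, and the "-" strand for PNMA6E is an assumption, since the Strand column is hidden in the printed tables.

```python
def tss_site(start, end, strand):
    """1-bp transcription start site of a half-open [start, end) interval."""
    return (start, start + 1) if strand == "+" else (end - 1, end)


def tes_site(start, end, strand):
    """1-bp transcription end site of a half-open [start, end) interval."""
    return (end - 1, end) if strand == "+" else (start, start + 1)


# LOC102725121 transcript from the table above (forward strand)
print(tes_site(11868, 14362, "+"))
# (14361, 14362)

# PNMA6E transcript from the table above (strand assumed to be "-")
print(tes_site(152661096, 152666808, "-"))
# (152661096, 152661097)
```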

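tile_genome splits a PyRanges of chromosome sizes into consecutive tiles of a fixed width. As the output below shows, with tile_last=False the last tile of each chromosome is cut at the chromosome end, while with tile_last=True it keeps the full tile width and may extend past the end.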

import pyranges as pr
cs = pr.data.chromsizes()
print(cs)
# can also do
# pip install pyranges_db
# import pyranges_db as db
# cs = db.ucsc.chromsizes("hg19")
## +--------------+-----------+-----------+
## | Chromosome   | Start     | End       |
## | (category)   | (int32)   | (int32)   |
## |--------------+-----------+-----------|
## | chr1         | 0         | 249250621 |
## | chr2         | 0         | 243199373 |
## | chr3         | 0         | 198022430 |
## | chr4         | 0         | 191154276 |
## | ...          | ...       | ...       |
## | chr22        | 0         | 51304566  |
## | chrM         | 0         | 16571     |
## | chrX         | 0         | 155270560 |
## | chrY         | 0         | 59373566  |
## +--------------+-----------+-----------+
## Unstranded PyRanges object has 25 rows and 3 columns from 25 chromosomes.
## For printing, the PyRanges was sorted on Chromosome.
tile_size = int(1e6)
print(pr.gf.tile_genome(cs, tile_size, tile_last=False))
## +--------------+-----------+-----------+
## | Chromosome   | Start     | End       |
## | (category)   | (int32)   | (int32)   |
## |--------------+-----------+-----------|
## | chr1         | 0         | 1000000   |
## | chr1         | 1000000   | 2000000   |
## | chr1         | 2000000   | 3000000   |
## | chr1         | 3000000   | 4000000   |
## | ...          | ...       | ...       |
## | chrY         | 56000000  | 57000000  |
## | chrY         | 57000000  | 58000000  |
## | chrY         | 58000000  | 59000000  |
## | chrY         | 59000000  | 59373566  |
## +--------------+-----------+-----------+
## Unstranded PyRanges object has 3,114 rows and 3 columns from 25 chromosomes.
## For printing, the PyRanges was sorted on Chromosome.
print(pr.gf.tile_genome(cs, tile_size, tile_last=True))
## +--------------+-----------+-----------+
## | Chromosome   | Start     | End       |
## | (category)   | (int32)   | (int32)   |
## |--------------+-----------+-----------|
## | chr1         | 0         | 1000000   |
## | chr1         | 1000000   | 2000000   |
## | chr1         | 2000000   | 3000000   |
## | chr1         | 3000000   | 4000000   |
## | ...          | ...       | ...       |
## | chrY         | 56000000  | 57000000  |
## | chrY         | 57000000  | 58000000  |
## | chrY         | 58000000  | 59000000  |
## | chrY         | 59000000  | 60000000  |
## +--------------+-----------+-----------+
## Unstranded PyRanges object has 3,114 rows and 3 columns from 25 chromosomes.
## For printing, the PyRanges was sorted on Chromosome.
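
The effect of tile_last can be reproduced for a single chromosome in a few lines of plain Python. This is only a sketch of the behaviour seen in the output above, not the PyRanges implementation; the helper name tile_chromosome is hypothetical.

```python
def tile_chromosome(size, tile_size, tile_last=False):
    """Split [0, size) into consecutive half-open tiles of width tile_size.

    With tile_last=False the final tile is cut at the chromosome end;
    with tile_last=True it keeps the full width and may overshoot.
    """
    tiles = []
    for start in range(0, size, tile_size):
        end = start + tile_size
        if not tile_last:
            end = min(end, size)
        tiles.append((start, end))
    return tiles


chry_size = 59373566  # chrY length from the chromsizes table above
clipped = tile_chromosome(chry_size, 1_000_000, tile_last=False)
full = tile_chromosome(chry_size, 1_000_000, tile_last=True)
print(len(clipped), clipped[-1])
# 60 (59000000, 59373566)
print(len(full), full[-1])
# 60 (59000000, 60000000)
```

Either way the number of tiles is the same, which matches the identical row counts in the two printed results.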

genome_bounds removes all intervals in the PyRanges that fall outside the genome bounds. If clip=True is given, intervals that only partly exceed the bounds are instead trimmed, so that the part inside the bounds is kept.

import pyranges as pr
cs = pr.data.chromsizes()
gr = pr.data.chipseq()
print(cs)
# can also do
# pip install pyranges_db
# import pyranges_db as db
# cs = db.ucsc.chromsizes("hg19")
## +--------------+-----------+-----------+
## | Chromosome   | Start     | End       |
## | (category)   | (int32)   | (int32)   |
## |--------------+-----------+-----------|
## | chr1         | 0         | 249250621 |
## | chr2         | 0         | 243199373 |
## | chr3         | 0         | 198022430 |
## | chr4         | 0         | 191154276 |
## | ...          | ...       | ...       |
## | chr22        | 0         | 51304566  |
## | chrM         | 0         | 16571     |
## | chrX         | 0         | 155270560 |
## | chrY         | 0         | 59373566  |
## +--------------+-----------+-----------+
## Unstranded PyRanges object has 25 rows and 3 columns from 25 chromosomes.
## For printing, the PyRanges was sorted on Chromosome.
print(pr.gf.genome_bounds(gr, cs, clip=True))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | 0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | 0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | U0         | 0         | -            |
## | chrY         | 13517892  | 13517917  | U0         | 0         | -            |
## | chrY         | 8010951   | 8010976   | U0         | 0         | -            |
## | chrY         | 7405376   | 7405401   | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
print(pr.gf.genome_bounds(gr, cs))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | 0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | 0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | U0         | 0         | -            |
## | chrY         | 13517892  | 13517917  | U0         | 0         | -            |
## | chrY         | 8010951   | 8010976   | U0         | 0         | -            |
## | chrY         | 7405376   | 7405401   | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 9,979 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
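
The difference between dropping and clipping can be illustrated on plain tuples. The sketch below follows the description above under the assumption that an interval counts as out of bounds when it starts before 0 or ends after the chromosome size; the helper name within_bounds and the toy intervals are hypothetical.

```python
def within_bounds(intervals, chromsizes, clip=False):
    """Filter (chrom, start, end) tuples against a dict of chromosome sizes.

    clip=False: drop every interval that is not fully inside [0, size).
    clip=True:  trim overhanging intervals and drop only those that lie
                completely outside the chromosome.
    """
    kept = []
    for chrom, start, end in intervals:
        size = chromsizes[chrom]
        if clip:
            if start >= size or end <= 0:
                continue  # no part of the interval is inside the chromosome
            kept.append((chrom, max(start, 0), min(end, size)))
        elif start >= 0 and end <= size:
            kept.append((chrom, start, end))
    return kept


sizes = {"chrM": 16571}  # chrM length from the chromsizes table above
toy = [("chrM", 100, 200), ("chrM", 16500, 16600), ("chrM", 20000, 20100)]
dropped = within_bounds(toy, sizes)
clipped_toy = within_bounds(toy, sizes, clip=True)
print(dropped)
# [('chrM', 100, 200)]
print(clipped_toy)
# [('chrM', 100, 200), ('chrM', 16500, 16571)]
```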

random creates a PyRanges of random intervals, placed within the given PyRanges of chromosome sizes:

pr.random(n=1000, length=100, chromsizes=None, strand=True)

If no chromsizes are given, hg19 is used (from pr.data.chromsizes).
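
As a rough illustration of what such a generator has to do, the sketch below draws fixed-length intervals inside a dict of chromosome sizes using only the standard library. It is not the PyRanges implementation: the helper name random_intervals, the seed parameter, and the choice to sample chromosomes in proportion to their size are all assumptions made for this example.

```python
import random


def random_intervals(n, length, chromsizes, strand=True, seed=None):
    """Draw n intervals of a fixed length that fit inside their chromosome.

    Assumes every chromosome is at least `length` bases long.
    """
    rng = random.Random(seed)
    chroms = list(chromsizes)
    weights = [chromsizes[c] for c in chroms]  # assumption: size-weighted sampling
    intervals = []
    for chrom in rng.choices(chroms, weights=weights, k=n):
        start = rng.randrange(0, chromsizes[chrom] - length + 1)
        row = (chrom, start, start + length)
        if strand:
            row += (rng.choice("+-"),)
        intervals.append(row)
    return intervals


# A few hg19 sizes taken from the chromsizes table above
sizes_subset = {"chr1": 249250621, "chrM": 16571, "chrY": 59373566}
sample = random_intervals(n=1000, length=100, chromsizes=sizes_subset, seed=0)
print(len(sample))
# 1000
```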