2 Loading/Creating PyRanges
A PyRanges object can be built in three ways:
- from a Pandas dataframe
- using the PyRanges constructor with seqnames, starts and ends (and optionally strands) passed individually
- using one of the custom reader functions for genomic data (read_bed, read_bam or read_gtf)
Using a DataFrame
If you instantiate a PyRanges object from a dataframe, the dataframe must contain at least the columns Chromosome, Start and End. A column called Strand is optional. Any other columns in the dataframe are treated as metadata.
import pandas as pd
import pyranges as pr
chipseq = pr.get_example_path("chipseq.bed")
df = pd.read_table(chipseq, header=None, names="Chromosome Start End Name Score Strand".split())
print(df.head(2))
## Chromosome Start End Name Score Strand
## 0 chr8 28510032 28510057 U0 0 -
## 1 chr7 107153363 107153388 U0 0 -
print(df.tail(2))
## Chromosome Start End Name Score Strand
## 9998 chr1 194245558 194245583 U0 0 +
## 9999 chr8 57916061 57916086 U0 0 +
print(pr.PyRanges(df))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int64) | (int64) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr8 | 28510032 | 28510057 | U0 | 0 | - |
## | chr7 | 107153363 | 107153388 | U0 | 0 | - |
## | chr5 | 135821802 | 135821827 | U0 | 0 | - |
## | ... | ... | ... | ... | ... | ... |
## | chr6 | 89296757 | 89296782 | U0 | 0 | - |
## | chr1 | 194245558 | 194245583 | U0 | 0 | + |
## | chr8 | 57916061 | 57916086 | U0 | 0 | + |
## +--------------+-----------+-----------+------------+-----------+--------------+
## PyRanges object has 10000 sequences from 24 chromosomes.
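When a table uses different column names, it can be renamed before it is handed to the constructor. The sketch below uses only pandas. The helper `to_pyranges_frame` and the sample column names `chrom`, `chromStart` and `chromEnd` are hypothetical and not part of pyranges. It renames the columns and checks that the three required ones are present.

```python
import pandas as pd

# Column names a dataframe needs before it can become a PyRanges object.
REQUIRED = ("Chromosome", "Start", "End")


def to_pyranges_frame(df, rename=None):
    """Return a renamed copy of df, or raise if a required column is missing.

    Hypothetical helper for illustration; not part of the pyranges API.
    """
    out = df.rename(columns=rename or {})
    missing = [col for col in REQUIRED if col not in out.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return out


# Made-up input with BED-style column names.
raw = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "chromStart": [10, 50, 5],
    "chromEnd": [20, 60, 15],
    "score": [0.5, 0.9, 0.1],  # extra columns are kept as metadata
})

ready = to_pyranges_frame(
    raw, rename={"chrom": "Chromosome", "chromStart": "Start", "chromEnd": "End"}
)
print(ready.columns.tolist())
```

The resulting dataframe can then be passed to `pr.PyRanges(ready)` in the same way as `df` in the example above.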
Using constructor keywords
Another way to instantiate a PyRanges object is to pass the individual columns to the constructor as keyword arguments:
gr = pr.PyRanges(seqnames=df.Chromosome, starts=df.Start, ends=df.End)
print(gr)
## +--------------+-----------+-----------+
## | Chromosome | Start | End |
## | (object) | (int64) | (int64) |
## |--------------+-----------+-----------|
## | chr8 | 28510032 | 28510057 |
## | chr7 | 107153363 | 107153388 |
## | chr5 | 135821802 | 135821827 |
## | ... | ... | ... |
## | chr6 | 89296757 | 89296782 |
## | chr1 | 194245558 | 194245583 |
## | chr8 | 57916061 | 57916086 |
## +--------------+-----------+-----------+
## PyRanges object has 10000 sequences from 24 chromosomes.
PyRanges objects can also be built from basic Python datatypes. A scalar, such as a single chromosome name or strand, is repeated for every interval:
gr = pr.PyRanges(seqnames="chr1", strands="+", starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)
## +--------------+-----------+-----------+--------------+
## | Chromosome | Start | End | Strand |
## | (category) | (int64) | (int64) | (category) |
## |--------------+-----------+-----------+--------------|
## | chr1 | 0 | 3 | + |
## | chr1 | 1 | 4 | + |
## | chr1 | 2 | 5 | + |
## +--------------+-----------+-----------+--------------+
## PyRanges object has 3 sequences from 1 chromosomes.
gr = pr.PyRanges(seqnames="chr1 chr2 chr3".split(), strands="+ - +".split(), starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)
## +--------------+-----------+-----------+------------+
## | Chromosome | Start | End | Strand |
## | (object) | (int64) | (int64) | (object) |
## |--------------+-----------+-----------+------------|
## | chr1 | 0 | 3 | + |
## | chr2 | 1 | 4 | - |
## | chr3 | 2 | 5 | + |
## +--------------+-----------+-----------+------------+
## PyRanges object has 3 sequences from 3 chromosomes.
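For comparison, the same three intervals can be assembled as a dataframe from plain Python values, because pandas also repeats scalars against list-like columns. This is a sketch of the dataframe route using only pandas. It does not describe how the PyRanges constructor works internally.

```python
import pandas as pd

# Scalars ("chr1", "+") are broadcast to the length of the list columns.
df = pd.DataFrame({
    "Chromosome": "chr1",
    "Start": [0, 1, 2],
    "End": [3, 4, 5],
    "Strand": "+",
})
print(df)
```

A dataframe built this way has the required Chromosome, Start and End columns, so it can be handed to `pr.PyRanges(df)` as described in the dataframe section.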
Using read_bed, read_gtf or read_bam
The pyranges library can create PyRanges from three common file formats: gtf, bed and bam [1].
ensembl_path = pr.get_example_path("ensembl.gtf")
gr = pr.read_gtf(ensembl_path)
print(gr)
## +--------------+-----------+-----------+--------------+--------------+-----------------+-----------------+--------------+-----------------+
## | Chromosome | Start | End | Strand | Feature | GeneID | TranscriptID | ExonNumber | ExonID |
## | (category) | (int64) | (int64) | (category) | (category) | (object) | (object) | (float64) | (object) |
## |--------------+-----------+-----------+--------------+--------------+-----------------+-----------------+--------------+-----------------|
## | 1 | 11869 | 14409 | + | gene | ENSG00000223972 | nan | nan | nan |
## | 1 | 11869 | 14409 | + | transcript | ENSG00000223972 | ENST00000456328 | nan | nan |
## | 1 | 11869 | 12227 | + | exon | ENSG00000223972 | ENST00000456328 | 1.0 | ENSE00002234944 |
## | ... | ... | ... | ... | ... | ... | ... | ... | ... |
## | 1 | 133374 | 133723 | - | exon | ENSG00000238009 | ENST00000610542 | 1.0 | ENSE00003748456 |
## | 1 | 129055 | 129223 | - | exon | ENSG00000238009 | ENST00000610542 | 2.0 | ENSE00003734824 |
## | 1 | 120874 | 120932 | - | exon | ENSG00000238009 | ENST00000610542 | 3.0 | ENSE00003740919 |
## +--------------+-----------+-----------+--------------+--------------+-----------------+-----------------+--------------+-----------------+
## PyRanges object has 95 sequences from 1 chromosomes.
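Identifier columns like the ones shown above come from the ninth, free-text attribute field of a GTF record, which holds `key "value";` pairs. The stand-alone sketch below shows one way to split such a field using only the standard library. The function name `parse_gtf_attributes` is hypothetical. The sample record is hand-written from values in the table above, with an invented source field. This is an illustration of the file format, not how pyranges itself parses the file.

```python
import re

# Matches key "value"; pairs. Unquoted values (e.g. exon_number 1;) also match.
_ATTR = re.compile(r'(\w+)\s+"?([^";]+)"?\s*;')


def parse_gtf_attributes(field):
    """Split a GTF attribute field into a dict of strings (illustrative helper)."""
    return {key: value for key, value in _ATTR.findall(field)}


# Hand-written sample record; the nine GTF fields are tab-separated.
line = (
    '1\thavana\texon\t11869\t12227\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";'
)
columns = line.split("\t")
attrs = parse_gtf_attributes(columns[8])

# Chromosome, start, end, strand and one parsed attribute.
print(columns[0], columns[3], columns[4], columns[6], attrs["gene_id"])
```

All parsed values are strings. Converting fields such as the exon number to a numeric type is left to the caller.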
[1] This is the same behavior as bedtools intersect.