2 Loading/Creating PyRanges

A PyRanges object can be built in three ways:

  1. from a pandas dataframe
  2. using the PyRanges constructor, passing the seqnames, starts and ends (and optionally strands) as separate keyword arguments
  3. using one of the custom reader functions for genomic data (read_bed, read_bam or read_gtf)

Using a DataFrame

If you instantiate a PyRanges object from a dataframe, the dataframe should at least contain the columns Chromosome, Start and End. A column called Strand is optional. Any other columns in the dataframe are treated as metadata.

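import pandas as pd
import pyranges as pr

# Path to the example BED file bundled with pyranges
chipseq = pr.get_example_path("chipseq.bed")
# The file has no header row, so the column names are supplied explicitly
df = pd.read_table(chipseq, header=None, names="Chromosome Start End Name Score Strand".split())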
print(df.head(2))
##   Chromosome      Start        End Name  Score Strand
## 0       chr8   28510032   28510057   U0      0      -
## 1       chr7  107153363  107153388   U0      0      -
print(df.tail(2))
##      Chromosome      Start        End Name  Score Strand
## 9998       chr1  194245558  194245583   U0      0      +
## 9999       chr8   57916061   57916086   U0      0      +
print(pr.PyRanges(df))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr8         | 28510032  | 28510057  | U0         | 0         | -            |
## | chr7         | 107153363 | 107153388 | U0         | 0         | -            |
## | chr5         | 135821802 | 135821827 | U0         | 0         | -            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chr6         | 89296757  | 89296782  | U0         | 0         | -            |
## | chr1         | 194245558 | 194245583 | U0         | 0         | +            |
## | chr8         | 57916061  | 57916086  | U0         | 0         | +            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## PyRanges object has 10000 sequences from 24 chromosomes.
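
Before handing a dataframe to the constructor, it can be useful to check that the required columns are present. The following is a minimal sketch using only the standard library; the helper names missing_required_columns and metadata_columns are hypothetical and not part of pyranges, and the column rules are the ones described in this section.

```python
# Column names that this section says a dataframe must contain.
REQUIRED_COLUMNS = ("Chromosome", "Start", "End")
# Optional column with a special meaning.
OPTIONAL_COLUMNS = ("Strand",)


def missing_required_columns(columns):
    """Return the required column names that are absent from `columns`.

    Hypothetical helper for illustration; not part of the pyranges API.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


def metadata_columns(columns):
    """Return the columns that would be treated as metadata, in order."""
    reserved = set(REQUIRED_COLUMNS) | set(OPTIONAL_COLUMNS)
    return [name for name in columns if name not in reserved]


names = "Chromosome Start End Name Score Strand".split()
missing = missing_required_columns(names)
metadata = metadata_columns(names)
```

With a real dataframe, the same functions could be called as missing_required_columns(df.columns); an empty result means the dataframe has the columns PyRanges needs.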

Using constructor keywords

The second way to instantiate a PyRanges object is to pass the columns to the constructor individually as keyword arguments:

gr = pr.PyRanges(seqnames=df.Chromosome, starts=df.Start, ends=df.End)
print(gr)
## +--------------+-----------+-----------+
## | Chromosome   | Start     | End       |
## | (object)     | (int64)   | (int64)   |
## |--------------+-----------+-----------|
## | chr8         | 28510032  | 28510057  |
## | chr7         | 107153363 | 107153388 |
## | chr5         | 135821802 | 135821827 |
## | ...          | ...       | ...       |
## | chr6         | 89296757  | 89296782  |
## | chr1         | 194245558 | 194245583 |
## | chr8         | 57916061  | 57916086  |
## +--------------+-----------+-----------+
## PyRanges object has 10000 sequences from 24 chromosomes.

The constructor also accepts basic Python datatypes such as strings, lists and tuples. A single string given for seqnames or strands is applied to every interval:

gr = pr.PyRanges(seqnames="chr1", strands="+", starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)
## +--------------+-----------+-----------+--------------+
## | Chromosome   |     Start |       End | Strand       |
## | (category)   |   (int64) |   (int64) | (category)   |
## |--------------+-----------+-----------+--------------|
## | chr1         |         0 |         3 | +            |
## | chr1         |         1 |         4 | +            |
## | chr1         |         2 |         5 | +            |
## +--------------+-----------+-----------+--------------+
## PyRanges object has 3 sequences from 1 chromosomes.
gr = pr.PyRanges(seqnames="chr1 chr2 chr3".split(), strands="+ - +".split(), starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)
## +--------------+-----------+-----------+------------+
## | Chromosome   |     Start |       End | Strand     |
## | (object)     |   (int64) |   (int64) | (object)   |
## |--------------+-----------+-----------+------------|
## | chr1         |         0 |         3 | +          |
## | chr2         |         1 |         4 | -          |
## | chr3         |         2 |         5 | +          |
## +--------------+-----------+-----------+------------+
## PyRanges object has 3 sequences from 3 chromosomes.
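
One way to picture the first call above is that the scalar seqnames and strands values are repeated to match the length of starts and ends. The sketch below illustrates that idea in plain Python; broadcast_columns is a hypothetical helper, and this is not a description of how the PyRanges constructor is actually implemented.

```python
def broadcast_columns(**columns):
    """Repeat scalar values so every column has the same length.

    Hypothetical helper for illustration; not part of the pyranges API.
    """
    def is_scalar(value):
        return isinstance(value, (str, int, float))

    # Lengths of all arguments that are already sequences.
    lengths = {len(value) for value in columns.values() if not is_scalar(value)}
    if len(lengths) > 1:
        raise ValueError("sequence arguments must all have the same length")
    n = lengths.pop() if lengths else 1

    return {
        name: [value] * n if is_scalar(value) else list(value)
        for name, value in columns.items()
    }


rows = broadcast_columns(seqnames="chr1", strands="+", starts=[0, 1, 2], ends=(3, 4, 5))
```

Here rows["seqnames"] holds "chr1" three times, matching the three intervals shown in the first printed table.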

Using read_bed, read_gtf or read_bam

The pyranges library can create PyRanges objects from three common file formats, namely gtf, bed and bam [1].

ensembl_path = pr.get_example_path("ensembl.gtf")
gr = pr.read_gtf(ensembl_path)
print(gr)
## +--------------+-----------+-----------+--------------+--------------+-----------------+-----------------+--------------+-----------------+
## | Chromosome   | Start     | End       | Strand       | Feature      | GeneID          | TranscriptID    | ExonNumber   | ExonID          |
## | (category)   | (int64)   | (int64)   | (category)   | (category)   | (object)        | (object)        | (float64)    | (object)        |
## |--------------+-----------+-----------+--------------+--------------+-----------------+-----------------+--------------+-----------------|
## | 1            | 11869     | 14409     | +            | gene         | ENSG00000223972 | nan             | nan          | nan             |
## | 1            | 11869     | 14409     | +            | transcript   | ENSG00000223972 | ENST00000456328 | nan          | nan             |
## | 1            | 11869     | 12227     | +            | exon         | ENSG00000223972 | ENST00000456328 | 1.0          | ENSE00002234944 |
## | ...          | ...       | ...       | ...          | ...          | ...             | ...             | ...          | ...             |
## | 1            | 133374    | 133723    | -            | exon         | ENSG00000238009 | ENST00000610542 | 1.0          | ENSE00003748456 |
## | 1            | 129055    | 129223    | -            | exon         | ENSG00000238009 | ENST00000610542 | 2.0          | ENSE00003734824 |
## | 1            | 120874    | 120932    | -            | exon         | ENSG00000238009 | ENST00000610542 | 3.0          | ENSE00003740919 |
## +--------------+-----------+-----------+--------------+--------------+-----------------+-----------------+--------------+-----------------+
## PyRanges object has 95 sequences from 1 chromosomes.
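
To make the relationship between a bed file and the PyRanges columns concrete, here is a rough standard-library sketch that parses six-column bed lines into the column names used earlier in this chapter. The function parse_bed6 is hypothetical, it assumes tab-separated lines with exactly the six fields Chromosome, Start, End, Name, Score and Strand, and it is not how read_bed is implemented.

```python
import csv
import io

# Column names used for the chipseq example earlier in this chapter.
BED6_COLUMNS = "Chromosome Start End Name Score Strand".split()


def parse_bed6(handle):
    """Parse tab-separated six-column bed lines into a dict of column lists.

    Hypothetical helper for illustration; not part of the pyranges API.
    """
    columns = {name: [] for name in BED6_COLUMNS}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and common header or comment lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name, score, strand = row[:6]
        columns["Chromosome"].append(chrom)
        columns["Start"].append(int(start))
        columns["End"].append(int(end))
        columns["Name"].append(name)
        columns["Score"].append(int(score))
        columns["Strand"].append(strand)
    return columns


# The first two records of the chipseq example shown above.
example = io.StringIO(
    "chr8\t28510032\t28510057\tU0\t0\t-\n"
    "chr7\t107153363\t107153388\tU0\t0\t-\n"
)
columns = parse_bed6(example)
```

In practice read_bed handles this step, so a sketch like this is only useful for understanding which bed fields end up in which column.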

  1. This is the same behavior as bedtools intersect.