2 Loading/Creating PyRanges

A PyRanges object can be built in four ways:

  1. from a Pandas dataframe
  2. using the PyRanges constructor with the chromosomes, starts and ends (and optionally strands) passed individually
  3. using one of the custom reader functions for genomic data (read_bed, read_bam, read_gtf or read_gff3)
  4. from a dict (like the ones produced by to_example)

Using a DataFrame

If you instantiate a PyRanges object from a dataframe, the dataframe should at least contain the columns Chromosome, Start and End. A column called Strand is optional. Any other columns in the dataframe are treated as metadata.

import pandas as pd
import pyranges as pr
chipseq = pr.get_example_path("chipseq.bed")
df = pd.read_csv(chipseq, header=None, names="Chromosome Start End Name Score Strand".split(), sep="\t")
print(df.head(2))
##   Chromosome      Start        End Name  Score Strand
## 0       chr8   28510032   28510057   U0      0      -
## 1       chr7  107153363  107153388   U0      0      -
print(df.tail(2))
##      Chromosome      Start        End Name  Score Strand
## 9998       chr1  194245558  194245583   U0      0      +
## 9999       chr8   57916061   57916086   U0      0      +
print(pr.PyRanges(df))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | 0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | 0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | U0         | 0         | -            |
## | chrY         | 13517892  | 13517917  | U0         | 0         | -            |
## | chrY         | 8010951   | 8010976   | U0         | 0         | -            |
## | chrY         | 7405376   | 7405401   | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
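
The column rules above can be checked before calling the constructor. The following is a minimal sketch in plain Python; the helper name split_columns is hypothetical and not part of pyranges, and it only inspects column names, not dtypes or values.

```python
# Columns that a dataframe must have before it is passed to pr.PyRanges.
REQUIRED_COLUMNS = ("Chromosome", "Start", "End")


def split_columns(columns):
    """Return (missing, metadata) for an iterable of column names.

    missing:  required columns that are absent
    metadata: columns that are neither required nor Strand
    (hypothetical helper, written only for illustration)
    """
    columns = list(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    reserved = set(REQUIRED_COLUMNS) | {"Strand"}
    metadata = [c for c in columns if c not in reserved]
    return missing, metadata


missing, metadata = split_columns("Chromosome Start End Name Score Strand".split())
print(missing)   # []
print(metadata)  # ['Name', 'Score']
```

With the dataframe from the example above, split_columns(df.columns) would report no missing columns and list Name and Score as metadata.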

Using constructor keywords

Another way to instantiate a PyRanges object is to pass the columns to the constructor as keyword arguments:

gr = pr.PyRanges(chromosomes=df.Chromosome, starts=df.Start, ends=df.End)
print(gr)
## +--------------+-----------+-----------+
## | Chromosome   | Start     | End       |
## | (category)   | (int32)   | (int32)   |
## |--------------+-----------+-----------|
## | chr1         | 100079649 | 100079674 |
## | chr1         | 212609534 | 212609559 |
## | chr1         | 223587418 | 223587443 |
## | chr1         | 202450161 | 202450186 |
## | ...          | ...       | ...       |
## | chrY         | 11942770  | 11942795  |
## | chrY         | 8316773   | 8316798   |
## | chrY         | 7463444   | 7463469   |
## | chrY         | 7405376   | 7405401   |
## +--------------+-----------+-----------+
## Unstranded PyRanges object has 10,000 rows and 3 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome.

PyRanges objects can also be built from basic Python datatypes. A single string, such as chromosomes="chr1" below, is applied to every interval:

gr = pr.PyRanges(chromosomes="chr1", strands="+", starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)
## +--------------+-----------+-----------+--------------+
## | Chromosome   |     Start |       End | Strand       |
## | (category)   |   (int32) |   (int32) | (category)   |
## |--------------+-----------+-----------+--------------|
## | chr1         |         0 |         3 | +            |
## | chr1         |         1 |         4 | +            |
## | chr1         |         2 |         5 | +            |
## +--------------+-----------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 4 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
gr = pr.PyRanges(chromosomes="chr1 chr2 chr3".split(), strands="+ - +".split(), starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)
## +--------------+-----------+-----------+--------------+
## | Chromosome   |     Start |       End | Strand       |
## | (category)   |   (int32) |   (int32) | (category)   |
## |--------------+-----------+-----------+--------------|
## | chr1         |         0 |         3 | +            |
## | chr2         |         1 |         4 | -            |
## | chr3         |         2 |         5 | +            |
## +--------------+-----------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 4 columns from 3 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.

Using read_bed, read_gtf, read_gff3 or read_bam

The pyranges library can create PyRanges from common file formats, namely gtf/gff, gff3, bed and bam.

ensembl_path = pr.get_example_path("ensembl.gtf")
gr = pr.read_gtf(ensembl_path)
print(gr)
## +--------------+------------+--------------+-----------+-----------+-------+
## | Chromosome   | Source     | Feature      | Start     | End       | +21   |
## | (category)   | (object)   | (category)   | (int32)   | (int32)   | ...   |
## |--------------+------------+--------------+-----------+-----------+-------|
## | 1            | havana     | gene         | 11868     | 14409     | ...   |
## | 1            | havana     | transcript   | 11868     | 14409     | ...   |
## | 1            | havana     | exon         | 11868     | 12227     | ...   |
## | 1            | havana     | exon         | 12612     | 12721     | ...   |
## | ...          | ...        | ...          | ...       | ...       | ...   |
## | 1            | ensembl    | transcript   | 120724    | 133723    | ...   |
## | 1            | ensembl    | exon         | 133373    | 133723    | ...   |
## | 1            | ensembl    | exon         | 129054    | 129223    | ...   |
## | 1            | ensembl    | exon         | 120873    | 120932    | ...   |
## +--------------+------------+--------------+-----------+-----------+-------+
## Stranded PyRanges object has 95 rows and 26 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 21 hidden columns: Score, Strand, Frame, gene_id, gene_version, gene_name, gene_source, ... (+ 14 more.)

To read bam files, the optional bamread library must be installed. Use conda install -c bioconda bamread or pip install bamread to install it.

read_bam takes the arguments sparse, mapq, required_flag and filter_flag, which have the default values True, 0, 0 and 1540, respectively. With sparse set to True, only the columns ['Chromosome', 'Start', 'End', 'Strand', 'Flag'] are fetched. Setting sparse to False additionally gives you the columns ['QueryStart', 'QueryEnd', 'Name', 'Cigar', 'Quality'], but takes more time and memory.
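
The default filter_flag of 1540 is easier to read once it is split into the individual SAM flag bits. The sketch below decodes a flag value and applies the filters the way samtools does with -f and -F. That read_bam applies them identically, and that mapq is a minimum mapping quality, are assumptions here. The names decode_flag and passes_filters are hypothetical helpers, not part of pyranges or bamread.

```python
# Flag bits as defined in the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def passes_filters(flag, mapq, min_mapq=0, required_flag=0, filter_flag=1540):
    """Assumed semantics: keep a read if its mapping quality is high enough,
    all required bits are set, and none of the filtered bits are set."""
    return (
        mapq >= min_mapq
        and (flag & required_flag) == required_flag
        and (flag & filter_flag) == 0
    )


print(decode_flag(1540))         # ['unmapped', 'qc_fail', 'duplicate']
print(passes_filters(16, 30))    # True: mapped read on the reverse strand
print(passes_filters(1040, 30))  # False: 1040 = 1024 + 16, marked as duplicate
```

Under these assumptions, the defaults drop unmapped reads, reads that failed quality control and duplicates, while keeping everything else regardless of mapping quality.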

All the reader functions also take the flag as_df; when it is True, the function returns a pandas DataFrame instead of a PyRanges object.

Using from_dict

f1 = pr.data.f1()
d = f1.to_example(n=10)
print(d)
## {'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [3, 8, 5], 'End': [6, 9, 7], 'Name': ['interval1', 'interval3', 'interval2'], 'Score': [0, 0, 0], 'Strand': ['+', '+', '-']}
print(pr.from_dict(d))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   |     Start |       End | Name       |     Score | Strand       |
## | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         |         3 |         6 | interval1  |         0 | +            |
## | chr1         |         8 |         9 | interval3  |         0 | +            |
## | chr1         |         5 |         7 | interval2  |         0 | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
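
The dict accepted by from_dict is column-oriented: each key is a column name and each value holds one entry per interval. The sketch below, in plain Python with no pyranges calls, checks that all columns have the same length and transposes the dict printed above into one tuple per interval. The equal-length check is an assumption about what a well-formed input looks like, not documented behaviour of from_dict.

```python
# The dict printed by to_example in the example above.
d = {
    "Chromosome": ["chr1", "chr1", "chr1"],
    "Start": [3, 8, 5],
    "End": [6, 9, 7],
    "Name": ["interval1", "interval3", "interval2"],
    "Score": [0, 0, 0],
    "Strand": ["+", "+", "-"],
}

# Every column should describe the same number of intervals.
lengths = {len(values) for values in d.values()}
assert len(lengths) == 1, "all columns must have the same length"

# Transpose columns into rows: one tuple per interval, in key order.
rows = list(zip(*d.values()))
for row in rows:
    print(row)
# ('chr1', 3, 6, 'interval1', 0, '+')
# ('chr1', 8, 9, 'interval3', 0, '+')
# ('chr1', 5, 7, 'interval2', 0, '-')
```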