2 Loading/Creating PyRanges
A PyRanges object can be built in four ways:
- from a Pandas dataframe
- using the PyRanges constructor with the chromosomes, starts and ends (and optionally strands) given individually
- using one of the custom reader functions for genomic data (read_bed, read_bam, read_gtf or read_gff3)
- from a dict (like the ones produced with to_example)
Using a DataFrame
If you instantiate a PyRanges object from a dataframe, the dataframe should at least contain the columns Chromosome, Start and End. A column called Strand is optional. Any other columns in the dataframe are treated as metadata.
import pandas as pd
import pyranges as pr
chipseq = pr.get_example_path("chipseq.bed")
df = pd.read_csv(chipseq, header=None, names="Chromosome Start End Name Score Strand".split(), sep="\t")
print(df.head(2))
## Chromosome Start End Name Score Strand
## 0 chr8 28510032 28510057 U0 0 -
## 1 chr7 107153363 107153388 U0 0 -
print(df.tail(2))
## Chromosome Start End Name Score Strand
## 9998 chr1 194245558 194245583 U0 0 +
## 9999 chr8 57916061 57916086 U0 0 +
print(pr.PyRanges(df))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 212609534 | 212609559 | U0 | 0 | + |
## | chr1 | 169887529 | 169887554 | U0 | 0 | + |
## | chr1 | 216711011 | 216711036 | U0 | 0 | + |
## | chr1 | 144227079 | 144227104 | U0 | 0 | + |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | U0 | 0 | - |
## | chrY | 13517892 | 13517917 | U0 | 0 | - |
## | chrY | 8010951 | 8010976 | U0 | 0 | - |
## | chrY | 7405376 | 7405401 | U0 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
Using constructor keywords
Another way to instantiate a PyRanges object is to use the constructor with keywords:
gr = pr.PyRanges(chromosomes=df.Chromosome, starts=df.Start, ends=df.End)
print(gr)
## +--------------+-----------+-----------+
## | Chromosome | Start | End |
## | (category) | (int32) | (int32) |
## |--------------+-----------+-----------|
## | chr1 | 100079649 | 100079674 |
## | chr1 | 212609534 | 212609559 |
## | chr1 | 223587418 | 223587443 |
## | chr1 | 202450161 | 202450186 |
## | ... | ... | ... |
## | chrY | 11942770 | 11942795 |
## | chrY | 8316773 | 8316798 |
## | chrY | 7463444 | 7463469 |
## | chrY | 7405376 | 7405401 |
## +--------------+-----------+-----------+
## Unstranded PyRanges object has 10,000 rows and 3 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome.
It is possible to make PyRanges objects out of basic Python datatypes:
gr = pr.PyRanges(chromosomes="chr1", strands="+", starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)
## +--------------+-----------+-----------+--------------+
## | Chromosome | Start | End | Strand |
## | (category) | (int32) | (int32) | (category) |
## |--------------+-----------+-----------+--------------|
## | chr1 | 0 | 3 | + |
## | chr1 | 1 | 4 | + |
## | chr1 | 2 | 5 | + |
## +--------------+-----------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 4 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
gr = pr.PyRanges(chromosomes="chr1 chr2 chr3".split(), strands="+ - +".split(), starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)
## +--------------+-----------+-----------+--------------+
## | Chromosome | Start | End | Strand |
## | (category) | (int32) | (int32) | (category) |
## |--------------+-----------+-----------+--------------|
## | chr1 | 0 | 3 | + |
## | chr2 | 1 | 4 | - |
## | chr3 | 2 | 5 | + |
## +--------------+-----------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 4 columns from 3 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
Using read_bed, read_gtf, read_gff3 or read_bam
The pyranges library can create PyRanges from common file formats, namely gtf/gff, gff3, bed and bam.
ensembl_path = pr.get_example_path("ensembl.gtf")
gr = pr.read_gtf(ensembl_path)
print(gr)
## +--------------+------------+--------------+-----------+-----------+-------+
## | Chromosome | Source | Feature | Start | End | +21 |
## | (category) | (object) | (category) | (int32) | (int32) | ... |
## |--------------+------------+--------------+-----------+-----------+-------|
## | 1 | havana | gene | 11868 | 14409 | ... |
## | 1 | havana | transcript | 11868 | 14409 | ... |
## | 1 | havana | exon | 11868 | 12227 | ... |
## | 1 | havana | exon | 12612 | 12721 | ... |
## | ... | ... | ... | ... | ... | ... |
## | 1 | ensembl | transcript | 120724 | 133723 | ... |
## | 1 | ensembl | exon | 133373 | 133723 | ... |
## | 1 | ensembl | exon | 129054 | 129223 | ... |
## | 1 | ensembl | exon | 120873 | 120932 | ... |
## +--------------+------------+--------------+-----------+-----------+-------+
## Stranded PyRanges object has 95 rows and 26 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 21 hidden columns: Score, Strand, Frame, gene_id, gene_version, gene_name, gene_source, ... (+ 14 more.)
To read bam files the optional bamread-library must be installed. Use conda install -c bioconda bamread or pip install bamread to install it.
read_bam takes the arguments sparse, mapq, required_flag and filter_flag, which have the default values True, 0, 0 and 1540, respectively. With sparse set to True, only the columns ['Chromosome', 'Start', 'End', 'Strand', 'Flag'] are fetched. Setting sparse to False additionally gives you the columns ['QueryStart', 'QueryEnd', 'Name', 'Cigar', 'Quality'], but is more time- and memory-consuming.
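The default filter_flag of 1540 is easier to read once it is split into its SAM flag bits. The sketch below uses only the Python standard library and the bit definitions from the SAM format specification; the helper names decode_flag and passes are made up for illustration and are not part of pyranges or bamread. It also assumes that required_flag and filter_flag follow the samtools -f / -F convention (a read is kept if it has all required bits and none of the filtered bits), which is not stated in this text.

```python
# Bit values and meanings as defined in the SAM format specification.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the names of all SAM flag bits that are set in `flag`."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def passes(read_flag, required_flag=0, filter_flag=1540):
    """Hypothetical helper: would a read with `read_flag` be kept?

    Assumes samtools-style semantics: every bit in required_flag must be
    set, and no bit in filter_flag may be set.
    """
    has_required = (read_flag & required_flag) == required_flag
    has_filtered = (read_flag & filter_flag) != 0
    return has_required and not has_filtered


print(decode_flag(1540))
# ['read unmapped', 'read fails quality checks', 'PCR or optical duplicate']
print(passes(16))    # mapped read on the reverse strand: True
print(passes(1040))  # same read, but marked as a duplicate: False
```

Under these assumptions, the defaults drop unmapped reads, reads that fail quality checks and duplicates, while keeping everything else regardless of pairing or strand.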
All the reader functions also take the flag as_df.
Using from_dict
f1 = pr.data.f1()
d = f1.to_example(n=10)
print(d)
## {'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [3, 8, 5], 'End': [6, 9, 7], 'Name': ['interval1', 'interval3', 'interval2'], 'Score': [0, 0, 0], 'Strand': ['+', '+', '-']}
print(pr.from_dict(d))
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 3 | 6 | interval1 | 0 | + |
## | chr1 | 8 | 9 | interval3 | 0 | + |
## | chr1 | 5 | 7 | interval2 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.