4 Manipulating the data in PyRanges
PyRanges is a thin wrapper around genomic data held in pandas dataframes. The underlying data is accessible as a dataframe through the df attribute of the PyRanges object.
import pyranges as pr
gr = pr.load_dataset("chipseq")
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int64) | (int64) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr8 | 28510032 | 28510057 | U0 | 0 | - |
## | chr7 | 107153363 | 107153388 | U0 | 0 | - |
## | chr5 | 135821802 | 135821827 | U0 | 0 | - |
## | ... | ... | ... | ... | ... | ... |
## | chr6 | 89296757 | 89296782 | U0 | 0 | - |
## | chr1 | 194245558 | 194245583 | U0 | 0 | + |
## | chr8 | 57916061 | 57916086 | U0 | 0 | + |
## +--------------+-----------+-----------+------------+-----------+--------------+
## PyRanges object has 10000 sequences from 24 chromosomes.
print(gr.df.head(5))
## Chromosome Start End Name Score Strand
## 0 chr8 28510032 28510057 U0 0 -
## 1 chr7 107153363 107153388 U0 0 -
## 2 chr5 135821802 135821827 U0 0 -
## 3 chr14 19418999 19419024 U0 0 -
## 4 chr12 106679761 106679786 U0 0 -
To access a column of this dataframe, use the column name as an attribute of the PyRanges object.
print(gr.Start.head())
## 0 28510032
## 1 107153363
## 2 135821802
## 3 19418999
## 4 106679761
## Name: Start, dtype: int64
You can insert a column directly by setting an attribute with the new column's name on the PyRanges object:
gr.stupid_example = "Hi There!"
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+------------------+
## | Chromosome | Start | End | Name | Score | Strand | stupid_example |
## | (category) | (int64) | (int64) | (object) | (int64) | (category) | (object) |
## |--------------+-----------+-----------+------------+-----------+--------------+------------------|
## | chr8 | 28510032 | 28510057 | U0 | 0 | - | Hi There! |
## | chr7 | 107153363 | 107153388 | U0 | 0 | - | Hi There! |
## | chr5 | 135821802 | 135821827 | U0 | 0 | - | Hi There! |
## | ... | ... | ... | ... | ... | ... | ... |
## | chr6 | 89296757 | 89296782 | U0 | 0 | - | Hi There! |
## | chr1 | 194245558 | 194245583 | U0 | 0 | + | Hi There! |
## | chr8 | 57916061 | 57916086 | U0 | 0 | + | Hi There! |
## +--------------+-----------+-----------+------------+-----------+--------------+------------------+
## PyRanges object has 10000 sequences from 24 chromosomes.
To remove the column again, drop it from the underlying dataframe:
gr.df.drop("stupid_example", axis=1, inplace=True)
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int64) | (int64) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr8 | 28510032 | 28510057 | U0 | 0 | - |
## | chr7 | 107153363 | 107153388 | U0 | 0 | - |
## | chr5 | 135821802 | 135821827 | U0 | 0 | - |
## | ... | ... | ... | ... | ... | ... |
## | chr6 | 89296757 | 89296782 | U0 | 0 | - |
## | chr1 | 194245558 | 194245583 | U0 | 0 | + |
## | chr8 | 57916061 | 57916086 | U0 | 0 | + |
## +--------------+-----------+-----------+------------+-----------+--------------+
## PyRanges object has 10000 sequences from 24 chromosomes.
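One way to picture the attribute behaviour shown above is a wrapper class that forwards attribute reads and writes to a table of columns. The following is a toy sketch in plain Python and is not PyRanges' actual implementation; the class name ColumnWrapper, its internal _columns dictionary, and the sample values are all made up for illustration.

```python
class ColumnWrapper:
    """Toy wrapper that exposes the columns of a table as attributes."""

    def __init__(self, columns):
        # Bypass our own __setattr__ so "_columns" is stored on the instance
        # rather than being treated as a new column.
        object.__setattr__(
            self, "_columns", {key: list(values) for key, values in columns.items()}
        )

    def __len__(self):
        # Number of rows, taken from the first column (0 for an empty table).
        return len(next(iter(self._columns.values()), []))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so this is where
        # column names are resolved.
        columns = object.__getattribute__(self, "_columns")
        try:
            return columns[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # A list must match the row count; any other value is repeated
        # for every row, much like the string in the example above.
        if isinstance(value, list):
            if len(value) != len(self):
                raise ValueError("column length does not match number of rows")
            self._columns[name] = value
        else:
            self._columns[name] = [value] * len(self)


# Hypothetical two-row table with invented coordinates.
table = ColumnWrapper({"Start": [100, 500], "End": [125, 525]})
table.note = "Hi There!"  # the scalar becomes a full column
print(table.Start)
print(table.note)
```

The point of the sketch is only that reading an attribute returns the column of that name, and assigning to an attribute creates or replaces a column.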
All columns except Chromosome, Start, End and Strand can be changed in any way you please, and more metadata columns can be added by setting them on the PyRanges object. If you wish to change the Chromosome, Start, End or Strand columns, you should make a copy of the data from the PyRanges object and use it to instantiate a new PyRanges object.
import pandas as pd
gr.Name = gr.Chromosome.astype(str) + "_" + pd.Series(range(len(gr))).astype(str)
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int64) | (int64) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr8 | 28510032 | 28510057 | chr8_0 | 0 | - |
## | chr7 | 107153363 | 107153388 | chr7_1 | 0 | - |
## | chr5 | 135821802 | 135821827 | chr5_2 | 0 | - |
## | ... | ... | ... | ... | ... | ... |
## | chr6 | 89296757 | 89296782 | chr6_9997 | 0 | - |
## | chr1 | 194245558 | 194245583 | chr1_9998 | 0 | + |
## | chr8 | 57916061 | 57916086 | chr8_9999 | 0 | + |
## +--------------+-----------+-----------+------------+-----------+--------------+
## PyRanges object has 10000 sequences from 24 chromosomes.
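The example above covers a metadata column. For the other case, changing a coordinate column by way of a copy, a minimal sketch might look like the following. It uses only pandas and a small hand-made dataframe as a stand-in for gr.df; the values are invented for illustration, and the final step of passing the dataframe to the pr.PyRanges constructor is shown only as a comment and is assumed from the text's "instantiate a new PyRanges object".

```python
import pandas as pd

# Hypothetical stand-in for gr.df; the values are invented for illustration.
original = pd.DataFrame(
    {
        "Chromosome": ["chr1", "chr1", "chr2"],
        "Start": [100, 500, 300],
        "End": [125, 525, 325],
        "Strand": ["+", "-", "+"],
    }
)

# Work on a copy so the data behind the existing object is left untouched.
shifted = original.copy()
shifted["Start"] = shifted["Start"] + 10
shifted["End"] = shifted["End"] + 10

print(shifted)

# With PyRanges installed, the modified copy would then be used to build
# a new object, for example: new_gr = pr.PyRanges(shifted)
```

Building a new object from the modified copy, rather than editing the coordinates in place, follows the recommendation in the paragraph above.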