4 Manipulating the data in PyRanges

A PyRanges object is a thin wrapper around genomic data stored in a pandas dataframe. The dataframe is accessible through the df attribute of the PyRanges object.

import pyranges as pr
gr = pr.load_dataset("chipseq")
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr8         | 28510032  | 28510057  | U0         | 0         | -            |
## | chr7         | 107153363 | 107153388 | U0         | 0         | -            |
## | chr5         | 135821802 | 135821827 | U0         | 0         | -            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chr6         | 89296757  | 89296782  | U0         | 0         | -            |
## | chr1         | 194245558 | 194245583 | U0         | 0         | +            |
## | chr8         | 57916061  | 57916086  | U0         | 0         | +            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## PyRanges object has 10000 sequences from 24 chromosomes.
print(gr.df.head(5))
##   Chromosome      Start        End Name  Score Strand
## 0       chr8   28510032   28510057   U0      0      -
## 1       chr7  107153363  107153388   U0      0      -
## 2       chr5  135821802  135821827   U0      0      -
## 3      chr14   19418999   19419024   U0      0      -
## 4      chr12  106679761  106679786   U0      0      -

To access a column of this dataframe, you can request it by name as an attribute of the PyRanges object.

print(gr.Start.head())
## 0     28510032
## 1    107153363
## 2    135821802
## 3     19418999
## 4    106679761
## Name: Start, dtype: int64

You can directly insert a column by setting the attribute on the PyRanges object:

gr.stupid_example = "Hi There!"
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+------------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       | stupid_example   |
## | (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   | (object)         |
## |--------------+-----------+-----------+------------+-----------+--------------+------------------|
## | chr8         | 28510032  | 28510057  | U0         | 0         | -            | Hi There!        |
## | chr7         | 107153363 | 107153388 | U0         | 0         | -            | Hi There!        |
## | chr5         | 135821802 | 135821827 | U0         | 0         | -            | Hi There!        |
## | ...          | ...       | ...       | ...        | ...       | ...          | ...              |
## | chr6         | 89296757  | 89296782  | U0         | 0         | -            | Hi There!        |
## | chr1         | 194245558 | 194245583 | U0         | 0         | +            | Hi There!        |
## | chr8         | 57916061  | 57916086  | U0         | 0         | +            | Hi There!        |
## +--------------+-----------+-----------+------------+-----------+--------------+------------------+
## PyRanges object has 10000 sequences from 24 chromosomes.
# Remove the example column again by dropping it from the underlying dataframe
gr.df.drop("stupid_example", axis=1, inplace=True)
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr8         | 28510032  | 28510057  | U0         | 0         | -            |
## | chr7         | 107153363 | 107153388 | U0         | 0         | -            |
## | chr5         | 135821802 | 135821827 | U0         | 0         | -            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chr6         | 89296757  | 89296782  | U0         | 0         | -            |
## | chr1         | 194245558 | 194245583 | U0         | 0         | +            |
## | chr8         | 57916061  | 57916086  | U0         | 0         | +            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## PyRanges object has 10000 sequences from 24 chromosomes.

All columns except Chromosome, Start, End and Strand can be changed in any way you please, and more metadata columns can be added by setting them on the PyRanges object. If you wish to change the Chromosome, Start, End or Strand columns, you should make a copy of the data from the PyRanges object and use it to instantiate a new PyRanges object.

import pandas as pd
# Give every interval a unique name: its chromosome plus its running row number
gr.Name = gr.Chromosome.astype(str) + "_" + pd.Series(range(len(gr))).astype(str)
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr8         | 28510032  | 28510057  | chr8_0     | 0         | -            |
## | chr7         | 107153363 | 107153388 | chr7_1     | 0         | -            |
## | chr5         | 135821802 | 135821827 | chr5_2     | 0         | -            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chr6         | 89296757  | 89296782  | chr6_9997  | 0         | -            |
## | chr1         | 194245558 | 194245583 | chr1_9998  | 0         | +            |
## | chr8         | 57916061  | 57916086  | chr8_9999  | 0         | +            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## PyRanges object has 10000 sequences from 24 chromosomes.
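
The advice above about changing the Chromosome, Start, End and Strand columns can be sketched as follows. This is a minimal illustration under stated assumptions, not an excerpt from PyRanges itself: the small dataframe stands in for a copy of gr.df, the helper shift_intervals and the offset of 100 are hypothetical, and the only PyRanges call relied on is the pr.PyRanges(df) constructor. The pyranges import is guarded so that the pandas part still runs where the library is not installed.

```python
import pandas as pd

# Stand-in for a copy of gr.df (assumption: in real use you would write
# df = gr.df.copy() instead of building the frame by hand).
df = pd.DataFrame({
    "Chromosome": ["chr8", "chr7", "chr5"],
    "Start": [28510032, 107153363, 135821802],
    "End": [28510057, 107153388, 135821827],
    "Name": ["U0", "U0", "U0"],
    "Score": [0, 0, 0],
    "Strand": ["-", "-", "-"],
})


def shift_intervals(frame, offset):
    """Hypothetical helper: return a copy with Start and End moved by offset."""
    shifted = frame.copy()  # work on a copy, never on the original data
    shifted["Start"] = shifted["Start"] + offset
    shifted["End"] = shifted["End"] + offset
    return shifted


# Change the core columns on the plain dataframe copy.
new_df = shift_intervals(df, 100)

try:
    import pyranges as pr

    # Instantiate a fresh PyRanges object from the modified copy.
    new_gr = pr.PyRanges(new_df)
    print(new_gr)
except ImportError:
    # pyranges is not available here; show the modified dataframe instead.
    print(new_df)
```

Editing a copy and building a new object from it leaves the original data untouched and lets the new PyRanges object be constructed from the updated coordinates.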