5 Manipulating the data in PyRanges
A PyRanges object is a thin wrapper around genomic data stored in pandas dataframes. The data is accessible as a single dataframe through the df attribute of the PyRanges object.
import pyranges as pr
gr = pr.data.chipseq()
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 212609534 | 212609559 | U0 | 0 | + |
## | chr1 | 169887529 | 169887554 | U0 | 0 | + |
## | chr1 | 216711011 | 216711036 | U0 | 0 | + |
## | chr1 | 144227079 | 144227104 | U0 | 0 | + |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | U0 | 0 | - |
## | chrY | 13517892 | 13517917 | U0 | 0 | - |
## | chrY | 8010951 | 8010976 | U0 | 0 | - |
## | chrY | 7405376 | 7405401 | U0 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
print(gr.df.head(5))
## Chromosome Start End Name Score Strand
## 0 chr1 212609534 212609559 U0 0 +
## 1 chr1 169887529 169887554 U0 0 +
## 2 chr1 216711011 216711036 U0 0 +
## 3 chr1 144227079 144227104 U0 0 +
## 4 chr1 148177825 148177850 U0 0 +
To access a column of this dataframe, use the column name as an attribute of the PyRanges object.
print(gr.Start.head())
## 18 212609534
## 70 169887529
## 129 216711011
## 170 144227079
## 196 148177825
## Name: Start, dtype: int32
You can directly insert a column by setting the attribute on the PyRanges object:
gr.stupid_example = "Hi There!"
print(gr)
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome | Start | End | Name | Score | +2 |
## | (category) | (int32) | (int32) | (object) | (int64) | ... |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1 | 212609534 | 212609559 | U0 | 0 | ... |
## | chr1 | 169887529 | 169887554 | U0 | 0 | ... |
## | chr1 | 216711011 | 216711036 | U0 | 0 | ... |
## | chr1 | 144227079 | 144227104 | U0 | 0 | ... |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | U0 | 0 | ... |
## | chrY | 13517892 | 13517917 | U0 | 0 | ... |
## | chrY | 8010951 | 8010976 | U0 | 0 | ... |
## | chrY | 7405376 | 7405401 | U0 | 0 | ... |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 10,000 rows and 7 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 2 hidden columns: Strand, stupid_example
gr = gr.drop("stupid_example")
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 212609534 | 212609559 | U0 | 0 | + |
## | chr1 | 169887529 | 169887554 | U0 | 0 | + |
## | chr1 | 216711011 | 216711036 | U0 | 0 | + |
## | chr1 | 144227079 | 144227104 | U0 | 0 | + |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | U0 | 0 | - |
## | chrY | 13517892 | 13517917 | U0 | 0 | - |
## | chrY | 8010951 | 8010976 | U0 | 0 | - |
## | chrY | 7405376 | 7405401 | U0 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
As you can see, you can drop columns from the PyRanges using drop. Calling drop with no arguments removes all metadata columns.
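To make the two uses of drop concrete, here is a plain-pandas sketch of the same idea on a toy table. It is an analogy rather than PyRanges' own implementation: the toy data, the `CORE` list, and the helper `drop_columns` are hypothetical, and the sketch assumes that "metadata" means every column other than Chromosome, Start, End and Strand.

```python
import pandas as pd

# Toy table (hypothetical data) with two metadata columns, Name and Score.
toy = pd.DataFrame({
    "Chromosome": ["chr1", "chr2"],
    "Start": [100, 200],
    "End": [150, 250],
    "Name": ["a", "b"],
    "Score": [0, 7],
    "Strand": ["+", "-"],
})

# Assumption: these position columns are the ones that are never dropped.
CORE = ["Chromosome", "Start", "End", "Strand"]

def drop_columns(df, names=None):
    """Drop the named columns, or every non-core column when names is None."""
    if names is None:
        names = [c for c in df.columns if c not in CORE]
    elif isinstance(names, str):
        names = [names]
    return df.drop(columns=names)

only_score_dropped = drop_columns(toy, "Score")  # remove one named column
bare = drop_columns(toy)                         # remove all metadata columns

print(list(only_score_dropped.columns))
print(list(bare.columns))
```

Both calls return a new table and leave `toy` untouched, which mirrors the reassignment of the result of drop back to gr shown earlier.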
To insert the columns of a dataframe into a PyRanges object, use insert. It returns a new PyRanges and leaves the original unchanged, and it accepts an optional insertion index loc:
import pandas as pd
df = pd.DataFrame({"A1": [1, 2] * 5000, "A2": [3, 4, 5, 6] * 2500})
print(df.head())
## A1 A2
## 0 1 3
## 1 2 4
## 2 1 5
## 3 2 6
## 4 1 3
print(gr.insert(df))
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome | Start | End | Name | Score | +3 |
## | (category) | (int32) | (int32) | (object) | (int64) | ... |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1 | 212609534 | 212609559 | U0 | 0 | ... |
## | chr1 | 169887529 | 169887554 | U0 | 0 | ... |
## | chr1 | 216711011 | 216711036 | U0 | 0 | ... |
## | chr1 | 144227079 | 144227104 | U0 | 0 | ... |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | U0 | 0 | ... |
## | chrY | 13517892 | 13517917 | U0 | 0 | ... |
## | chrY | 8010951 | 8010976 | U0 | 0 | ... |
## | chrY | 7405376 | 7405401 | U0 | 0 | ... |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 10,000 rows and 8 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 3 hidden columns: Strand, A1, A2
print(gr.insert(df, loc=3))
## +--------------+-----------+-----------+-----------+-----------+-------+
## | Chromosome | Start | End | A1 | A2 | +3 |
## | (category) | (int32) | (int32) | (int64) | (int64) | ... |
## |--------------+-----------+-----------+-----------+-----------+-------|
## | chr1 | 212609534 | 212609559 | 1 | 3 | ... |
## | chr1 | 169887529 | 169887554 | 2 | 4 | ... |
## | chr1 | 216711011 | 216711036 | 1 | 5 | ... |
## | chr1 | 144227079 | 144227104 | 2 | 6 | ... |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | 1 | 3 | ... |
## | chrY | 13517892 | 13517917 | 2 | 4 | ... |
## | chrY | 8010951 | 8010976 | 1 | 5 | ... |
## | chrY | 7405376 | 7405401 | 2 | 6 | ... |
## +--------------+-----------+-----------+-----------+-----------+-------+
## Stranded PyRanges object has 10,000 rows and 8 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 3 hidden columns: Name, Score, Strand
To rename the columns you can use the set_columns() method.
If you want to remove duplicates based on position, you can use drop_duplicate_positions:
print(gr.drop_duplicate_positions(strand=False)) # strand defaults to True
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 212609534 | 212609559 | U0 | 0 | + |
## | chr1 | 169887529 | 169887554 | U0 | 0 | + |
## | chr1 | 216711011 | 216711036 | U0 | 0 | + |
## | chr1 | 144227079 | 144227104 | U0 | 0 | + |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | U0 | 0 | - |
## | chrY | 13517892 | 13517917 | U0 | 0 | - |
## | chrY | 8010951 | 8010976 | U0 | 0 | - |
## | chrY | 7405376 | 7405401 | U0 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 9,924 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
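The effect of the strand argument can be illustrated with a plain-pandas sketch on a toy table. This is an analogy, not PyRanges' implementation: the toy data and the helper `dedup_positions` are hypothetical, and keeping the first row of each duplicate group is an assumption made for the sketch.

```python
import pandas as pd

# Toy intervals (hypothetical data): rows 0 and 1 share a position but differ
# in strand; rows 2 and 3 share both position and strand.
toy = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr1"],
    "Start": [10, 10, 50, 50],
    "End": [20, 20, 60, 60],
    "Strand": ["+", "-", "+", "+"],
})

def dedup_positions(df, strand=True):
    """Keep the first row per position; optionally count Strand as part of the position."""
    keys = ["Chromosome", "Start", "End"] + (["Strand"] if strand else [])
    return df.drop_duplicates(subset=keys, keep="first")

stranded = dedup_positions(toy, strand=True)     # "+" and "-" copies both survive
unstranded = dedup_positions(toy, strand=False)  # one row per Chromosome/Start/End

print(len(stranded), len(unstranded))
```

With strand=False, intervals at the same coordinates on opposite strands collapse into one row, so the unstranded result is never larger than the stranded one.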
All columns, including Chromosome, Start, End and Strand, can be changed in any way you please, and more metadata columns can be added by setting them on the PyRanges object.
import pandas as pd
gr.Name = gr.Chromosome.astype(str) + "_" + pd.Series(range(len(gr)), index=gr.Chromosome.index).astype(str)
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 212609534 | 212609559 | chr1_0 | 0 | + |
## | chr1 | 169887529 | 169887554 | chr1_1 | 0 | + |
## | chr1 | 216711011 | 216711036 | chr1_2 | 0 | + |
## | chr1 | 144227079 | 144227104 | chr1_3 | 0 | + |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | chrY_9996 | 0 | - |
## | chrY | 13517892 | 13517917 | chrY_9997 | 0 | - |
## | chrY | 8010951 | 8010976 | chrY_9998 | 0 | - |
## | chrY | 7405376 | 7405401 | chrY_9999 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
gr.Strand = "."
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 212609534 | 212609559 | chr1_0 | 0 | . |
## | chr1 | 169887529 | 169887554 | chr1_1 | 0 | . |
## | chr1 | 216711011 | 216711036 | chr1_2 | 0 | . |
## | chr1 | 144227079 | 144227104 | chr1_3 | 0 | . |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | chrY_9996 | 0 | . |
## | chrY | 13517892 | 13517917 | chrY_9997 | 0 | . |
## | chrY | 8010951 | 8010976 | chrY_9998 | 0 | . |
## | chrY | 7405376 | 7405401 | chrY_9999 | 0 | . |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Unstranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome.
## Considered unstranded due to these Strand values: '.'
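The final message suggests that a PyRanges counts as stranded only when every Strand value is "+" or "-". The plain-pandas sketch below spells out that rule on a toy table; the helper `looks_stranded` and the toy data are hypothetical, and the rule itself is an assumption inferred from the printed output rather than taken from the library's source.

```python
import pandas as pd

def looks_stranded(df):
    """Hypothetical check: stranded only if Strand exists and holds just '+' or '-'."""
    if "Strand" not in df.columns:
        return False
    return set(df["Strand"].unique()) <= {"+", "-"}

# Toy table (hypothetical data) with valid strand values.
toy = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chrY"],
    "Start": [5, 40, 7],
    "End": [15, 60, 30],
    "Strand": ["+", "-", "-"],
})

before = looks_stranded(toy)

# Assigning a scalar broadcasts it to every row, as in the example above.
toy["Strand"] = "."
after = looks_stranded(toy)

print(before, after)
```

Under this rule, a single value outside "+" and "-" is enough to make the whole table unstranded.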