5 Manipulating the data in PyRanges

PyRanges is a thin wrapper around genomic data stored in pandas dataframes. The data is accessible as a dataframe through the df attribute of the PyRanges object.

import pyranges as pr
gr = pr.data.chipseq()
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | 0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | 0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | U0         | 0         | -            |
## | chrY         | 13517892  | 13517917  | U0         | 0         | -            |
## | chrY         | 8010951   | 8010976   | U0         | 0         | -            |
## | chrY         | 7405376   | 7405401   | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
print(gr.df.head(5))
##   Chromosome      Start        End Name  Score Strand
## 0       chr1  212609534  212609559   U0      0      +
## 1       chr1  169887529  169887554   U0      0      +
## 2       chr1  216711011  216711036   U0      0      +
## 3       chr1  144227079  144227104   U0      0      +
## 4       chr1  148177825  148177850   U0      0      +

To access a column of this dataframe, use the column name as an attribute of the PyRanges object.

print(gr.Start.head())
## 18     212609534
## 70     169887529
## 129    216711011
## 170    144227079
## 196    148177825
## Name: Start, dtype: int32

You can directly insert a column by setting the attribute on the PyRanges object:

gr.stupid_example = "Hi There!"
print(gr)
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome   | Start     | End       | Name       | Score     | +2    |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | ...   |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | ...   |
## | chr1         | 169887529 | 169887554 | U0         | 0         | ...   |
## | chr1         | 216711011 | 216711036 | U0         | 0         | ...   |
## | chr1         | 144227079 | 144227104 | U0         | 0         | ...   |
## | ...          | ...       | ...       | ...        | ...       | ...   |
## | chrY         | 15224235  | 15224260  | U0         | 0         | ...   |
## | chrY         | 13517892  | 13517917  | U0         | 0         | ...   |
## | chrY         | 8010951   | 8010976   | U0         | 0         | ...   |
## | chrY         | 7405376   | 7405401   | U0         | 0         | ...   |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 10,000 rows and 7 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 2 hidden columns: Strand, stupid_example
gr = gr.drop("stupid_example")
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | 0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | 0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | U0         | 0         | -            |
## | chrY         | 13517892  | 13517917  | U0         | 0         | -            |
## | chrY         | 8010951   | 8010976   | U0         | 0         | -            |
## | chrY         | 7405376   | 7405401   | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.

As the example above shows, you can remove columns from a PyRanges with drop. Calling drop without arguments removes all metadata columns.
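
To make the two ways of calling drop concrete, here is a minimal pandas-only sketch. It is not the PyRanges implementation. The helper drop_columns and the toy data are hypothetical, and it assumes that "metadata columns" means every column other than Chromosome, Start, End and Strand.

```python
import pandas as pd

# Toy stand-in for gr.df; the values are made up for illustration.
df = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chrY"],
    "Start": [10, 50, 5],
    "End": [35, 75, 30],
    "Name": ["U0", "U0", "U0"],
    "Score": [0, 0, 0],
    "Strand": ["+", "+", "-"],
})

# Assumption: these columns define a range; everything else is metadata.
CORE_COLUMNS = ["Chromosome", "Start", "End", "Strand"]


def drop_columns(frame, drop=None):
    """Drop the named column(s), or all metadata columns when drop is None."""
    if drop is None:
        drop = [c for c in frame.columns if c not in CORE_COLUMNS]
    elif isinstance(drop, str):
        drop = [drop]
    return frame.drop(columns=drop)


no_score = drop_columns(df, "Score")  # remove one named column
core_only = drop_columns(df)          # remove every metadata column
print(list(no_score.columns))
print(list(core_only.columns))
```

Returning a new frame instead of modifying the input mirrors the gr = gr.drop(...) pattern used above.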

To insert the columns of a dataframe into a PyRanges object, you can use insert. It accepts an optional insertion index, loc:

import pandas as pd
df = pd.DataFrame({"A1": [1, 2] * 5000, "A2": [3, 4, 5, 6] * 2500})
print(df.head())
##    A1  A2
## 0   1   3
## 1   2   4
## 2   1   5
## 3   2   6
## 4   1   3
print(gr.insert(df))
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome   | Start     | End       | Name       | Score     | +3    |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | ...   |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | ...   |
## | chr1         | 169887529 | 169887554 | U0         | 0         | ...   |
## | chr1         | 216711011 | 216711036 | U0         | 0         | ...   |
## | chr1         | 144227079 | 144227104 | U0         | 0         | ...   |
## | ...          | ...       | ...       | ...        | ...       | ...   |
## | chrY         | 15224235  | 15224260  | U0         | 0         | ...   |
## | chrY         | 13517892  | 13517917  | U0         | 0         | ...   |
## | chrY         | 8010951   | 8010976   | U0         | 0         | ...   |
## | chrY         | 7405376   | 7405401   | U0         | 0         | ...   |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 10,000 rows and 8 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 3 hidden columns: Strand, A1, A2
print(gr.insert(df, loc=3))
## +--------------+-----------+-----------+-----------+-----------+-------+
## | Chromosome   | Start     | End       | A1        | A2        | +3    |
## | (category)   | (int32)   | (int32)   | (int64)   | (int64)   | ...   |
## |--------------+-----------+-----------+-----------+-----------+-------|
## | chr1         | 212609534 | 212609559 | 1         | 3         | ...   |
## | chr1         | 169887529 | 169887554 | 2         | 4         | ...   |
## | chr1         | 216711011 | 216711036 | 1         | 5         | ...   |
## | chr1         | 144227079 | 144227104 | 2         | 6         | ...   |
## | ...          | ...       | ...       | ...       | ...       | ...   |
## | chrY         | 15224235  | 15224260  | 1         | 3         | ...   |
## | chrY         | 13517892  | 13517917  | 2         | 4         | ...   |
## | chrY         | 8010951   | 8010976   | 1         | 5         | ...   |
## | chrY         | 7405376   | 7405401   | 2         | 6         | ...   |
## +--------------+-----------+-----------+-----------+-----------+-------+
## Stranded PyRanges object has 10,000 rows and 8 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 3 hidden columns: Name, Score, Strand
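
The effect of loc can be reproduced on a plain dataframe with pandas' DataFrame.insert. The sketch below only illustrates the column ordering and is not how PyRanges does it internally. The helper insert_columns is hypothetical, and it assumes rows are matched by position rather than by index.

```python
import pandas as pd


def insert_columns(frame, other, loc=None):
    """Return a copy of frame with the columns of other placed at position loc."""
    if loc is None:
        loc = len(frame.columns)  # default: append after the last column
    result = frame.copy()
    for offset, name in enumerate(other.columns):
        # DataFrame.insert works in place and adds one column at a time.
        # to_numpy() drops the index, so rows are matched by position (assumption).
        result.insert(loc + offset, name, other[name].to_numpy())
    return result


# Toy data, made up for illustration.
ranges = pd.DataFrame({
    "Chromosome": ["chr1", "chr1"],
    "Start": [10, 50],
    "End": [35, 75],
    "Name": ["U0", "U0"],
})
extra = pd.DataFrame({"A1": [1, 2], "A2": [3, 4]})

appended = insert_columns(ranges, extra)
at_three = insert_columns(ranges, extra, loc=3)
print(list(appended.columns))
print(list(at_three.columns))
```

With loc=3 the new columns land directly after End, which is the same ordering as in the printed PyRanges above.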

To rename the columns you can use the set_columns() method.

If you want to remove duplicates based on position, you can use drop_duplicate_positions:

print(gr.drop_duplicate_positions(strand=False)) # defaults to True
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | 0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | 0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | U0         | 0         | -            |
## | chrY         | 13517892  | 13517917  | U0         | 0         | -            |
## | chrY         | 8010951   | 8010976   | U0         | 0         | -            |
## | chrY         | 7405376   | 7405401   | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 9,924 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
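
Conceptually, the strand flag decides whether Strand is part of the key that identifies a position. The following pandas-only sketch shows that idea on made-up data. The function here is a hypothetical stand-in, not the PyRanges method. It keeps the first row of each duplicate group, which is an assumption about which duplicate survives.

```python
import pandas as pd

# Toy data: three intervals at the same chr1 position, one on the minus strand.
df = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr2"],
    "Start": [100, 100, 100, 100],
    "End": [125, 125, 125, 125],
    "Strand": ["+", "-", "+", "+"],
})


def drop_duplicate_positions(frame, strand=True):
    """Keep the first row for each position; include Strand in the key if asked."""
    keys = ["Chromosome", "Start", "End"]
    if strand:
        keys.append("Strand")
    return frame.drop_duplicates(subset=keys)


stranded = drop_duplicate_positions(df, strand=True)
unstranded = drop_duplicate_positions(df, strand=False)
print(len(stranded), len(unstranded))
```

With strand=True the plus and minus intervals on chr1 count as different positions, so only the repeated plus-strand row is removed. With strand=False all three chr1 rows collapse into one.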

All columns, including Chromosome, Start, End and Strand, can be changed in any way you please, and more metadata columns can be added by setting them on the PyRanges object.

import pandas as pd
gr.Name = gr.Chromosome.astype(str) + "_" + pd.Series(range(len(gr)), index=gr.Chromosome.index).astype(str)
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | chr1_0     | 0         | +            |
## | chr1         | 169887529 | 169887554 | chr1_1     | 0         | +            |
## | chr1         | 216711011 | 216711036 | chr1_2     | 0         | +            |
## | chr1         | 144227079 | 144227104 | chr1_3     | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | chrY_9996  | 0         | -            |
## | chrY         | 13517892  | 13517917  | chrY_9997  | 0         | -            |
## | chrY         | 8010951   | 8010976   | chrY_9998  | 0         | -            |
## | chrY         | 7405376   | 7405401   | chrY_9999  | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
gr.Strand = "."
print(gr)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | chr1_0     | 0         | .            |
## | chr1         | 169887529 | 169887554 | chr1_1     | 0         | .            |
## | chr1         | 216711011 | 216711036 | chr1_2     | 0         | .            |
## | chr1         | 144227079 | 144227104 | chr1_3     | 0         | .            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | chrY_9996  | 0         | .            |
## | chrY         | 13517892  | 13517917  | chrY_9997  | 0         | .            |
## | chrY         | 8010951   | 8010976   | chrY_9998  | 0         | .            |
## | chrY         | 7405376   | 7405401   | chrY_9999  | 0         | .            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Unstranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome.
## Considered unstranded due to these Strand values: '.'
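
The last message suggests that a PyRanges counts as stranded only when every Strand value is a real strand. A minimal pandas sketch of such a check follows. It assumes that "+" and "-" are the only valid strand values, and is_stranded is a hypothetical helper, not part of the PyRanges API.

```python
import pandas as pd

# Assumption: only these two values mark an interval as stranded.
VALID_STRANDS = ["+", "-"]


def is_stranded(frame):
    """Return True if frame has a Strand column holding only valid strand values."""
    if "Strand" not in frame.columns:
        return False
    return bool(frame["Strand"].isin(VALID_STRANDS).all())


# Toy data, made up for illustration.
df = pd.DataFrame({
    "Chromosome": ["chr1", "chrY"],
    "Start": [10, 5],
    "End": [35, 30],
    "Strand": ["+", "-"],
})

before = is_stranded(df)
df["Strand"] = "."  # same assignment as gr.Strand = "." above
after = is_stranded(df)
print(before, after)
```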