10 Sorting PyRanges
PyRanges are always sorted on Chromosome and Strand to enable faster operations.
In addition, a PyRanges can be sorted on Start and End with the sort method:
import pyranges as pr
import pandas as pd
from io import StringIO
cs = pr.data.chipseq()
print(cs)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 212609534 | 212609559 | U0 | 0 | + |
## | chr1 | 169887529 | 169887554 | U0 | 0 | + |
## | chr1 | 216711011 | 216711036 | U0 | 0 | + |
## | chr1 | 144227079 | 144227104 | U0 | 0 | + |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | U0 | 0 | - |
## | chrY | 13517892 | 13517917 | U0 | 0 | - |
## | chrY | 8010951 | 8010976 | U0 | 0 | - |
## | chrY | 7405376 | 7405401 | U0 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
cs_sorted = cs.sort()
print(cs)  # cs itself is unchanged; the sorted result is in cs_sorted
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome | Start | End | Name | Score | Strand |
## | (category) | (int32) | (int32) | (object) | (int64) | (category) |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1 | 212609534 | 212609559 | U0 | 0 | + |
## | chr1 | 169887529 | 169887554 | U0 | 0 | + |
## | chr1 | 216711011 | 216711036 | U0 | 0 | + |
## | chr1 | 144227079 | 144227104 | U0 | 0 | + |
## | ... | ... | ... | ... | ... | ... |
## | chrY | 15224235 | 15224260 | U0 | 0 | - |
## | chrY | 13517892 | 13517917 | U0 | 0 | - |
## | chrY | 8010951 | 8010976 | U0 | 0 | - |
## | chrY | 7405376 | 7405401 | U0 | 0 | - |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
Calling sort orders the intervals by Chromosome and Strand first, then by Start and then End.
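The resulting order can be illustrated without PyRanges at all. The following is a minimal sketch in plain Python, using a handful of made-up intervals (not taken from the chipseq data), that sorts by the key (Chromosome, Strand, Start, End). It only mirrors the ordering described here; it is not how PyRanges implements sort internally, and plain string comparison of chromosome names is an assumption that happens to work for this toy data.

```python
# Hypothetical toy intervals: (Chromosome, Start, End, Strand).
intervals = [
    ("chr1", 300, 325, "-"),
    ("chr1", 200, 225, "+"),
    ("chr2", 50, 75, "+"),
    ("chr1", 100, 125, "-"),
    ("chr1", 200, 210, "+"),
]

# Sort by Chromosome, then Strand, then Start, then End.
ordered = sorted(intervals, key=lambda iv: (iv[0], iv[3], iv[1], iv[2]))

for iv in ordered:
    print(iv)
```

Within chr1 all "+" intervals come before the "-" intervals, and ties on Start are broken by End.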
Custom sorting is best left until you have finished with the PyRanges-specific functionality. At that point you can extract the underlying data as a pandas DataFrame with the df attribute and sort it:
df = cs_sorted.df
print(df.head())
## Chromosome Start End Name Score Strand
## 0 chr1 1541598 1541623 U0 0 +
## 1 chr1 1599121 1599146 U0 0 +
## 2 chr1 3504032 3504057 U0 0 +
## 3 chr1 3806532 3806557 U0 0 +
## 4 chr1 5079955 5079980 U0 0 +
df_sorted = df.sort_values(["Chromosome", "Start", "End"])
print(df_sorted.head(20))
## Chromosome Start End Name Score Strand
## 451 chr1 1325303 1325328 U0 0 -
## 0 chr1 1541598 1541623 U0 0 +
## 1 chr1 1599121 1599146 U0 0 +
## 452 chr1 1820285 1820310 U0 0 -
## 453 chr1 2448322 2448347 U0 0 -
## 454 chr1 3046141 3046166 U0 0 -
## 455 chr1 3437168 3437193 U0 0 -
## 2 chr1 3504032 3504057 U0 0 +
## 456 chr1 3637087 3637112 U0 0 -
## 457 chr1 3681903 3681928 U0 0 -
## 3 chr1 3806532 3806557 U0 0 +
## 458 chr1 3953790 3953815 U0 0 -
## 459 chr1 5037292 5037317 U0 0 -
## 4 chr1 5079955 5079980 U0 0 +
## 5 chr1 5233543 5233568 U0 0 +
## 6 chr1 5301327 5301352 U0 0 +
## 7 chr1 5431308 5431333 U0 0 +
## 8 chr1 5449222 5449247 U0 0 +
## 460 chr1 5481750 5481775 U0 0 -
## 461 chr1 5699351 5699376 U0 0 -
Now the DataFrame is sorted on Chromosome, Start and End, with the two strands interleaved.
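To see the effect of leaving Strand out of the sort keys on something small enough to check by eye, here is a minimal sketch using a hand-made pandas DataFrame. The values are invented for illustration and are not part of the chipseq data.

```python
import pandas as pd

# Hypothetical toy data, already grouped by Chromosome and Strand.
df = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr1"],
    "Start": [200, 400, 100, 300],
    "End": [225, 425, 125, 325],
    "Strand": ["+", "+", "-", "-"],
})

# Sorting without Strand interleaves the two strands by position.
interleaved = df.sort_values(["Chromosome", "Start", "End"])
print(interleaved["Strand"].tolist())

# Including Strand in the keys keeps each strand together.
grouped = df.sort_values(["Chromosome", "Strand", "Start", "End"])
print(grouped["Strand"].tolist())
```

As in the output above, sort_values keeps the original index labels, so the old row positions remain visible after sorting; call reset_index(drop=True) on the result if a fresh 0..n-1 index is preferred.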