10 Sorting PyRanges

PyRanges objects are always kept sorted on Chromosome and Strand to enable faster operations.

In addition, a PyRanges can be sorted on Start and End with the sort method:

import pyranges as pr
import pandas as pd
from io import StringIO
cs = pr.data.chipseq()
print(cs)
## +--------------+-----------+-----------+------------+-----------+--------------+
## | Chromosome   | Start     | End       | Name       | Score     | Strand       |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
## |--------------+-----------+-----------+------------+-----------+--------------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | +            |
## | chr1         | 169887529 | 169887554 | U0         | 0         | +            |
## | chr1         | 216711011 | 216711036 | U0         | 0         | +            |
## | chr1         | 144227079 | 144227104 | U0         | 0         | +            |
## | ...          | ...       | ...       | ...        | ...       | ...          |
## | chrY         | 15224235  | 15224260  | U0         | 0         | -            |
## | chrY         | 13517892  | 13517917  | U0         | 0         | -            |
## | chrY         | 8010951   | 8010976   | U0         | 0         | -            |
## | chrY         | 7405376   | 7405401   | U0         | 0         | -            |
## +--------------+-----------+-----------+------------+-----------+--------------+
## Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
cs_sorted = cs.sort()
print(cs_sorted)

This sorts the intervals on Chromosome, Strand, Start and then End.
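
As a rough illustration of that ordering, the plain-Python sketch below sorts a few of the chr1 intervals that appear in this chapter by the tuple (Chromosome, Strand, Start, End). It only models the resulting order and is not how pyranges implements sort; the `intervals` list and the `sort_key` helper are made up for the example.

```python
# Illustrative only: models the ordering, not the pyranges implementation.
# Each tuple is (Chromosome, Start, End, Strand), copied from chr1 rows in this chapter.
intervals = [
    ("chr1", 1599121, 1599146, "+"),
    ("chr1", 1325303, 1325328, "-"),
    ("chr1", 1541598, 1541623, "+"),
    ("chr1", 1820285, 1820310, "-"),
]


def sort_key(interval):
    """Hypothetical helper: order by Chromosome, Strand, Start, then End."""
    chromosome, start, end, strand = interval
    # Plain string comparison is enough here because only chr1 is present;
    # it would not order names such as chr2 and chr10 naturally.
    return (chromosome, strand, start, end)


by_strand_first = sorted(intervals, key=sort_key)
for row in by_strand_first:
    print(row)
```

Because Strand comes before Start in the key, all "+" intervals on a chromosome are listed before any "-" interval, and each strand group is ordered by Start and End.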

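Because a PyRanges is always kept sorted on Chromosome and Strand, custom sorting is best left until you are finished with the PyRanges-specific functionality. You can then extract the underlying data as a pandas DataFrame with the df attribute and sort that: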

df = cs_sorted.df
print(df.head())
##   Chromosome    Start      End Name  Score Strand
## 0       chr1  1541598  1541623   U0      0      +
## 1       chr1  1599121  1599146   U0      0      +
## 2       chr1  3504032  3504057   U0      0      +
## 3       chr1  3806532  3806557   U0      0      +
## 4       chr1  5079955  5079980   U0      0      +
df_sorted = df.sort_values(["Chromosome", "Start", "End"])
print(df_sorted.head(20))
##     Chromosome    Start      End Name  Score Strand
## 451       chr1  1325303  1325328   U0      0      -
## 0         chr1  1541598  1541623   U0      0      +
## 1         chr1  1599121  1599146   U0      0      +
## 452       chr1  1820285  1820310   U0      0      -
## 453       chr1  2448322  2448347   U0      0      -
## 454       chr1  3046141  3046166   U0      0      -
## 455       chr1  3437168  3437193   U0      0      -
## 2         chr1  3504032  3504057   U0      0      +
## 456       chr1  3637087  3637112   U0      0      -
## 457       chr1  3681903  3681928   U0      0      -
## 3         chr1  3806532  3806557   U0      0      +
## 458       chr1  3953790  3953815   U0      0      -
## 459       chr1  5037292  5037317   U0      0      -
## 4         chr1  5079955  5079980   U0      0      +
## 5         chr1  5233543  5233568   U0      0      +
## 6         chr1  5301327  5301352   U0      0      +
## 7         chr1  5431308  5431333   U0      0      +
## 8         chr1  5449222  5449247   U0      0      +
## 460       chr1  5481750  5481775   U0      0      -
## 461       chr1  5699351  5699376   U0      0      -

Now the df is sorted on Chromosome, Start and End, with the strands interleaved.
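
Other custom orders follow the same pattern. The sketch below is a minimal example, assuming pandas is installed. Instead of loading the chipseq data, it rebuilds a few of the chr1 rows printed above as a small DataFrame (the name `toy` is made up here), sorts them by Start in descending order, and renumbers the index with ignore_index=True.

```python
import pandas as pd

# A few chr1 rows copied from the output above; "toy" is a hypothetical name.
toy = pd.DataFrame(
    {
        "Chromosome": ["chr1", "chr1", "chr1", "chr1"],
        "Start": [1541598, 1599121, 1325303, 1820285],
        "End": [1541623, 1599146, 1325328, 1820310],
        "Strand": ["+", "+", "-", "-"],
    }
)

# Sort on Chromosome ascending and Start descending; ignore_index=True
# replaces the old row labels with 0..n-1 in the new order.
toy_sorted = toy.sort_values(
    ["Chromosome", "Start"], ascending=[True, False], ignore_index=True
)
print(toy_sorted)
```

Without ignore_index=True the original row labels are kept, as in the df_sorted output above, where labels such as 451 and 0 show which strand group each row came from.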