18 Finding the k closest intervals

With the k_nearest method, you can find, for each interval in self, the k nearest intervals in other.

import pyranges as pr
gr = pr.data.chipseq()
gr2 = pr.data.chipseq_background()
print(gr.k_nearest(gr2, suffix="_Input"))
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome   | Start     | End       | Name       | Score     | +7    |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | ...   |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | ...   |
## | chr1         | 169887529 | 169887554 | U0         | 0         | ...   |
## | chr1         | 216711011 | 216711036 | U0         | 0         | ...   |
## | chr1         | 144227079 | 144227104 | U0         | 0         | ...   |
## | ...          | ...       | ...       | ...        | ...       | ...   |
## | chrY         | 15224235  | 15224260  | U0         | 0         | ...   |
## | chrY         | 13517892  | 13517917  | U0         | 0         | ...   |
## | chrY         | 8010951   | 8010976   | U0         | 0         | ...   |
## | chrY         | 7405376   | 7405401   | U0         | 0         | ...   |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 10,000 rows and 12 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 7 hidden columns: Strand, Start_b, End_b, Name_b, Score_b, Strand_b, Distance

Like the nearest method, k_nearest takes a strandedness option, which can be "same", "opposite" or False/None. The example below uses nearest:

print(gr.nearest(gr2, suffix="_Input", strandedness="opposite"))
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome   | Start     | End       | Name       | Score     | +7    |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | ...   |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1         | 226987592 | 226987617 | U0         | 0         | ...   |
## | chr1         | 1541598   | 1541623   | U0         | 0         | ...   |
## | chr1         | 1599121   | 1599146   | U0         | 0         | ...   |
## | chr1         | 3504032   | 3504057   | U0         | 0         | ...   |
## | ...          | ...       | ...       | ...        | ...       | ...   |
## | chrY         | 21751211  | 21751236  | U0         | 0         | ...   |
## | chrY         | 21910706  | 21910731  | U0         | 0         | ...   |
## | chrY         | 22054002  | 22054027  | U0         | 0         | ...   |
## | chrY         | 22210637  | 22210662  | U0         | 0         | ...   |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 10,000 rows and 12 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 7 hidden columns: Strand, Start_Input, End_Input, Name_Input, Score_Input, Strand_Input, ... (+ 1 more.)

The k_nearest method takes four further options: how, overlap, ties and k.

how can take the values None, "upstream" or "downstream". "upstream" and "downstream" are always relative to the PyRanges the method is called on. The default is None, which means that PyRanges looks in both directions.

overlap is a bool that indicates whether overlapping intervals should be included.

ties is the method used to resolve ties, that is, intervals at an equal distance from the query interval. None means that you get all ties, which might be more than k intervals if several share the same distance. "first" and "last" give you the first or last interval for each separate distance. "different" gives you all nearest intervals from k different distances.

k is the number of nearest intervals to find. It can also be a vector with one value per query interval.
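The tie rules above can be modelled with a small pure-Python sketch. This is a toy illustration of the behaviour as described in the text, not PyRanges' own implementation. The function name resolve_ties is hypothetical, and the choice to keep one interval from each of the k smallest distances for "first" and "last" is an assumption.

```python
from itertools import groupby


def resolve_ties(distances, k, ties=None):
    """Toy model of the tie rules described above (hypothetical helper).

    distances: non-negative distances of the candidate intervals to one
    query interval, in candidate order.
    Returns the indices of the candidates that are kept.
    """
    # Rank candidates by distance; sorted() is stable, so candidates with
    # equal distance keep their original relative order.
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    # Group the ranked candidates that share the same distance.
    groups = [list(g) for _, g in groupby(order, key=lambda i: distances[i])]

    if ties == "different":
        # Every candidate belonging to the k smallest distinct distances.
        return [i for g in groups[:k] for i in g]
    if ties == "first":
        # Assumption: one (the first) candidate per distinct distance.
        return [g[0] for g in groups[:k]]
    if ties == "last":
        # Assumption: one (the last) candidate per distinct distance.
        return [g[-1] for g in groups[:k]]

    # ties=None: take whole groups until at least k candidates are kept,
    # so the result can be longer than k when distances are tied.
    kept = []
    for g in groups:
        if len(kept) >= k:
            break
        kept.extend(g)
    return kept


candidate_distances = [5, 2, 2, 9, 5]
print(resolve_ties(candidate_distances, k=1))                    # [1, 2]
print(resolve_ties(candidate_distances, k=2, ties="first"))      # [1, 0]
print(resolve_ties(candidate_distances, k=2, ties="last"))       # [2, 4]
print(resolve_ties(candidate_distances, k=2, ties="different"))  # [1, 2, 0, 4]
```

With k=1 and ties=None, two candidates are returned because both lie at distance 2, which is the "more than k" case mentioned above.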

import pyranges as pr
gr = pr.data.chipseq()
gr2 = pr.data.chipseq_background()
gr.k_nearest(gr2, suffix="_Input", k=[1, 2] * 5000).print()
## +--------------+-----------+-----------+------------+-----------+-------+
## | Chromosome   | Start     | End       | Name       | Score     | +7    |
## | (category)   | (int32)   | (int32)   | (object)   | (int64)   | ...   |
## |--------------+-----------+-----------+------------+-----------+-------|
## | chr1         | 212609534 | 212609559 | U0         | 0         | ...   |
## | chr1         | 169887529 | 169887554 | U0         | 0         | ...   |
## | chr1         | 169887529 | 169887554 | U0         | 0         | ...   |
## | chr1         | 216711011 | 216711036 | U0         | 0         | ...   |
## | ...          | ...       | ...       | ...        | ...       | ...   |
## | chrY         | 13517892  | 13517917  | U0         | 0         | ...   |
## | chrY         | 8010951   | 8010976   | U0         | 0         | ...   |
## | chrY         | 7405376   | 7405401   | U0         | 0         | ...   |
## | chrY         | 7405376   | 7405401   | U0         | 0         | ...   |
## +--------------+-----------+-----------+------------+-----------+-------+
## Stranded PyRanges object has 15,000 rows and 12 columns from 24 chromosomes.
## For printing, the PyRanges was sorted on Chromosome and Strand.
## 7 hidden columns: Strand, Start_b, End_b, Name_b, Score_b, Strand_b, Distance
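The row count in the output above follows from the k vector. A minimal arithmetic check, assuming gr has 10,000 intervals (as the earlier output suggests) and that each query returns exactly its requested number of neighbours; the variable names are illustrative:

```python
# One k value per query interval, alternating between 1 and 2 neighbours.
k = [1, 2] * 5000

n_queries = len(k)     # one entry per interval in gr
expected_rows = sum(k) # total rows if each query returns exactly k[i] hits

print(n_queries, expected_rows)  # 10000 15000
```

Ties or chromosomes with too few candidate intervals could make the real count differ from this sum.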

Note that nearest intervals that are upstream of the query interval have a negative distance.
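The sign convention can be sketched for a single pair of intervals. This is an illustration only: the function signed_distance is hypothetical, it treats the query as lying on the forward strand (so upstream means lower coordinates), and the +1 offset that gives adjacent intervals a distance of 1 is an assumption rather than a documented guarantee.

```python
def signed_distance(query, candidate):
    """Signed distance between two half-open (start, end) intervals.

    Negative when the candidate lies upstream of the query, positive when
    it lies downstream, and 0 when the two intervals overlap.
    """
    q_start, q_end = query
    c_start, c_end = candidate
    if c_end <= q_start:
        # Candidate ends before the query begins: upstream, negative.
        return -(q_start - c_end + 1)
    if c_start >= q_end:
        # Candidate begins after the query ends: downstream, positive.
        return c_start - q_end + 1
    # Otherwise the intervals share at least one position.
    return 0


print(signed_distance((100, 110), (80, 90)))    # -11
print(signed_distance((100, 110), (120, 130)))  # 11
print(signed_distance((100, 110), (105, 115)))  # 0
```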