23 Statistics: computing the Matthews correlation coefficient between ranges
The Matthews correlation coefficient (MCC) is a correlation coefficient for binary classifications that stays informative when the classes are heavily imbalanced. This makes it a good fit for comparing two sets of ranges, where the bases covered by intervals are typically a small fraction of the genome.
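For reference, the MCC is computed from the four cells of a 2x2 confusion table as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch of that formula in plain Python follows. The function name mcc_from_counts is hypothetical and not part of pyranges, and returning 0.0 when the denominator is zero is a convention assumed here. The counts in the usage lines are copied from the chip/input "+" row and the chip/chip "+" row of the output table later in this section.

```python
import math


def mcc_from_counts(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-table counts."""
    numerator = tp * tn - fp * fn
    # Python integers have arbitrary precision, so this product of
    # genome-sized counts is exact before the square root is taken.
    denominator_sq = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denominator_sq == 0:
        # Assumed convention: report an undefined MCC as 0.0.
        return 0.0
    return numerator / math.sqrt(denominator_sq)


# Counts taken from the chip/input row on the + strand.
print(round(mcc_from_counts(3, 114576, 3095454172, 125232), 6))
# Counts taken from the chip/chip row on the + strand.
print(mcc_from_counts(125235, 0, 3095568748, 0))
```

Under these assumptions the first call agrees with the tabulated value to six decimals, and comparing a set of ranges with itself gives a coefficient of 1.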
If you want to compute the MCC between two or more ranges, you can use pr.stats.mcc. You need to give the chromosome sizes as a PyRanges object. You can get these using pyranges_db.
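import pyranges as pr

# Example ChIP-seq data and the matching input (background) data
gr = pr.data.chipseq()
gr2 = pr.data.chipseq_background()
# Chromosome sizes, passed to mcc as the genome
chromsizes = pr.data.chromsizes()
# MCC for every pair of labelled ranges, computed per strand
mcc = pr.stats.mcc([gr, gr2], labels="chip input".split(), genome=chromsizes, strand=True)
print(mcc)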
##        T      F Strand      TP      FP          TN      FN       MCC
## 0   chip   chip      +  125235       0  3095568748       0  1.000000
## 1   chip   chip      -  122745       0  3095571238       0  1.000000
## 2   chip  input      +       3  114576  3095454172  125232 -0.000014
## 4   chip  input      -       0  118126  3095453112  122745 -0.000039
## 3  input   chip      +       3  125232  3095454172  114576 -0.000014
## 5  input   chip      -       0  122745  3095453112  118126 -0.000039
## 6  input  input      +  114579       0  3095579404       0  1.000000
## 7  input  input      -  118126       0  3095575857       0  1.000000
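The four count columns sum to the same total in every row of the output above (3,095,693,983), which is consistent with each genome position on a given strand being assigned to exactly one of TP, FP, TN or FN. The sketch below illustrates that kind of base-level bookkeeping on a toy chromosome in plain Python. It is not the pyranges implementation: the function names, the half-open interval convention and the toy coordinates are all assumptions made for illustration.

```python
def covered_positions(intervals):
    """Set of 0-based positions covered by half-open (start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def confusion_counts(truth, pred, chrom_size):
    """Base-level TP, FP, TN, FN for two interval sets on one chromosome."""
    t = covered_positions(truth)
    p = covered_positions(pred)
    tp = len(t & p)   # covered by both sets
    fp = len(p - t)   # covered only by the compared set
    fn = len(t - p)   # covered only by the reference set
    # Everything else on the chromosome is covered by neither set,
    # which is why the chromosome size is required.
    tn = chrom_size - tp - fp - fn
    return tp, fp, tn, fn


# Toy data: a 100-base chromosome with two small interval sets.
truth = [(0, 10), (20, 30)]
pred = [(5, 15), (40, 45)]
tp, fp, tn, fn = confusion_counts(truth, pred, chrom_size=100)
print(tp, fp, tn, fn)
```

In the table above, T appears to play the role of the reference set: for the rows with T = chip and F = input, TP + FN equals the TP of the corresponding chip/chip row.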
If you want to create a symmetric matrix from the result:
print(mcc.set_index(["Strand", "T", "F"]).MCC.unstack())
# or just mcc.set_index(["T", "F"]).MCC.unstack() in the unstranded case
## F                 chip     input
## Strand T
## +      chip   1.000000 -0.000014
##        input -0.000014  1.000000
## -      chip   1.000000 -0.000039
##        input -0.000039  1.000000