23 Statistics: computing the Matthews correlation coefficient between ranges

The Matthews correlation coefficient (MCC) is a correlation measure that stays informative when the classes in the data are heavily imbalanced. This makes it well suited to measuring how strongly two sets of ranges overlap, since the covered regions are typically a tiny fraction of the genome.

To compute the MCC between two or more sets of ranges, use pr.stats.mcc. It needs the chromosome sizes as a PyRanges object, which you can get using pyranges_db.

import pyranges as pr
# Example datasets bundled with pyranges
gr = pr.data.chipseq()
gr2 = pr.data.chipseq_background()
chromsizes = pr.data.chromsizes()
# strand=True reports the MCC separately for each strand
mcc = pr.stats.mcc([gr, gr2], labels="chip input".split(), genome=chromsizes, strand=True)
print(mcc)
##        T      F Strand      TP      FP          TN      FN       MCC
## 0   chip   chip      +  125235       0  3095568748       0  1.000000
## 1   chip   chip      -  122745       0  3095571238       0  1.000000
## 2   chip  input      +       3  114576  3095454172  125232 -0.000014
## 4   chip  input      -       0  118126  3095453112  122745 -0.000039
## 3  input   chip      +       3  125232  3095454172  114576 -0.000014
## 5  input   chip      -       0  122745  3095453112  118126 -0.000039
## 6  input  input      +  114579       0  3095579404       0  1.000000
## 7  input  input      -  118126       0  3095575857       0  1.000000
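
The MCC column can be checked by hand from the four counts in each row. The sketch below uses the standard definition, MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It is not pyranges' own code: the helper name mcc_from_counts is hypothetical, and returning 0.0 when the denominator is zero is an assumed convention.

```python
import math


def mcc_from_counts(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Hypothetical helper for illustration; not part of pyranges.
    """
    numerator = tp * tn - fp * fn
    # Python ints have arbitrary precision, so this large product cannot overflow.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Counts copied from the chip vs. input row on the + strand in the table above.
value = mcc_from_counts(tp=3, fp=114576, tn=3095454172, fn=125232)
print(f"{value:.6f}")  # -0.000014
```

With only 3 true positives against more than 100,000 false positives and false negatives, the two terms in the numerator nearly cancel relative to the denominator, which is why the coefficient is so close to zero despite the very large TN count.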

If you want to create a symmetric matrix from the result:

print(mcc.set_index(["Strand", "T", "F"]).MCC.unstack())
# or just mcc.set_index(["T", "F"]).MCC.unstack() in the unstranded case
## F                 chip     input
## Strand T                        
## +      chip   1.000000 -0.000014
##        input -0.000014  1.000000
## -      chip   1.000000 -0.000039
##        input -0.000039  1.000000