Question

How to calculate correlations with sparse data

2

8 months ago

leranwangcs ▴ 150

Hi,

I have a continuous variable A, which is not sparse, and a group of continuous variables that are very sparse (some of them have only one non-zero value). I want to calculate the correlation between variable A and each variable in the group. I used cor.test() from the R package stats, whose default method is Pearson's correlation. However, the results do not look trustworthy: based on the p-value, a variable with only one non-zero value shows the most significant correlation with variable A.

I wonder whether I'm using the wrong test for this type of data. What is a better way to calculate these correlations?

Thanks!
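
A minimal Python sketch of why this can happen, using simulated data (the sample size, the values of `a`, and the extreme value at position 0 are assumptions, not the poster's data). When a variable has a single non-zero entry, its Pearson correlation with A reduces to the z-score of A at that one position divided by sqrt(n - 1), so one unusual A value can be enough to produce a small p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50                      # hypothetical sample size
a = rng.normal(size=n)      # dense variable A (simulated)
a[0] = 6.0                  # assume A happens to be extreme at one sample

x = np.zeros(n)             # sparse variable: a single non-zero value
x[0] = 1.0

r, p = stats.pearsonr(a, x)

# With a single non-zero entry in x, r equals the z-score of a[0]
# (population standard deviation, ddof=0) scaled by 1 / sqrt(n - 1).
z0 = (a[0] - a.mean()) / a.std()
r_from_z = z0 / np.sqrt(n - 1)

print(f"r = {r:.3f}, p = {p:.2e}, z0 / sqrt(n - 1) = {r_from_z:.3f}")
```

In other words, the test here is effectively asking whether A is unusual at one sample, which is probably not the question of interest.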

Correlations • 1.2k views

updated 8 months ago by Mensur Dlakic ★ 29k • written 8 months ago by leranwangcs ▴ 150

1

I'm not sure whether this has a foundation in statistics.

I suggest you try doing a singular value decomposition on both datasets, then take the first 10 components and calculate the correlations of those vectors.

8 months ago by Mensur Dlakic ★ 29k

0

Hmm, perhaps try a distance metric like mean squared deviation?

8 months ago by dsull ★ 7.4k

0

Thanks for the suggestion! Could you please give me some more details on how to do this?

Thanks so much!

8 months ago by leranwangcs ▴ 150

0

You will need truncated SVD for sparse data. Take your data matrix, choose the number of components (I suggest 5-10), and that is pretty much it.

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
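
A rough sketch of what this could look like with scikit-learn's `TruncatedSVD`, on simulated data. The matrix dimensions, the 5% density, and the choice to correlate A with each component's per-sample scores are assumptions made for illustration; this is one reading of the suggestion, not a validated procedure.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
n_samples, n_vars = 200, 40          # hypothetical dimensions

# Simulated sparse variables: roughly 5% of entries are non-zero.
mask = rng.random((n_samples, n_vars)) < 0.05
X = sparse.csr_matrix(mask * rng.exponential(size=(n_samples, n_vars)))
a = rng.normal(size=n_samples)       # simulated dense variable A

# Reduce the sparse variables to a few dense components.
svd = TruncatedSVD(n_components=5, random_state=0)
scores = svd.fit_transform(X)        # dense array, shape (n_samples, 5)

# Pearson correlation between A and each component's sample scores.
corrs = np.array(
    [np.corrcoef(a, scores[:, k])[0, 1] for k in range(scores.shape[1])]
)
print(corrs.round(3))
```

Each component mixes several of the original sparse variables, so a correlation with a component says something about a combination of variables rather than about any single one.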

8 months ago by Mensur Dlakic ★ 29k

score 0 · Answer 1 · 2024-07-31

0

8 months ago

Jeremy ▴ 930

I would suggest setting method = 'spearman', which can detect monotonic relationships even when they are non-linear. With MSE, I think variables with more non-zero values will have a shorter distance, but I'm not sure that would really measure correlation.

8 months ago by Jeremy ▴ 930

2

Spearman doesn't work well with sparse data: it is rank-based, and a large number of zeros produces many tied ranks that are hard to order. I think Kendall's tau works better as a nonparametric approach.
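
For comparison, a small sketch that computes both rank correlations on simulated data with many zeros (the sample size and the 10% non-zero fraction are assumptions). SciPy's `kendalltau` computes the tau-b variant by default, which adjusts for ties. In R, the corresponding call would be `cor.test(A, x, method = "kendall")`, where `x` stands for one of the sparse variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100                                  # hypothetical sample size
a = rng.normal(size=n)                   # simulated dense variable A

x = np.zeros(n)                          # simulated sparse variable
nonzero_idx = rng.choice(n, size=10, replace=False)
x[nonzero_idx] = rng.exponential(size=10)

rho, p_rho = stats.spearmanr(a, x)       # tied zeros receive average ranks
tau, p_tau = stats.kendalltau(a, x)      # tau-b by default, adjusts for ties

print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
print(f"Kendall tau-b = {tau:.3f} (p = {p_tau:.3f})")
```

This only shows how to obtain both statistics side by side; it does not by itself establish which one is more appropriate for a given dataset.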

The issue appears to stem from sparsity rather than from non-linearity.

Distance metrics are nice for measuring associations. If you look at the formula for R^2, it is essentially one minus a standardized version of the MSE (the MSE divided by the variance), so I might suggest trying out different distance metrics.
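
The relationship between R^2 and the MSE can be checked numerically. This sketch assumes the MSE is the mean squared residual of a least-squares line with an intercept, which may not be exactly what the reply had in mind; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)                 # simulated predictor
y = 2.0 * x + rng.normal(size=100)       # simulated response

# Least-squares line y_hat = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

mse = np.mean((y - y_hat) ** 2)          # mean squared residual
r_squared = 1.0 - mse / np.var(y)        # np.var uses ddof=0, matching the mean in MSE

r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
print(f"1 - MSE/Var(y) = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

For a straight-line fit with an intercept, the two printed values agree, which is the sense in which R^2 is a variance-standardized MSE.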

8 months ago by dsull ★ 7.4k