how to calculate correlations between sparse data
6 weeks ago
leranwangcs ▴ 140

Hi,

I have a continuous variable A which is not sparse, and a group of continuous variables which are very sparse (some of them have only one non-zero value). I want to calculate the correlation between A and each variable in the group. I used cor.test() from the R package stats, whose default method is Pearson. However, the results do not look trustworthy: a variable with only one non-zero value shows the most significant correlation with A based on the p-value.

I wonder if I'm using the wrong test for this type of data. What is a better way to calculate these correlations?

Thanks!

Correlations • 488 views

Not sure if this has foundation in statistics.

I suggest you try doing a singular value decomposition on both datasets, then take the first 10 components and calculate the correlations of those vectors.


Hmm, perhaps try a distance metric like mean squared deviation?


Thanks for the suggestion! Could you please give me some more details on how to do this?

Thanks so much!


You will need truncated SVD for sparse data. Take your data matrix, choose the number of components (I suggest 5-10), and that is pretty much it.

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
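
One possible reading of this suggestion, as a rough sketch rather than a validated method: decompose the matrix of sparse variables, keep the first few components, and correlate A with each component score. The data, the names `X`, `A`, and `k`, and the use of plain NumPy instead of scikit-learn are all assumptions made for illustration. Like TruncatedSVD, the sketch does not center the columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 30 mostly-zero variables, one dense variable A.
n_samples, n_vars, k = 50, 30, 5
X = rng.poisson(0.2, size=(n_samples, n_vars)).astype(float)
A = rng.normal(size=n_samples)

# Full SVD, then truncate to the first k components (no centering).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :k] * s[:k]  # per-sample scores, same as X @ Vt[:k].T

# Pearson correlation between A and each component score.
cors = [float(np.corrcoef(A, scores[:, j])[0, 1]) for j in range(k)]
print(np.round(cors, 3))
```

Note that this correlates A with mixtures of the sparse variables, not with each variable individually, so it answers a somewhat different question from the original one.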

6 weeks ago
Jeremy ▴ 910

I would suggest setting method = 'spearman', which can detect non-linear (monotonic) correlations. With MSE, I think variables with more non-zero values will have a shorter distance, but I'm not sure that would really measure correlation.


Spearman doesn't work well with sparsity -- it is based on ranks, and a bunch of zeroes all tie for the same rank. I think Kendall's tau works better as a nonparametric approach.
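
To make the tie handling concrete, here is a minimal pure-Python sketch of Kendall's tau-b, the variant that corrects for tied pairs. The function name and the toy data are made up for illustration; in practice `cor.test(..., method = "kendall")` in R or `scipy.stats.kendalltau` in Python would be the usual route. Whether tau actually behaves better than Spearman on a given sparse data set is something to check rather than assume.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    n = len(x)
    concordant = discordant = 0
    tied_x = tied_y = 0  # a pair tied in both variables counts in each
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        return float("nan")  # one variable is constant, tau is undefined
    return (concordant - discordant) / denom

# Hypothetical data: A is dense, sparse_var is zero in four of six samples.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sparse_var = [0.0, 0.0, 0.0, 0.0, 2.0, 5.0]
tau = kendall_tau_b(A, sparse_var)
print(tau)
```

The pairs of zeroes are excluded from one factor of the denominator instead of being forced into an arbitrary order, which is the sense in which tau-b accounts for ties.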

The issue doesn't appear to be non-linearity; it appears to be sparsity.
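
A small sketch of why sparsity alone can produce the behaviour described in the question. When a variable has a single non-zero value, its Pearson correlation with A depends only on how far A lies from its own mean in that one sample, not on the size of the non-zero value. The data below are invented for illustration, and `pearson` is a hand-written helper rather than a library function.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: A is dense; the sparse variables are zero except in the last sample.
A = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.0, 2.1, 9.5]
sparse_big = [0.0] * 9 + [7.0]
sparse_small = [0.0] * 9 + [0.001]

r_big = pearson(A, sparse_big)
r_small = pearson(A, sparse_small)

# Closed form when only sample k is non-zero (and positive):
# r = (A_k - mean(A)) / sqrt(S_AA * (n - 1) / n), with S_AA = sum((A_i - mean(A))^2)
n = len(A)
mean_a = sum(A) / n
s_aa = sum((a - mean_a) ** 2 for a in A)
r_closed = (A[-1] - mean_a) / math.sqrt(s_aa * (n - 1) / n)

print(r_big, r_small, r_closed)
```

In other words, the test ends up asking whether the one sample with a non-zero value is an outlier in A, which may explain the very small p-value.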

Distance metrics are nice for measuring associations. If you look at the formula for R^2, it is actually a standardized version of the MSE, so I might suggest trying out different distance metrics.
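
For reference, the relationship mentioned here can be written out as follows, assuming population (divide-by-n) variances, with \hat{y}_i denoting fitted values and z_x, z_y the standardized versions of x and y:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
    = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)},
\qquad
\frac{1}{n}\sum_i \left(z_{x,i} - z_{y,i}\right)^2 = 2\,(1 - r)
```

The second identity implies that a mean squared deviation computed on standardized variables is just a rescaling of Pearson's r, so it would presumably carry over the same sensitivity to a single non-zero value. On unstandardized data the distance also reflects differences in scale.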
