PCA for BIG DATA
2.5 years ago

Hi, I was wondering if there is an R package for PCA for big data. I'm working with a data frame with more than 80000 variables.

Thank you!

R PCA • 2.2k views

What question are you trying to address with PCA, and why is having so many variables a problem?


I personally use FactoMineR, but whatever the number of variables, prcomp() should work. The more variables you have, the longer it will take, but I do not see any other drawback.


I attach the model I use to perform the PCA and the error I get.

[Image not preserved: screenshot of the PCA code and the resulting error]


Should it be installed outside R?

2.5 years ago
4galaxy77 2.9k

I typically use irlba::prcomp_irlba for truncated principal components of large matrices.

https://www.rdocumentation.org/packages/irlba/versions/2.3.5/topics/prcomp_irlba

2.5 years ago
Mensur Dlakic ★ 28k

As far as I know, none of the PCA implementations care about the number of variables. It will take longer and require more memory to calculate with 80000 than with 80 variables, but PCA is still one of the fastest dimensionality reduction techniques. It sounds like you are having a memory problem.

I just created a random dataset with 10000 points and 80000 features, which took about 25 minutes. Calculating the first 50 PCs took 48 minutes altogether, most of which (44 minutes) was spent on data loading and normalization.
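
For a rough sense of why memory is the likely bottleneck, here is a back-of-the-envelope sketch of the size of a dense matrix with the dimensions of the test dataset above. The helper name `dense_matrix_gb` is made up for illustration, and the figure covers only one copy of the raw matrix. Intermediate copies made during loading, centering and scaling can multiply it.

```python
def dense_matrix_gb(n_samples: int, n_features: int, bytes_per_value: int = 8) -> float:
    """Size in GB (10**9 bytes) of one dense numeric matrix."""
    return n_samples * n_features * bytes_per_value / 1e9


# 10,000 samples x 80,000 features, as in the test dataset
float64_gb = dense_matrix_gb(10_000, 80_000)      # 64-bit floats
float32_gb = dense_matrix_gb(10_000, 80_000, 4)   # 32-bit floats

print(f"float64: {float64_gb:.1f} GB, float32: {float32_gb:.1f} GB")
# prints "float64: 6.4 GB, float32: 3.2 GB"
```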

So it can definitely be done, assuming a computer with reasonable memory (I'd say 32-64 GB, depending on the number of data points). I know you didn't ask for a Python implementation, but in case the R packages don't pan out:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
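
A minimal sketch of what this might look like with scikit-learn, not the exact script used for the timings above. The random matrix is a small stand-in for the real data, and the choice of `StandardScaler` for normalization and of the randomized solver are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Small stand-in for the real data: 200 samples x 1,000 features.
# float32 halves the memory footprint compared with float64.
X = rng.normal(size=(200, 1000)).astype(np.float32)

# Standardize each feature to zero mean and unit variance (assumed normalization).
X_scaled = StandardScaler().fit_transform(X)

# Request only the first 50 PCs; the randomized solver avoids a full SVD.
pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
scores = pca.fit_transform(X_scaled)

print(scores.shape)                       # (n_samples, n_components)
print(pca.explained_variance_ratio_[:5])  # variance explained by the first 5 PCs
```

Asking for a fixed number of components rather than all of them is what keeps the run time and memory manageable when the number of features is large.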

Finally, as already suggested, you may want to consider a truncated SVD (tSVD) to reduce the dataset before plugging it into PCA, although some PCA implementations already use tSVD (not the fastest approach). It is very likely that a majority of your features are not informative, and tSVD will make the data more manageable for PCA and potentially other downstream applications.

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
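
As a hedged illustration of the reduction step, the sketch below runs scikit-learn's `TruncatedSVD` on a small random stand-in matrix. One detail worth knowing is that `TruncatedSVD` does not center the data, so the columns are centered by hand here. With that done, its singular values should match those of an exact PCA, which the last lines check. The matrix size and `k = 20` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))  # small stand-in: 200 samples x 1,000 features

# TruncatedSVD works on the matrix as given, so center the columns first
# if PCA-like components are wanted.
X_centered = X - X.mean(axis=0)

k = 20  # number of components to keep (arbitrary for this example)
svd = TruncatedSVD(n_components=k, algorithm="arpack", random_state=0)
X_reduced = svd.fit_transform(X_centered)  # reduced data, one row per sample

# Compare against an exact PCA, which centers the data internally.
pca = PCA(n_components=k, svd_solver="full").fit(X)
same = np.allclose(svd.singular_values_, pca.singular_values_, rtol=1e-6)

print(X_reduced.shape, same)
```

The reduced matrix can then be passed to PCA or to other downstream methods in place of the full feature set.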


"consider a truncated SVD (tSVD)"

This is what the R package rARPACK lets you do, via its svds() function.

2.5 years ago

My PCAtools package is fine for 'big data', thanks to implementations by Aaron Lun. In it, PCA is actually performed via BiocSingular::runPCA(), which means that it also supports parallelised computation.

https://bioconductor.org/packages/release/bioc/html/PCAtools.html

Kevin
