Hi, I was wondering if there is an R package for PCA for big data. I'm working with a data frame with more than 80000 variables.
Thank you!
I typically use irlba::prcomp_irlba for truncated principal components of large matrices.
https://www.rdocumentation.org/packages/irlba/versions/2.3.5/topics/prcomp_irlba
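A minimal sketch of how that call might look, assuming the irlba package is installed. The matrix here is a small random stand-in (200 observations by 5000 variables), not the poster's data, and the choice of 10 components is arbitrary.

```r
library(irlba)

set.seed(1)
# Toy stand-in for a wide dataset: 200 observations (rows) x 5000 variables (columns).
x <- matrix(rnorm(200 * 5000), nrow = 200, ncol = 5000)

# Compute only the first 10 principal components via a truncated SVD,
# instead of the full decomposition that prcomp() would compute.
pca <- prcomp_irlba(x, n = 10, center = TRUE, scale. = FALSE)

dim(pca$x)         # scores: observations x components
dim(pca$rotation)  # loadings: variables x components
```

The number of requested components has to be smaller than both dimensions of the matrix.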
As far as I know, none of the PCA implementations care about the number of variables. It will take longer and require more memory to calculate with 80000 than with 80 variables, but PCA is still one of the fastest dimensionality reduction techniques. It sounds like you are having a memory problem.
I just created a random dataset with 10000 points and 80000 features. That took about 25 minutes. Calculating the first 50 PCs took 48 minutes altogether, most of which (44 minutes) was spent on data loading and normalization.
So it can definitely be done, assuming a computer with reasonable memory (I'd say 32-64 GB, depending on the number of data points). I know you didn't ask for a Python implementation, but in case the R packages don't pan out:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
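As a rough check on the memory figure mentioned above, here is a back-of-the-envelope estimate for a 10000 x 80000 matrix. It assumes dense double-precision storage (8 bytes per value); centering, scaling and the decomposition itself usually need additional working copies, so the real requirement is a multiple of this.

```r
n_obs  <- 10000   # data points, as in the random test dataset
n_vars <- 80000   # features
bytes_per_double <- 8

bytes <- n_obs * n_vars * bytes_per_double
gb  <- bytes / 1e9    # decimal gigabytes
gib <- bytes / 2^30   # binary gibibytes

cat(sprintf("One dense copy: %.1f GB (%.2f GiB)\n", gb, gib))
# prints: One dense copy: 6.4 GB (5.96 GiB)
```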
Finally, as already suggested, you may want to consider a truncated SVD (tSVD) to reduce the dataset before plugging it into PCA, although some PCA implementations already use tSVD (not the fastest approach). It is very likely that the majority of your features are not informative, and tSVD will make the data more manageable for PCA and potentially other downstream applications.
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
"consider a truncated SVD (tSVD)"
This is what the svds() function in the R package rARPACK does.
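A sketch of PCA done through a truncated SVD with svds(), assuming rARPACK is installed. The data are random and the variable names (xc, scores, loadings) are only illustrative. The columns are centered first, because it is the SVD of the centered matrix that corresponds to PCA.

```r
library(rARPACK)

set.seed(1)
# Toy matrix: 200 observations (rows) x 5000 variables (columns).
x <- matrix(rnorm(200 * 5000), nrow = 200, ncol = 5000)

# Center each column so the SVD of the result matches a PCA of x.
xc <- scale(x, center = TRUE, scale = FALSE)

k <- 10
s <- svds(xc, k = k)            # only the k largest singular triplets

scores   <- s$u %*% diag(s$d)   # principal component scores (200 x k)
loadings <- s$v                 # principal axes (5000 x k)
sdev     <- s$d / sqrt(nrow(xc) - 1)  # component standard deviations
```

On a matrix this small the standard deviations can be compared directly against prcomp(x)$sdev[1:k] as a sanity check.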
My PCAtools package is fine for 'big data', thanks to implementations by Aaron Lun. In it, PCA is actually performed via BiocSingular::runPCA(), which means that it also supports parallelised computation.
https://bioconductor.org/packages/release/bioc/html/PCAtools.html
Kevin
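For illustration, a sketch that calls BiocSingular::runPCA() directly rather than going through PCAtools, assuming BiocSingular and BiocParallel are installed from Bioconductor. The random matrix, the rank of 10 and the choice of the IRLBA backend are assumptions made for the example, not settings taken from the post above.

```r
library(BiocSingular)
library(BiocParallel)

set.seed(1)
# Toy matrix: 200 observations (rows) x 5000 variables (columns).
x <- matrix(rnorm(200 * 5000), nrow = 200, ncol = 5000)

res <- runPCA(
  x,
  rank    = 10,
  center  = TRUE,
  scale   = FALSE,
  BSPARAM = IrlbaParam(),   # approximate, truncated SVD backend
  BPPARAM = SerialParam()   # can be swapped for a parallel backend such as MulticoreParam()
)

dim(res$x)         # scores: observations x components
dim(res$rotation)  # loadings: variables x components
```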
What question are you trying to address with PCA, and why is having so many variables a problem?
I personally use FactoMineR, but whatever the number of variables, prcomp() should work. The more variables you have, the longer it will take, but I do not see any other drawback.
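A small base-R sketch of that point: prcomp() accepts a matrix with far more variables than observations. The data are random, and both the sizes and the rank. = 5 cut-off are arbitrary choices for the example.

```r
set.seed(1)
# Toy wide matrix: 100 observations (rows) x 2000 variables (columns).
x <- matrix(rnorm(100 * 2000), nrow = 100, ncol = 2000)

# rank. limits how many components are returned in the scores and loadings.
pca <- prcomp(x, center = TRUE, scale. = TRUE, rank. = 5)

dim(pca$x)         # scores: observations x components
dim(pca$rotation)  # loadings: variables x components
```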
I attach the model I use to perform the PCA and the error I get.
Should it be installed outside R?