Hi everyone,
I am trying to get to grips with parallelization and the R package "snowfall".
My R code looks like this (propr is a package that computes correlations between counts from metagenomics data):
    library(propr)
    test <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rare.tsv",
                       header = TRUE, row.names = 1, sep = "\t")
    test <- t(test)
    propr <- propr(test, metric = "rho")
The correlation matrix I want to generate is huge (about 5 TB), which is why I am trying to learn how to parallelize the computation (I work on a cluster with 12 TB of memory and many CPUs, so computing power is not a problem).
But I really don't understand how to incorporate my R code into snowfall code. Would someone know how to do it?
Best, Vincent
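(For reference, the general snowfall pattern looks roughly like the sketch below; this is only an illustration, and the split of the columns into chunks is an assumption about how the work could be divided, not something taken from the original post.)

    library(snowfall)
    library(propr)
    sfInit(parallel = TRUE, cpus = 8)       # start the worker processes
    sfLibrary(propr)                        # load propr on every worker
    sfExport("test")                        # ship the data to the workers
    # hypothetical split of the columns into independent pieces of work
    chunks  <- split(seq_len(ncol(test)), ceiling(seq_len(ncol(test)) / 1000))
    results <- sfLapply(chunks, function(idx) propr(test[, idx], metric = "rho"))
    sfStop()                                # shut the workers down

The boilerplate is the easy part; the real question is whether the computation can be split into independent pieces at all.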
Thanks for answering me.
How could I know whether propr implements parallel functionality?
My code looks like this:
Does it look fine or not?
Thanks
No, that will not do anything. Also, why do you assign right with -> (just curious)? You should look up how mclapply() works; it functions in exactly the same way as lapply(). Just looking at your code, it may be something like the following, in pseudocode: apply the function propr() to t(data).
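(A minimal sketch of that idea, assuming the columns of t(data) can be processed in independent chunks; the chunk size of 1,000 and the per-chunk call to propr() are illustrative assumptions, not something the propr documentation prescribes.)

    library(parallel)
    library(propr)
    mat <- t(data)
    # split the columns into chunks and run propr() on each chunk on its own core
    chunks  <- split(seq_len(ncol(mat)), ceiling(seq_len(ncol(mat)) / 1000))
    results <- mclapply(chunks,
                        function(idx) propr(mat[, idx], metric = "rho"),
                        mc.cores = 8)

Note that a naive split like this only gives within-chunk correlations, not the full matrix, which is exactly why it matters how propr() works internally.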
I do not know anything about propr(), though. What is it doing to your data? You should study how it works internally, if possible, to see how it could be parallelised in different ways. Sometimes one has to edit the internal code to enable parallelisation, like I did for clusGap: https://github.com/kevinblighe/clusGapKB
I don't really know why I assign right; I'm just used to it, I guess.
propr is a package which computes correlations between compositional data, so:
I do not see anything in the propr documentation that indicates that it is designed for parallel processing, so, even registering cores will have no effect.
I looked at the actual code of the function, too, and I can see that it is not doing anything related to parallel processing. When you set it to do correlation, in fact, it just uses the cor() function from the base stats package. Did you not try bigcor(), from the propagate package?
I cannot see everything that you are trying at your console, so, my suggestions may be irrelevant.
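(For reference, a minimal bigcor() call might look like the sketch below; bigcor() lives in the propagate package, and the block size of 2,000 used here is just an illustrative choice.)

    library(propagate)
    # 'test' is the samples-by-OTUs matrix from the original post
    corr <- bigcor(test, fun = "cor", size = 2000, verbose = TRUE)
    # 'corr' is an ff-backed matrix, so it is not held entirely in RAM;
    # pull pieces into memory with ordinary subsetting, e.g. corr[1:10, 1:10]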
Thanks for the answer
Actually, I tried bigcor and it looks fine (I only tried it on my computer with a reduced data set, correlations between 15,000 OTUs, since I can't work on the cluster at the moment).
I will try it on the cluster tomorrow with my main dataset (correlations between 900,000 OTUs), when I can allocate a lot more memory (up to 12 TB).
Is the number of CPUs allocated to compute the correlations between the 900,000 OTUs relevant?
The use of multiple CPU cores to calculate a correlation matrix can increase the speed of generating it; however, that depends on how the correlation function is designed. I actually wrote a parallelised correlation function in 2016, but I was not happy with it, so I deleted it...
I think that bigcor can do it relatively quickly. It works by computing the correlations in sections and, I believe, saving these to disk in order to save memory.
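(To illustrate what "depends on how the correlation function is designed" means, here is a rough sketch of one way a correlation matrix can be computed in sections and spread over cores; block_cor() is a made-up helper for illustration, not part of bigcor or propr.)

    library(parallel)
    # hypothetical helper: compute the correlation matrix tile by tile
    block_cor <- function(x, block_size = 1000, cores = 8) {
      blocks <- split(seq_len(ncol(x)), ceiling(seq_len(ncol(x)) / block_size))
      pairs  <- expand.grid(i = seq_along(blocks), j = seq_along(blocks))
      pairs  <- pairs[pairs$i <= pairs$j, ]            # upper triangle is enough
      tiles  <- mclapply(seq_len(nrow(pairs)), function(k) {
        cor(x[, blocks[[pairs$i[k]]]], x[, blocks[[pairs$j[k]]]])
      }, mc.cores = cores)
      list(pairs = pairs, tiles = tiles)               # tiles still need reassembling
    }

Each tile is independent, which is what makes a design like this parallelisable; a function that does everything in one internal cor() call cannot be sped up simply by registering more cores.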
Of course, generating the correlation matrix is one thing... afterwards, you will still have to filter the data.
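(As a rough illustration of that filtering step, assuming the matrix comes back from bigcor() as an ff-backed object called corr and that a cut-off of |rho| >= 0.7 is wanted; both the object name and the threshold are assumptions.)

    library(ff)
    cutoff <- 0.7
    step   <- 5000                                   # rows pulled into RAM at a time
    n      <- nrow(corr)
    hits   <- list()
    for (start in seq(1, n, by = step)) {
      idx  <- start:min(start + step - 1, n)
      band <- corr[idx, , drop = FALSE]              # one band of the big matrix
      keep <- which(abs(band) >= cutoff, arr.ind = TRUE)
      keep <- keep[idx[keep[, 1]] != keep[, 2], , drop = FALSE]   # drop the diagonal
      if (nrow(keep) > 0) {
        hits[[length(hits) + 1]] <- data.frame(row   = idx[keep[, 1]],
                                               col   = keep[, 2],
                                               value = band[keep])
      }
    }
    hits <- do.call(rbind, hits)                     # pairs passing the cut-off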