Hi, I am trying to perform a large number of case-control status permutations. This is my code:
Y <- matrix(0, 1285, 100000)
Y <- apply(Y, c(1, 2), function(x) sample(c(0, 1), 1))
NAMES <- names(df[, 51:7298])
res <- as.data.frame(matrix(nrow = 100000, ncol = length(NAMES), dimnames = NULL))
for (s in 1:100000) {
  for (i in 51:7298) {
    fit <- glm(Y[, s] ~ df$cov1 + df[, i] + df$cov2 + df$cov3,
               data = df, family = binomial())
    sfit <- summary(fit)
    j <- i - 50
    res[s, j] <- as.numeric(sfit$coefficients[3, 4])
  }
}
colnames(res) <- NAMES  # set the column names once, outside the loops
It's too slow in R. Would it be better to use C++ or Python? I'm a beginner in both Python and C++, so any script suggestions are welcome!
Thank you
If time really matters, C++ is the best.
Yours is an embarrassingly parallel problem. Learning to recognize such cases, and how to handle them, will go a long way in helping you deal with heavy computations. As already mentioned, check the various R packages available for this. This is also where you'd make use of a compute cluster if you have access to one. Recoding in a fast language (C, C++... but not Python) should be the absolute last resort (e.g. when all you can use is a laptop), or done as an exercise.
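As a minimal sketch of what parallelizing the outer loop could look like with base R's parallel package (the worker function, core count, and export list here are illustrative assumptions, not code from the original post):

```r
library(parallel)

# Hypothetical sketch: run each permutation s on a separate worker.
# Assumes 'df' and the 1285 x 100000 permutation matrix 'Y' already exist.
run_one_perm <- function(s) {
  pvals <- numeric(7298 - 50)
  for (i in 51:7298) {
    fit <- glm(Y[, s] ~ df$cov1 + df[, i] + df$cov2 + df$cov3,
               family = binomial())
    pvals[i - 50] <- summary(fit)$coefficients[3, 4]
  }
  pvals
}

cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("df", "Y"))
res_list <- parLapply(cl, 1:100000, run_one_perm)
stopCluster(cl)
res <- do.call(rbind, res_list)  # 100000 x 7248 matrix of p-values
```

On a cluster you would replace this single-machine setup with your scheduler's mechanism for distributing jobs across nodes, keeping only a slice of the permutations on each node.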
Thank you very much Jean-Karim. The problem is that I am failing even on a compute cluster. The loop I wrote takes a lot of memory, exceeding what the cluster can handle (or maybe I am not using it in the best way...). That's why I am looking at how to improve this or to write it in another language... C++ is still not easy for me as a beginner.
Start by ensuring that the outer loop sends jobs to different nodes of your cluster. Rewriting without parallelization may not solve your memory issue if your cluster nodes don't have the required RAM.

The way your code is written, you keep data for each run in memory, so your memory use keeps on growing. Your res matrix is about 5.5 GB and your Y matrix is 1 GB, so this piece of code needs at least 7 GB, not counting the data in df and the space required to compute the regression models.

Also, stick with matrices if you don't need the flexibility/features of data frames. Accessing elements of a data frame can be several orders of magnitude slower than with a matrix, as you can see with this:
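The timing code this refers to did not survive in the post; a sketch of the kind of comparison presumably meant, using the microbenchmark package (the sizes and index choices here are illustrative assumptions):

```r
library(microbenchmark)

# Same numeric data held as a matrix and as a data frame
m <- matrix(rnorm(1000 * 100), nrow = 1000)
d <- as.data.frame(m)

# Compare repeated single-element access in each structure
microbenchmark(
  matrix_access    = for (j in 1:100) x <- m[500, j],
  dataframe_access = for (j in 1:100) x <- d[500, j]
)
```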
On my machine, this shows the data frame to be ~50 times slower than the matrix.