Hello guys,
I want to compute correlations between OTUs (from metagenomic sequencing data). I am working in R and I would like to parallelize the computations because my data set is large.
What I want to do may be simple. I work on a cluster managed by Slurm (I have many CPUs and a lot of memory). I would like to cut my data set into subsets of increasing size (in increments of 500 OTUs, for example) and then compute the correlations between the OTUs within each subset.
The main goal is to fix a number of CPUs and an amount of RAM (for example 32 CPUs with 260 GB of RAM), run the correlation computations on my subsets (500 OTUs, 1000 OTUs, 1500 OTUs... up to 5000 OTUs) and look at the scalability. I will then vary the number of CPUs and the amount of RAM.
In the end, I want to be able to conclude: "with 32 CPUs and 260 GB of RAM, I can compute the correlations between X OTUs", "with 124 CPUs and 1 TB of RAM, I can compute the correlations between X OTUs", and so on.
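To measure this, my idea (a sketch with made-up names: n_otus, n_cores and scaling_results.tsv are placeholders) is to wrap the correlation step in system.time() and append one line per run to a results file, so runs from different Slurm jobs can be compared afterwards:

n_otus  <- 500
n_cores <- 32
x <- matrix(rnorm(100 * n_otus), ncol = n_otus)  # stand-in for the real data

# Record the elapsed wall-clock time of the correlation step
elapsed <- system.time(cor(x, method = "spearman"))["elapsed"]
cat(sprintf("%d\t%d\t%.2f\n", n_otus, n_cores, elapsed),
    file = "scaling_results.tsv", append = TRUE)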
My problem is that I don't really know how to organize my code, or how to use mclapply() (which is a function from the parallel package, not a package itself) to parallelize the computation.
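For context, mclapply() works like lapply() but runs the function on forked worker processes (so it works on Linux/macOS, not on Windows). A minimal sketch with toy data, just to show the calling convention:

library(parallel)

# Square eight numbers on four forked workers; returns a list, like lapply()
res <- mclapply(1:8, function(i) i^2, mc.cores = 4)
unlist(res)

Here is my current attempt: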
library(optparse)
library(parallel)
library(compositions)  # provides clr()

# get options
option_list = list(
make_option(c("-s", "--subset"), type="integer", default=NULL,
help="Number of OTUs to keep (first N rows of the input matrix)")
);
opt_parser = OptionParser(usage = "Usage: %prog -s [N_OTUS]", option_list=option_list,
description= "Description:")
opt = parse_args(opt_parser)

data=read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv", row.names=1, header=TRUE, sep="\t")
data=clr(data) # log-ratio transformation
data=data[1:opt$subset,] # keep the first opt$subset OTUs
data=t(data) # samples in rows, OTUs in columns

cores <- 32

# Split the OTU (column) indices into one chunk per core; each worker
# computes the correlations between its chunk of OTUs and all OTUs,
# then the pieces are reassembled into the full OTU x OTU matrix.
chunks <- splitIndices(ncol(data), cores)
cor_chunks <- mclapply(chunks,
function(idx) cor(data[, idx], data, method="spearman"),
mc.cores=cores)
parallel <- do.call(rbind, cor_chunks)

save(parallel, file=sprintf("/home/vipailler/PROJET_M2/data/parallel_%s.RData", opt$subset))
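Assuming the script is saved as correlations.R (the name is just an example), each Slurm job then runs it once per subset size, e.g. Rscript correlations.R --subset 500. Two design notes: mclapply() forks the current R session, so the workers share the input matrix copy-on-write rather than each receiving a full copy, which keeps the memory footprint close to the single-process case; and since splitIndices() returns contiguous, ordered chunks, do.call(rbind, ...) reassembles the rows of the correlation matrix in the original OTU order. (registerDoParallel() from doParallel is only needed for foreach()-based parallelism, not for mclapply().)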