LAST UPDATED: November 2nd, 2019
As R is becoming ever more used in bioinformatics, the need for parallel processing is greater due to the sheer amounts of data that is being produced.
This tutorial is a small introduction to how you can start to perform parallel processing in R.
1. set-up system for parallel processing
[a] load doParallel and associated packages
require(doParallel)
[b] set cores / threads
cores <- 12
Note: you can automatically detect the number of CPU cores with
cores <- makeCluster(detectCores(), type='PSOCK')
[c] detect system and then register the cores appropriately
if user is on Windows, cores / threads for certain functions have to be registered differently, i.e., as a cluster object that implements SNOW functionality. If on Mac / Linux, they are registered via mc.cores
and as a number of cores to registerDoParallel()
system <- Sys.info()['sysname']
cl <- NULL
if (system == 'Windows') {
cl <- makeCluster(getOption('cl.cores', cores))
registerDoParallel(cl)
registerDoSEQ()
on.exit(stopCluster(cl))
} else {
options('mc.cores' = cores)
registerDoParallel(cores)
}
[d] parallelise lapply()
If on Windows, use:
parLapply(cl, `)
If on Mac / Linux:
mclapply(`)
[e] parallelise a 'for' loop with foreach
foreach(i = 1:10000) %dopar% {
prcomp(matrix(rexp(200, rate = .1 * i), ncol = 200))$x
}
2. Practical examples
[a], regression analysis (any type of lm, glm, clogit, et cetera)
NB - code removed as this is now implemented as Bioconductor package:
[b], correlation analysis
(i) Generate 10000 x 2000 random data matrix
mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))
dim(mat)
[1] 2000 10000
mat[1:5,1:5]
Sample1 Sample2 Sample3 Sample4 Sample5
Gene1 0.2991302 16.2747625 12.0524225 50.1207356 14.180350
Gene2 9.5160190 5.1977788 0.2951713 8.6172124 4.022473
Gene3 4.1651056 0.5793664 9.3953203 0.8494895 19.985724
Gene4 8.5216108 15.1742012 7.2660707 7.8875059 6.572094
Gene5 8.9048996 4.7932319 1.6599490 1.2448652 25.325664
(ii) use foreach
to parallelise the correlation
res <- foreach(i = seq_len(ncol(mat)),
.combine = rbind,
.multicombine = TRUE,
.inorder = FALSE,
.packages = c('data.table', 'doParallel')) %dopar% {
cor(mat[,i], mat, method = 'pearson')
}
res[1:10,1:10]
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10
[1,] 1.000000000 -0.002229854 -0.019986425 0.012332275 -0.028186074 0.001989957 -0.023539602 0.019429899 0.018364991 -0.005468089
[2,] -0.002229854 1.000000000 -0.028642588 0.043449095 -0.042194471 0.022356480 0.033054025 -0.023498252 -0.032226185 -0.009321138
[3,] -0.019986425 -0.028642588 1.000000000 -0.027933870 0.006611151 0.047751182 -0.038504742 -0.015981761 -0.044780981 0.042516990
[4,] 0.012332275 0.043449095 -0.027933870 1.000000000 -0.019726310 -0.011895754 -0.004191525 0.039994518 0.011178578 -0.038820658
[5,] -0.028186074 -0.042194471 0.006611151 -0.019726310 1.000000000 0.010329296 -0.016766063 -0.004144883 0.032845623 -0.019617268
[6,] 0.001989957 0.022356480 0.047751182 -0.011895754 0.010329296 1.000000000 0.007669371 0.002309490 -0.010981713 -0.008976607
[7,] -0.023539602 0.033054025 -0.038504742 -0.004191525 -0.016766063 0.007669371 1.000000000 0.007287154 -0.021833100 0.004763376
[8,] 0.019429899 -0.023498252 -0.015981761 0.039994518 -0.004144883 0.002309490 0.007287154 1.000000000 0.018914405 -0.012672712
[9,] 0.018364991 -0.032226185 -0.044780981 0.011178578 0.032845623 -0.010981713 -0.021833100 0.018914405 1.000000000 0.004782734
[10,] -0.005468089 -0.009321138 0.042516990 -0.038820658 -0.019617268 -0.008976607 0.004763376 -0.012672712 0.004782734 1.000000000
(iii) check some results to ensure that we have done it right
table(round(cor(mat[,1:50]), 3) == round(res[1:50,1:50], 3))
TRUE
2500
table(round(cor(mat[,678:900]), 3) == round(res[678:900,678:900], 3))
TRUE
49729
table(cor(mat[,1], mat) == res[1,])
TRUE
10000
table(cor(mat[,6666], mat) == res[6666,])
TRUE
10000
[c], clusGapKB
This is an edit of the original clusGap, related to the Gap Statistic by Rob Tibshirani (https://statweb.stanford.edu/~gwalther/gap). The edited code is found on my GitHub account: https://github.com/kevinblighe/clusGapKB
Load testing airway data and process to rlog counts with DESeq2:
library(airway)
data('airway')
library("DESeq2")
dds <- DESeqDataSet(airway, design = ~ cell + dex)
dds <- DESeq(dds, betaPrior = FALSE)
rlogcounts <- rlog(dds, blind=TRUE) # or vstcounts <- vst(dds, blind=TRUE)
rlogcounts <- assay(rlogcounts)[1:5000,]
Run clusGapKB with 12 cores:
source("clusGapKB.R") # get from GitHub
gap <- clusGapKB(rlogcounts, FUNcluster=kmeans, K.max=20, B=250)
I learnt the
parallel
+mclapply
trick a few months ago, really useful in a few niche cases IMO.@Kevin Im using your enhanced function works pretty cool..
Good work, Krushnach! Which function in particular?
both clusgap and the correlation matrix calculation
Great. I have removed the correlation matrix function, though. It is 'on the shelf' for improvement.
Krushnach, I have re-added a correlation function that works more efficiently.
Great help thanks, Will gonna try this today.
Thanks. Suppose I was writing a script / package for which I was uncertain of user's OS beforehand. Is there a wrapper package that can choose which parallelisation tool to call?
Hey russhh, yes, I do that in my package RegParallel
Basically, you just need the code from part 1(abc) (above) in order to register cores / threads on any system. The OS type will then be stored in the variable
system
After that,
foreach
will work on any system. For parallelisedlapply
, you'll need to distinguish based on whether one is using Windows or not: