edgeR / deSeq parallel usage
1
1
Entering edit mode
10.2 years ago
russhh 5.7k

Hi.

I'm using edgeR for differential expression analysis of RNASeq data (counts downloaded from TCGA).

The workflow I'm using follows the GLM functionality of edgeR, as described in the vignette for edgeR.

dge <- DGEList(...)

design <- model.matrix(...) # non-singular

y <- estimateGLMCommonDisp(dge, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = ncol(design) + c(-1,0))

Fitting the model, as above, can take ages on our server - seemingly because the dataset is pretty large, the design matrix is pretty large etc etc AND, only a single processor is used by edgeR.

I can't find any options within edgeR that can take advantage of parallel processing and was wondering whether any of the alternative RNASEq expression packages, such as DESeq, had this capability

All the best

Russ

R edgeR RNA-Seq • 6.9k views
ADD COMMENT
1
Entering edit mode

DESeq2 doesn't have any option for parallelism either. You could add this though (that's actually how DEXseq's parallelism works). You could also do the same with edgeR, though.

BTW, TCGA data is often from RSEM, which won't be integers. In that case, use limma/voom instead.

ADD REPLY
0
Entering edit mode

The data I'm using is count based, it's a release from a while ago; the more recent lung RNASeq data is 'normalised' which basically destroys my chances of using the most appropriate tools for RNASeq data (the newer gene-level release even mocks me with a 'raw counts' column... of floats; though I believe you can munge the exon-level data into gene-level counts).

ADD REPLY
2
Entering edit mode

Consider using limma::voom even for the "real count" based data, though -- especially on large datasets. The "voom step" is lightening fast as compared to the "estimate dispersion" steps that both edgeR and DESeq(2) have to run through ... this depends on the size of the dataset, but we're talking less than 1 minute for the entirety of a limma::voom pipeline (variance estimation through lmFit and testing contrasts) to several (3+) hours for just estimating the dispersion (with DESeq2, at least -- I suspect edgeR would be close to the same speed).

ADD REPLY
0
Entering edit mode

Thanks both for recommending limma/voom. I think this is probably the best way to go.

I attempted to speed edgeR up by splitting the dataset, fitting the dispersions in each split and then merging the data back together. I suspected this was flawed but tried it anyway and, though it was faster, the trended dispersions were quite different within the tails of average log CPM (way more than I would have expected).

So limma / voom looks like a really good option for me. Regards, Russ

ADD REPLY
0
Entering edit mode

Just checking :) There have been a lot of posts both here and elsewhere with people trying to use "raw counts" that turn out to not be so raw.

ADD REPLY
0
Entering edit mode

Thanks, looks like my MPI/lapack setup is flawed (though I intiially thought I could alter the R source to use mclapply or similar)

ADD REPLY
0
Entering edit mode

You should be able to just use bplapply (see the BiocParallel package), though you'll need to split things on your own. Keep in mind that this will mostly only work with the variance estimation, though that's probably the bottleneck anyway.

ADD REPLY
0
Entering edit mode

(the newer gene-level release even mocks me with a 'raw counts' column... of floats; though I believe you can munge the exon-level data into gene-level counts).

Based on some detective work, those floating point "raw_counts" in the MAGE-TAB export appear to be the estimated count output from RSEM. RSEM does not simply output the number of fragments (equal to read count if single-end seq) that overlap with a gene by some definition, but includes an estimation of the contribution from reads that map to multiple locations using an Expectation-Maximization (EM) algorithm.

The author of RSEM suggests that these estimated counts can be used in methods expecting read counts (such as edgeR) by rounding them to the nearest integer value. I suspect this rescue method might work in many cases, but add substantial noise to some genes.

ADD REPLY
0
Entering edit mode

It should be noted that the authors of tools like edgeR typically prefer people not do that and instead use things like limma::voom (I can't count how many times I've seen Gordon Smyth write that).

ADD REPLY
0
Entering edit mode

maybe you could first exclude rows that have too low read counts in order to reduce amount of data

ADD REPLY
5
Entering edit mode
10.2 years ago
Michael Love ★ 2.6k

I'm currently working on a parallel wrapper for DESeq2. The dispersion estimation for parallel execution is easy, but I need a bit more time to restructure for the LFC calculation.

As Steve mentions, limma is very fast when you have many 100s of samples as in TCGA.

ADD COMMENT
3
Entering edit mode

In DESeq2 version >= 1.5.54 you can do:

library(BiocParallel)
register(MulticoreParam(10))
DESeq(dds, parallel=10)

See the BiocParallel package for the backend options. More information under ?DESeq.

ADD REPLY

Login before adding your answer.

Traffic: 2064 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6