Hi.
I'm using edgeR for differential expression analysis of RNASeq data (counts downloaded from TCGA).
The workflow I'm using follows the GLM functionality of edgeR, as described in the vignette for edgeR.
dge <- DGEList(...)                                 # counts + sample information
design <- model.matrix(...)                         # non-singular
y <- estimateGLMCommonDisp(dge, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)              # GLM tagwise dispersions
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = ncol(design) + c(-1, 0))  # test the last two coefficients
Fitting the model as above can take ages on our server, seemingly because both the dataset and the design matrix are large, and because only a single processor is used by edgeR.
I can't find any options within edgeR that take advantage of parallel processing, and I was wondering whether any of the alternative RNASeq expression packages, such as DESeq, have this capability.
All the best
Russ
DESeq2 doesn't have any option for parallelism either. You could add it yourself, though (that's actually how DEXSeq's parallelism works), and you could do the same with edgeR.
BTW, TCGA data is often from RSEM, so the values won't be integers. In that case, use limma/voom instead.
The data I'm using is count-based; it's a release from a while ago. The more recent lung RNASeq data is 'normalised', which basically destroys my chances of using the most appropriate tools for RNASeq data (the newer gene-level release even mocks me with a 'raw counts' column... of floats; though I believe you can munge the exon-level data into gene-level counts).
Consider using limma::voom even for the "real count" based data, though, especially on large datasets. The "voom step" is lightning fast compared to the "estimate dispersion" steps that both edgeR and DESeq(2) have to run through. It depends on the size of the dataset, but we're talking less than a minute for the entire limma::voom pipeline (variance estimation through lmFit and testing contrasts) versus several (3+) hours just for estimating the dispersion (with DESeq2, at least -- I suspect edgeR would be close to the same speed).
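In case it's useful, here is a minimal sketch of what I mean by the limma::voom pipeline, assuming the same 'dge' and 'design' objects from the edgeR code above (the coefficients passed to topTable simply mirror the glmLRT call and may need adjusting for your design):

library(limma)
library(edgeR)
dge <- calcNormFactors(dge)                      # TMM normalisation factors
v <- voom(dge, design)                           # estimate mean-variance weights
fit <- lmFit(v, design)                          # per-gene linear models
fit <- eBayes(fit)                               # moderated statistics
topTable(fit, coef = ncol(design) + c(-1, 0))    # same coefficients as the glmLRT call above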
Thanks both for recommending limma/voom. I think this is probably the best way to go.
I attempted to speed edgeR up by splitting the dataset, fitting the dispersions in each split, and then merging the data back together. I suspected this was flawed but tried it anyway; although it was faster, the trended dispersions were quite different in the tails of average logCPM (way more than I would have expected).
So limma / voom looks like a really good option for me. Regards, Russ
Just checking :) There have been a lot of posts both here and elsewhere with people trying to use "raw counts" that turn out to not be so raw.
Thanks, it looks like my MPI/LAPACK setup is flawed (though I initially thought I could alter the R source to use mclapply or similar).
You should be able to just use bplapply (see the BiocParallel package), though you'll need to split things on your own. Keep in mind that this will mostly only help with the variance estimation, but that's probably the bottleneck anyway.
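Something along these lines, as a very rough sketch: 'dge' and 'design' are the objects from the original post, the number of chunks and workers is arbitrary, and merging the per-chunk fits back together (plus the trended-dispersion caveat Russ mentioned above) is left to you:

library(BiocParallel)
library(edgeR)

# split the genes into 4 contiguous chunks (purely illustrative)
chunks <- split(seq_len(nrow(dge)), cut(seq_len(nrow(dge)), 4))

fits <- bplapply(chunks, function(idx) {
    sub <- dge[idx, ]                        # row-subset of the DGEList
    sub <- estimateGLMCommonDisp(sub, design)
    sub <- estimateGLMTrendedDisp(sub, design)
    sub <- estimateGLMTagwiseDisp(sub, design)
    glmFit(sub, design)                      # per-chunk GLM fits, to be merged by you
}, BPPARAM = MulticoreParam(workers = 4))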
Based on some detective work, those floating-point "raw_counts" in the MAGE-TAB export appear to be the estimated count output from RSEM. RSEM does not simply report the number of fragments (equal to the read count for single-end sequencing) that overlap a gene by some definition; it also estimates the contribution of reads that map to multiple locations using an Expectation-Maximization (EM) algorithm.
The author of RSEM suggests that these estimated counts can be used with methods that expect read counts (such as edgeR) by rounding them to the nearest integer. I suspect this rescue might work in many cases but could add substantial noise for some genes.
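For what it's worth, the rounding rescue amounts to something like this, where 'rsem_counts' is a hypothetical matrix holding the floating-point "raw_counts" values:

library(edgeR)
# 'rsem_counts' is a hypothetical matrix of RSEM estimated counts (floats)
dge <- DGEList(counts = round(rsem_counts))  # round to nearest integer before edgeR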
It should be noted that the authors of tools like edgeR typically prefer that people not do that and instead use things like limma::voom (I can't count how many times I've seen Gordon Smyth write that).
Maybe you could first exclude rows with very low read counts, in order to reduce the amount of data.
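For example, the usual CPM-based filter from the edgeR documentation; 'dge' is the DGEList from the original post, and the thresholds here are arbitrary (pick the sample cutoff to match your smallest group):

library(edgeR)
# keep genes with > 1 count per million in at least 3 samples
keep <- rowSums(cpm(dge) > 1) >= 3
dge <- dge[keep, ]
dge$samples$lib.size <- colSums(dge$counts)  # refresh library sizes after filtering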