I'm using edgeR for DE analysis of RNASeq data. I've been using manual filtering suggested in the EdgeR manual till now.
I followed genefilter package and tried to chose a theta that has the optimal number of rejections. Basically this is what I did: run edgeR once for the data set, then use the p-value results for genefilter calculations and plots.
dge <- DGEList(counts=counts[,!is.na(group)], group=group[!is.na(group)])
d <- dge
d <- calcNormFactors(d)
d <- estimateDisp(d, trend="none", robust=TRUE)
et <- exactTest(d, pair = levels(d$samples$group))
dgeTable = topTags(et, n=nrow(d))
res = data.frame(
filterstat = rowSums(d$counts[rownames(dgeTable),]),
pvalue = as.data.frame(dgeTable)$PValue,
row.names = rownames(dgeTable))
rejBH = filtered_R(alpha=alpha, filter=res$filterstat, test=res$pvalue, theta=theta, method="BH")
theta_chosen = theta[max(which(rejBH==max(rejBH)))]
use = res$filterstat > quantile(res$filterstat, probs=theta_chosen)
dgeFilt=dge[use,]
d <- dgeFilt
I used rowSums as filtering criteria, and the after I filter out the genes - in this case I got theta = 0.19, I ran the edgeR exact test again:
### rerun the model
d <- dgeFilt
d <- calcNormFactors(d)
d <- estimateDisp(d, trend="none", robust=TRUE)
et <- exactTest(d, pair = levels(d$samples$group))
My question is - even though the previous calculated max(rejBH) - the max number of rejections is 102 , using the same alpha (0.1) I'm only getting 60 DE genes after applying the filter. Actually, without applying any filter, the number of DE genes is 75 (this matches rejBH at 0%).
- Why is the new DE genes number not 102?
- Why am I getting less DE genes even though from the rejection plots it looks like I'll have the max number of rejections at theta = 0.19?
Am I doing anything wrong here or did I misunderstand something?
Thanks in advance for any suggestion!
Thank you!!
Hi Devon, this is probably kind of a stupid question but I was wondering - wouldn't not rerunning the model kind of like "cheating" ? I guess my original thought was that we filter out the genes that do not pass the filter and just run the model as if they weren't there? My understanding was that when using manual filtering suggested in the edgeR manual, the model was fit after the genes were filtered?
Thank you!
That's not at all a stupid question! Firstly, have a read through the Bourgon et al. paper that layed out the theory behind all of this (they're using microarrays, but it's the same in RNAseq). The group that wrote that went on the make the genefilter Bioconductor package that DESeq2 (also written by some of those authors) uses to do this.
In short, just because we don't have enough signal from a gene to bother testing it for DE doesn't mean that it's not useful for (A) library normalization or (B) variance shrinkage. Yes, the method used in the edgeR vignette prefilters by count (actually, cpm, if I recall correctly) and then fits the remainder. If you search the bioconductor archive, you'll find at least one discussion between the edgeR and DESeq2/genefilter authors about this.
Thanks a lot Devon!