When I filter my count data with the code in the user guide, the FDR for all my genes drops to 1.0. But, if I lower the CPM cutoff then my FDR creeps up and start to get significant DE genes. The manual states that the CPM is a user preference. And it would seem to be influenced by the library size and number of genes. How does one go about picking the correct cutoff without bias.
My experiment is in Xenopus Laevis, where I have three conditions and four biological replicates. Each sample has a mean count of 8809887.
Running with Filtering
dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
keep <- rowSums(cpm(dge)>1) >= 2
dge <- dge[keep, , keep.lib.sizes=FALSE]
design=model.matrix(~Rep+Ploidy, data=edge_meta)
design
dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)
Coefficient: Ploidy3
logFC logCPM F PValue FDR
gene41545 2.0965805 2.448327 35.90115 4.746371e-05 0.9999619
gene49995 -0.7932641 4.473388 26.52267 1.941914e-04 0.9999619
gene15477 2.8440399 4.021164 26.31673 2.010534e-04 0.9999619
gene4260 1.1824923 4.668519 22.27339 4.144636e-04 0.9999619
gene50916 1.1398852 2.681193 19.93906 6.556942e-04 0.9999619
gene16190 1.6979906 5.217549 19.88785 6.625808e-04 0.9999619
gene16461 1.4116720 3.029733 17.98760 9.888933e-04 0.9999619
gene30142 2.2028513 1.304547 17.88017 1.012326e-03 0.9999619
gene9327 1.6603994 3.380958 17.84182 1.020847e-03 0.9999619
gene41018 0.5355154 4.970458 16.86752 1.267972e-03 0.9999619
Running w/o filtering
dge=DGEList(edge_data, group = edge_meta$group, genes = column_to_rownames(edge_anno, 'ID'))
#keep <- rowSums(cpm(dge)>1) >= 2
#dge <- dge[keep, , keep.lib.sizes=FALSE]
design=model.matrix(~Rep+Ploidy, data=edge_meta)
design
dge=calcNormFactors(dge)
dge=estimateDisp(dge,design)
plotBCV(dge)
fit=glmQLFit(dge,design)
qlf=glmQLFTest(fit, coef=6)
topTags(qlf)
Coefficient: Ploidy3
logFC logCPM F PValue FDR
gene15477 2.8412637 4.012968 147.49207 6.236814e-34 3.189070e-29
gene16190 1.6964975 5.212556 109.25227 1.443137e-25 3.689595e-21
gene42380 1.7616811 4.875927 101.21749 8.312275e-24 1.416772e-19
gene18119 1.8589354 4.609218 100.35729 1.283117e-23 1.640241e-19
gene14278 1.2991257 6.502732 99.87230 1.638995e-23 1.676135e-19
gene45604 1.0203924 7.505369 72.36337 1.797921e-17 1.532218e-13
gene4166 0.9549254 7.469934 61.14997 5.305508e-15 3.875522e-11
gene46483 0.8449512 9.177001 60.20210 8.586239e-15 5.488002e-11
gene18489 2.2123988 2.705238 56.55142 5.489566e-14 3.118867e-10
gene13964 1.3560655 4.548504 55.43617 9.679432e-14 4.949384e-10
Gene CPM in each library
The edgeR authors answer questions promptly on Bioconductor and prefer that the same question not to be cross-posted to other forums at the same time, because it takes more of our time to monitor and because questions may get left hanging. I answered your question on Bioconductor about an hour after you asked it.
If you find you do need to cross-post, please explain that as part of your question, otherwise people who take the time to answer your question may find afterwards that they've wasted their time.