Hi all,
I am doing a RNAseq analysis for DE genes. In my initial analysis, using the arbitrary cutoff of 1 CPM for filtering out lowly expressed genes, I found ~20 DE genes (i.e. genes with FDR = p.adjust(method="BH") < 0.05). I have been working to understand the BH correction, and in my exploration, have generated the following plot from my filtered dataset (retaining 14,000 genes after filtering):
Am I correct in understanding that this histogram implies that I have upwards of 1,000 DE genes? I am making this assumption by considering those counts over 500 for the two bins corresponding to p<=0.05, since the mean counts of the bins with p > 0.50 appears to be ~400 (so, selecting 500 as a conservative upper limit).
I understand that the BH adjustment must lose some true positive discoveries in its aims to bring the false positive rate to the advertised 5%, as well as the inverse dependence of the number of true positives lost with the overall number of comparisons made for large n. As such, I sought to emulate the DEseq2 method of setting the cutoff CPM value used for filtering so as to maximize the number of results with FDR<=0.05. I accomplished this with some ugly but effective 'for' loops, iterating through a range of cutoff CPM values and tracking the number of FDR<=0.05 results for each. When I did this around the range of cutoff CPM values suggested in the edgeR documentation (the recommendation being to filter out genes with less than 10-15 counts per sample, which for my samples corresponds to 0.7-1.1 CPM), I achieved the following results:
I repeated this using a broader range of CPM values (my computer was not pleased to say the least), and was able to find that at Cutoff CPM = 18 I can get ~35 DE gene with FDR<=0.05. I am worried, however, that this cutoff CPM is too high and I may be missing out on some lowly, but differentially, expressed genes (I have biological reasons to believe this after examining these results, also)
In my research on finding appropriate filtering parameters in RNAseq analysis, I stumbled upon this paper on independent hypothesis weighting (IHW): https://www.nature.com/articles/nmeth.3885#MOESM282 . Considering the BH adjustment to possibly be incompatible with my data (if so, why?) I re-did my analysis with the IHW method as detailed in this paper. With no prior filtering I found 95 DE genes with adjusted p value <= 0.05, but I was unsure if filtering was necessary or not before such an IHW analysis (my logic was that the hypothesis weighting would filter out lowly expressed samples as needed).
I now feel as though I may be seeking a method that gives me the results that I want rather than using a predetermined, robust method, which I know to be a dangerous statistical practice (though this being my first such analysis makes this feel at least partially inevitable).
That said, why is the number of FDR<=0.05 genes so significantly less than the p-value histogram would suggest? Is this simply the nature of high-throughput/multiple-comparison testing and I am misunderstanding the math behind it? Or am I doing something wrong in my analysis?
If anybody could help me better understand this process or point me in the direction of papers, books, or any other resources to help my understanding I would be greatly appreciative!
Best,
Brian
Make your life easy and use the
edgeR
filtering functionfilterByExpr
which automates this process. Why is FDR >>> p-value, yes because the multiple testing burden is large when testing thousands of genes.I have also experimented with the
filterByExpr()
function but was unaware that it automated this process. Does it automate when you don't pass in any arguments formin.count
andmin.total.count
? Because I had always been supplying my own values for those two parameters. Thank you for your response, that it reassuring to hear that this may just be the nature of such analyses.This filter has defaults that apply if you do not set a value. The only important thing is that you either give it
group
or the design matrix so it is aware of your design.That's great to hear, thank you!
The default settings for filterByExpr are specifically designed to keep as many genes in the analysis as possible without including genes with a very low chance of being DE.
propTrueNull(p.values)
allows you to estimate the number of false negatives.I never knew of that function, thank you for that! When I run that on my filtered data, I get the following:
Does this mean that 77% of my total genes (n=14,000) are truly not DE, or that 77% of my genes with p less than some cutoff are in reality not DE? I appreciate your help!
It means that 22% of all genes (22% of 14,000) are truly DE.
Note that this is a global estimate and it does not tell you which of the genes are DE. You cannot for example choose the 22% of genes with smallest p-values and claim them as DE. Genewise statistical significance can never give DE counts as high as that returned by propTrueNull because statistical testing has to control the FDR rate. Many genes may be DE to a small extent (hence included in propTrueNull's count) but with differences too small to detect at any worthwhile level of statistical significant in individual genewise testing (and hence not with a small FDR value).