Hello, thanks for hearing out my question. I am currently writing my master's thesis and am trying to recreate a previous RNA-seq analysis.
My design is simply ~condition. I have 8 samples from 2 conditions (radiation, no radiation), 4 each. I used the pre-filter
keep <- rowSums(counts(dds) >= 10) >= 8
dds <- dds[keep,]
so that only genes with at least 10 reads in every sample remain in the analysis.
Now the question is: how do I filter my results? I know that the cutoffs used were a log2 fold change of 1 or greater and a padj cutoff of 0.05.
I used
res_filt = na.omit(res)
res_filt_upregulated = res_filt[res_filt$padj < 0.05 & res_filt$log2FoldChange > 1 ,]
res_filt_downregulated = res_filt[res_filt$padj < 0.05 & res_filt$log2FoldChange < -1 ,]
But to my understanding, the results function can itself apply these filters: with alpha = 0.05 and lfcThreshold = 1 I should get the same result. But I don't. First of all, whether or not I apply any filter, the length of the results table stays the same, but the padj values change. And if I look at the results using the summary function, it shows only 54 genes instead of the 166 I find using the other filter.
res_mod = results(dds, lfcThreshold = 1, alpha= 0.05 )
I guess I just don't understand what exactly the filtering ability of the results function does and how to use it. I would be very grateful if someone would enlighten me.
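One way to see why the two approaches give different gene counts: filtering on log2FoldChange after the fact still uses p-values from a test against a fold change of zero, whereas lfcThreshold = 1 tests against the null hypothesis that the absolute log2 fold change is at most 1. Below is a minimal base-R sketch of that idea using a simplified Wald-type test. The gene names, fold changes and standard errors are made up for illustration, and this is not DESeq2's internal code.

```r
# Illustrative numbers only (assumed, not taken from the poster's data)
lfc <- c(geneA = 1.2, geneB = 3.0)   # estimated log2 fold changes
se  <- c(geneA = 0.3, geneB = 0.3)   # their standard errors

# Null hypothesis: LFC = 0 (the default test behind a plain results(dds))
p_zero <- 2 * pnorm(abs(lfc) / se, lower.tail = FALSE)

# Null hypothesis: |LFC| <= 1 (the idea behind lfcThreshold = 1)
threshold <- 1
p_thresh <- pmin(1, 2 * pnorm((abs(lfc) - threshold) / se, lower.tail = FALSE))

print(signif(cbind(lfc, p_zero, p_thresh), 3))
```

geneA has a fold change above 1 and a small p-value against zero, so it passes the post-hoc filter, but its estimate is too close to 1 to reject the shifted null hypothesis. geneB is far enough above 1 to pass both. That kind of gene is what separates the longer list from the shorter one.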
Ah, thank you, that makes a lot more sense :) The lfcThreshold is basically shifting my null hypothesis.
Then one more question: what exactly does the alpha value do, and how does it affect the other values?
See this other answer by the authors. After testing your genes and getting p-values, DESeq2 performs a step called "independent filtering". Its goal is to remove genes with very low counts, because these usually have little power to be detected as significant in the first place due to their high dispersion (they are removed without looking at their p-values). Removing them leaves fewer tests, which benefits the multiple-testing correction for the remaining genes. So if you change alpha, you change the final set of genes considered for the multiple-testing correction, and the adjusted p-values change.
The alpha argument of results controls how this procedure is carried out, and it should be set to the threshold you intend to use on your adjusted p-values (e.g. 0.05). In layman's terms, independent filtering finds a minimum expression threshold such that removing the genes below it gives you the most significant results overall, because the total number of tests is reduced (check out the vignette for a better explanation). You may lose a few significant genes with low expression, but since most low-expression genes won't be significant anyway, removing them lets more of the remaining genes reach significance, so you end up with more significant results in total. The alpha is needed because the function has to know which genes you are going to call significant, so that it can count them while optimizing the mean-expression filtering threshold.
Thanks a lot :)
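To make the role of alpha concrete, here is a small base-R sketch of the idea described above. It is a simplified stand-in, not DESeq2's actual algorithm: the function filter_and_adjust, the toy p-values and the candidate quantiles are all made up for illustration. It tries a few mean-count cutoffs, keeps the lowest one that gives the most genes below alpha after Benjamini-Hochberg correction, and sets padj to NA for the filtered genes.

```r
# Toy data (made up): 5 well-expressed genes with small p-values,
# 15 low-count genes with large p-values.
base_mean <- c(100, 200, 300, 400, 500, 1:15)
pvalue    <- c(0.001, 0.004, 0.012, 0.022, 0.03,
               seq(0.1, 0.95, length.out = 15))

# Hypothetical helper: a simplified version of the independent filtering idea.
filter_and_adjust <- function(pvalue, base_mean, alpha,
                              thetas = c(0, 0.25, 0.5, 0.75)) {
  cutoffs <- quantile(base_mean, thetas)       # candidate mean-count cutoffs
  n_rej <- sapply(cutoffs, function(cut) {
    keep <- base_mean >= cut
    sum(p.adjust(pvalue[keep], method = "BH") < alpha)
  })
  best <- which.max(n_rej)                     # lowest cutoff reaching the maximum
  keep <- base_mean >= cutoffs[best]
  padj <- rep(NA_real_, length(pvalue))        # filtered genes stay in the table as NA
  padj[keep] <- p.adjust(pvalue[keep], method = "BH")
  list(theta = thetas[best], padj = padj)
}

res05 <- filter_and_adjust(pvalue, base_mean, alpha = 0.05)
res10 <- filter_and_adjust(pvalue, base_mean, alpha = 0.10)

print(c(theta_alpha_0.05 = res05$theta, theta_alpha_0.10 = res10$theta))
print(head(cbind(pvalue, padj_0.05 = res05$padj, padj_0.10 = res10$padj), 5))
```

In this toy example the two alpha values lead to different cutoffs, so the same raw p-values get different adjusted p-values, while the table keeps all 20 rows. That matches the observation in the question that the table length stays the same but padj changes.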