I had been doing analysis using DESeq2 such that I would get the results from DESeq2, do a FDR (adj p-value) cutoff of 0.05 and then from that resulting list do a log(2) cutoff of greater than 1 and less than -1. I will call this Method A, 'sequential cutoffs'
I have come to realize that the final list of genes obtained from this 'sequential cutoffs' way is not something that can be labelled as FDR 0.05, log(2) 1. Pardon the gross statistical ignorance, but this is not readily intuitive to a biochemist. If I want a list of genes that can be labelled as FDR 0.05, log(2)1, I now use lfcThreshold of 1 in my DESeq2 command and from those results do a FDR cutoff of 0.05. I will call this Method B.
However, in a comparison of gene lists obtained from a number of samples analyzed using each of the above mentioned ways (Method A and Method B), I found that using the lfcThreshold in the command always results in a smaller gene list than when the lfcThreshold is not used but two sequential cutoffs (FDR 0.05 followed by a log(2) 1 cutoff) is used.
Intuitively I would expect both methods to provide equal gene lists. Being a bit more statistically savvy would obviously be helpful to me, so I was looking for an explanation as to why Method B (using lfcThreshold) results in a smaller number of genes than Method A.
In general, user is recommended to list top significant genes either by fold change (in this case lfc- log fold change) or by statistical significance (P/FDR/Q/Adjusted p). In method A above, genes are first filtered by statistical significance, then by FC.
When one filters by statistical significance, resultant genes may or may not be under significant fold change filter. For eg if top 100 genes may be statistically significant, but most of them show expression levels (comparative) between 1 to -1, then the filter applied above will yield zero.
On the other hand if one filters by fold change first (method B), the top 100 expressed genes may or may not be statistically significant.
Both lists are different.
In case of B, it seems (to me) that gene list under FC (lfc here) selection have lesser statistically significance which is why most of the genes are getting filtered out.
For A, list seem to have statistically significant genes with above cutoff expression levels.
I am not sure I understand. In my thinking, to take from your example in which all of the statistically significant genes (ssg), p-adj 0.05 or less, had a log(2) of between 1 and -1, then in Method A the ssg make it thru the initial FDR 0.05 cutoff while all the log(2) 1 genes are deleted in this step because their p-adj is greater then 0.05. However, the ssg are then deleted in the second log(2) filtering step. The final gene list would have 0 genes.
In Method B, in this particular case where all statistically significant genes (p-adj 0.05 or less) have a log(2) of between 1 and -1, then none of the ssg would appear in the DESeq2 output because lfcThreshold 1 was used, and the output would contain no genes with a p-adj of 0.05 or less. So, with Method B the final gene list would also have 0 genes.
But, this does get me thinking that p-adj may be calculated from all genes in Method A, but only from genes with log(2) greater than 1, less than -1 in Method B. The p-adj then being calculated from a smaller list of genes in Method B. But why would calculation of p-adj from a smaller list result in a smaller list of genes in Method B ?
well, I was trying to convey that all statistically significant genes are not expression significant (lfc cut off) and all expression significant (in this case lfc cutoff) are not statistically significant. Both are two different lists. Example I gave applies only Method A and I was saying that statistically significance doesn't mean expression significance (in this case cut off).
Btw, could you please post DESeq code for filtering, for both methods A and B?
For Method A, the DESeq2 command is:
The results of above are written into a tab-delimited .txt file and all genes with a FDR > 0.05 are removed. With the remaining genes, all genes with a log(2) of > 1 or < -1 are then removed.
For Method B, the DESeq2 command is:
The results of above are written into a tab-delimited .txt file and all genes with an FDR > 0.05 are removed.
As written, the log2 fold-change threshold that you're using in Method B is 0. Try lfcThreshold=1.