Hi,
I'm carrying out DE analysis on RNASeq data in DESeq2. I have a question about the histogram plots I used to plot the p-adj values of my results. I'm referring to this post; http://varianceexplained.org/statistics/interpreting-pvalue-histogram/ to best explain my histogram plots!
I have two sets of results; X and Y. Result X was calculated first with no log fold threshold and padj < 0.05. When I plotted the padj values of Result X on a histogram I got an Anti-conservative histogram (similar to A in the website I'm referring to) with a large number of padj values < 0.05.
Result X resulted in a large number of DE genes so I decided to switch to Result Y where a log fold threshold of 1 was applied and then an FDR <0.05. I now have a set of genes that is more managable to do DE analysis on, however my concern with Result Y is that when I plotted the histogram of the padj values it resulted in a conservative plot (similar to D in the website) with a larger number of padj values = 1 than those < 0.05.
My question is; is histogram Y a concern? Going by the post attached is there something wrong with my dataset or as he mentioned is applying a log fold threshold a correction method? As long as histogram X is ok then my data is good to do statistical analysis on? Also I'm trying to understand what's happening the padj values in Result Y when the log fold threshold is applied? Is it the genes excluded from the data because of the threshold and they have no padj value assigned?
EDIT:
I get the same histogram when I plot the p-values of both results.
I'm count data is stored in the "transcript" dataframe under 2 conditions(pre and post) and 3 replicates per group. Design matrix is stored in the "coldata" dataframe.
Here's the code:
dim(transcripts)
[1] 4874 6
head(coldata)
condition
T1_pre pre
T1_post post
T2_pre pre
T2_post post
T3_pre pre
T3_post post
dds <- DESeqDataSetFromMatrix(countData = transcripts,
colData = coldata,
design = ~ condition)
dds <- DESeq(dds)
#Result X
res <- results(dds)
> table(res$padj < 0.05)
FALSE TRUE
1095 3773
#Result Y
resLFC <- results(dds, lfcThreshold = 1)
> table(resLFC$padj < 0.05)
FALSE TRUE
3991 877
hist( res$pvalue, breaks=20, col="grey" )
hist( resLFC$pvalue, breaks=20, col="grey" )
Histogram of results no log fold threshold:
Histogram of results log fold threshold < 1:
You should plot the nominal p-values histogram ; not the adjusted ones. Also could you describe the different commands you used to build your results i.e. the DESeq2 commands and also define the number of samples, number of groups, experimental design, etc...
To quote the linked page:
Please do not add answers unless you're answering the top level question. You can edit your post and add these details. Your images are also added improperly. I'll clean up your post, but please put some more effort in the future.
How to add images to a Biostars post
I was checking the website that you provided. This might be helpful: "Don’t apply false discovery rate control to these p-values yet. (Why not? Because some kinds of FDR control are based on the assumption that your p-values near 1 are uniform. If you break this assumption, you’ll get way fewer significant hypotheses. Everyone loses)."