5.5 years ago
2nelly
Hi all,
I was wondering what people normally use as input for GO analysis (i.e. keggres) of RNA-seq data after producing the DESeq2 output:
-- the raw DESeq2 output with all genes?
-- the output with NA genes filtered out?
-- the output filtered by p.adj value?
I noticed that the results are quite different, which is of course expected since the size of the input differs.
Thank you in advance.
One should use significant genes (by whatever definition fits your scientific question, typically FDR < 0.05).
Yes, normally I filter the output by FDR and log2FC and then I use it for GO annotation.
However, I am wondering whether this approach is correct, or whether I should filter only by FDR as you mentioned.
I am sceptical because every time I followed the first approach and ran GO analysis on RNA-seq data, the q-values for the pathways were extremely high. Biologically the output makes sense, but how can I support the findings with such high q-values (>0.9)?
With the second approach things improved a bit, but only for the p-values; the q-values remained high.
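To make the two filtering approaches above concrete, here is a minimal sketch in plain Python. The column names (baseMean, log2FoldChange, padj) follow the DESeq2 results table, but the rows, the gene names, the function name and the cutoffs are made up for illustration. Genes whose padj is NA (which DESeq2 assigns, for example, to genes removed by independent filtering) are dropped in both cases.

```python
# Hypothetical rows mimicking a DESeq2 results table; values are made up.
results = [
    {"gene": "geneA", "baseMean": 850.0, "log2FoldChange": 2.1, "padj": 0.001},
    {"gene": "geneB", "baseMean": 420.0, "log2FoldChange": 0.3, "padj": 0.02},
    {"gene": "geneC", "baseMean": 3.0, "log2FoldChange": 1.8, "padj": None},  # NA
    {"gene": "geneD", "baseMean": 600.0, "log2FoldChange": -1.5, "padj": 0.04},
    {"gene": "geneE", "baseMean": 300.0, "log2FoldChange": 0.9, "padj": 0.60},
]


def significant_genes(rows, padj_cutoff=0.05, lfc_cutoff=0.0):
    """Return gene IDs passing the FDR cutoff and an optional |log2FC| cutoff."""
    keep = []
    for row in rows:
        if row["padj"] is None:  # NA padj: cannot be called significant
            continue
        if row["padj"] < padj_cutoff and abs(row["log2FoldChange"]) >= lfc_cutoff:
            keep.append(row["gene"])
    return keep


fdr_only = significant_genes(results)                    # FDR filter only
fdr_and_lfc = significant_genes(results, lfc_cutoff=1.0)  # FDR plus |log2FC| >= 1
print(fdr_only)
print(fdr_and_lfc)
```

The two lists differ only in the genes with small fold changes, so the extra log2FC cutoff shrinks the input list without changing how the enrichment test itself works.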
The list of background genes is also highly important. I would use genes with a high enough baseMean, perhaps above the threshold at which you start observing significant genes.
That can be true. However, filtering for FDR < 0.05 gives mostly genes with high baseMean, so this approach does not help much after FDR filtering.
That's your list. I was talking about the background list of genes. Usually it's all the genes of the organism, and then you'll get enrichment for the tissue your samples are from.
If I understand correctly, you suggest considering all genes with high baseMean, even those that are not significant (FDR > 0.05 or 0.1).
Only as a background. You compare the genes with FDR < 0.05 to the rest of the genes that have high enough expression.
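The comparison described above can be sketched as a one-sided hypergeometric (Fisher-type) test for a single GO term. This is only an illustration under stated assumptions: the function name and all counts are hypothetical, and tools such as DAVID use their own variants of the test.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k) for one term.

    N: genes in the background list
    K: background genes annotated to the term
    n: significant genes (all of them must be in the background)
    k: significant genes annotated to the term
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 300 significant genes, 15 of them in a term of 200 genes.
p_expressed = enrichment_p(k=15, n=300, K=200, N=12000)  # background: expressed genes
p_genome = enrichment_p(k=15, n=300, K=200, N=20000)     # background: whole genome
print(p_expressed, p_genome)
```

With the same gene list, the whole-genome background gives the smaller p-value, because the expected overlap by chance shrinks as the background grows. That is why the choice of background matters as much as the choice of significant genes.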
That confuses me... Compare them in what context? How is this going to help the GO annotation?
You'll have to understand how the test works. Some reading: https://david.ncifcrf.gov/helps/functional_annotation.html
Yes, OK, it's Fisher's test.
Let me rephrase and simplify the main question: would you use all genes for GO annotation, or only a subset of significant genes?
In other words, would you use the full unfiltered list of DE genes or a filtered one? This will definitely affect the calculation of the adjusted p-value. Imagine doing a Bonferroni correction on two sets of 100 and 1000 genes: the corrected p-values will be different. Of course FDR is more robust, but it is still affected. Filtering is subjective and can produce different results.
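The Bonferroni point above is simple arithmetic: the raw p-value is multiplied by the number of tests m and capped at 1. A minimal sketch, with a made-up raw p-value and a hypothetical helper name:

```python
def bonferroni(p, m):
    """Bonferroni-adjusted p-value for one of m tests, capped at 1."""
    return min(1.0, p * m)


raw_p = 0.0004  # hypothetical raw p-value
adj_100 = bonferroni(raw_p, 100)    # about 0.04
adj_1000 = bonferroni(raw_p, 1000)  # about 0.4
print(adj_100, adj_1000)
```

The same raw p-value passes a 0.05 threshold with 100 tests and fails it with 1000, so what counts as "one test" has to be pinned down before comparing corrected values.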
According to the DAVID example, the 300 genes are the list of DE genes. So I assume my main process of filtering the DESeq output by logFC and FDR is correct.
The FDR is for the number of pathways you test, not genes.
Yes, this is the FDR for GO. For instance, if 30 pathways were found, the adjusted p-value is corrected for the 30 different tests.
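To illustrate that the correction runs over pathways rather than genes, here is a minimal Benjamini-Hochberg sketch in plain Python. The function name and the four pathway p-values are hypothetical, and a given GO tool may use a different adjustment method.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pathway_p = [0.01, 0.04, 0.03, 0.5]  # one raw p-value per tested pathway
pathway_q = bh_adjust(pathway_p)
print(pathway_q)
```

Here m is the number of pathways tested (4 in this toy case, 30 in the example above); the number of genes in the input list affects each raw pathway p-value, not m.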
The FDR I mentioned before is the one in the DE gene output, i.e. from DESeq.
Would you feed a GO analysis tool or algorithm a filtered list of DE genes or an unfiltered one? Any further calculation of the pathways' p- and q-values will be affected by that choice.
I would use all genes with FDR < 0.05 (or 0.1). Yet, this wasn't my point.
Yes I understand, but this was my main question. Thank you