Dear Biostars: I'm a newbie in RNA-Seq and I'm trying to understand the different parts of the analysis. After reading papers, manuals, tutorials and basically whatever comes across my screen, I didn't manage to find any "diagnostic" test(s) to run after a differential gene expression (DGE) analysis. I'd like to know whether the p-values I'm obtaining are "correct", follow an expected distribution, or fit one (or several) of the models I'm specifying better than the others... or are just the result of pressing some buttons. I understand that DGE analyses are more complex than SNP association tests, but in the GWAS world a QQ-plot could tell me whether the p-values I was obtaining followed the expected distribution.
So, in brief, is there any "Diagnostic" analysis to know if my results are ok after a DGE analysis?
(Comments and suggested reading are welcome.)
Welcome to Biostars. In case you don't get the answer you are looking for here, the people at https://support.bioconductor.org/ are more 'specialized' in this type of question.
Tests of distributional fit (such as QQ-plots) are not common in the DGE world. In theory, the p-values from a DGE experiment would be uniform if there were no genuinely differentially expressed genes; however, in contrast to GWAS, we usually expect quite a large number of genes to be DE in an RNA-seq experiment. DGE significance thresholds are usually set on q-values (multiple-testing-adjusted p-values) rather than raw p-values, and one would not expect q-values to be uniformly distributed.
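To make the point about p-value distributions concrete, here is a minimal sketch using only the Python standard library. The p-values are simulated (an assumption, not real data), and Benjamini-Hochberg adjustment is used as one common way of getting "q-value"-like numbers; the function names are made up for this example. With real results you would expect a roughly flat histogram with a spike near zero.

```python
import random

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def pvalue_histogram(pvals, n_bins=10):
    """Count p-values in equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in pvals:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

# Simulated p-values (assumption): 900 null genes with uniform p-values
# plus 100 "DE" genes whose p-values pile up near zero.
rng = random.Random(1)
null_p = [rng.random() for _ in range(900)]
de_p = [rng.random() * 0.001 for _ in range(100)]
pvals = null_p + de_p

hist = pvalue_histogram(pvals)
padj = benjamini_hochberg(pvals)
n_sig = sum(q < 0.05 for q in padj)
print("histogram of p-values:", hist)
print("genes with adjusted p < 0.05:", n_sig)
```

A histogram that is skewed towards 1, or has bumps in the middle, is usually a hint that something is off with the model or the filtering, whereas a flat histogram with no spike near zero simply suggests few or no DE genes.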
The diagnostic plots you will see used are: Do samples cluster by the expected main source of variance? That is, if you perform hierarchical clustering on the top 1000 or so most variable genes, do the samples from condition A cluster together, separately from those from condition B? Principal coordinate plots can also be helpful for detecting outlying samples.
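As a rough illustration of the clustering check, the sketch below uses a toy count matrix with made-up numbers. It swaps full hierarchical clustering for a simpler nearest-neighbour check on the most variable genes, so treat it as a stand-in for the idea rather than the usual workflow. Sample and gene names are hypothetical.

```python
import math
from statistics import pvariance

# Toy count matrix (made-up numbers): one row per gene, one column per sample.
samples = ["A1", "A2", "B1", "B2"]
condition = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
counts = {
    "gene1": [100, 110, 400, 420],
    "gene2": [500, 480, 120, 130],
    "gene3": [50, 55, 52, 49],
    "gene4": [1000, 950, 990, 1020],
    "gene5": [20, 25, 90, 80],
}

# Log-transform with a pseudocount so high-count genes do not dominate.
log_counts = {g: [math.log2(c + 1) for c in row] for g, row in counts.items()}

# Keep the most variable genes (top 3 here; ~1000 would be typical for real data).
top_genes = sorted(log_counts, key=lambda g: pvariance(log_counts[g]), reverse=True)[:3]

def sample_distance(i, j):
    """Euclidean distance between samples i and j over the selected genes."""
    return math.sqrt(sum((log_counts[g][i] - log_counts[g][j]) ** 2 for g in top_genes))

# For each sample, find its closest other sample.
nearest = {}
for i, s in enumerate(samples):
    others = [j for j in range(len(samples)) if j != i]
    nearest[s] = samples[min(others, key=lambda j: sample_distance(i, j))]

for s, n in nearest.items():
    flag = "ok" if condition[s] == condition[n] else "CHECK"
    print(f"{s}: nearest sample is {n} ({flag})")
```

If a sample's nearest neighbour belongs to the other condition, that sample is worth a closer look (possible outlier, sample swap, or a batch effect that is stronger than the condition effect).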
An MA plot is another classic diagnostic plot from a DGE experiment - here you can check that normalization has performed as expected and that the expected base mean, fold change and P-value relationships hold.
Finally, if you are using edgeR or DESeq, you can produce plots of the dispersion fitting process to make sure that looks reasonable.
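To give a feel for what a dispersion plot is showing, here is a minimal sketch that estimates a per-gene dispersion by the method of moments, using the negative binomial relationship Var = mu + alpha * mu^2. This is not what edgeR or DESeq2 actually do (they use likelihood-based estimates with shrinkage towards a fitted trend); the counts are made up and the function name is hypothetical.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Rough dispersion estimate from Var = mu + alpha * mu^2, floored at 0."""
    mu = mean(counts)
    if mu == 0:
        return float("nan")
    var = variance(counts)  # sample variance across replicates
    return max((var - mu) / mu ** 2, 0.0)

# Made-up normalized counts for replicates of a single condition.
replicates = {
    "low_expr": [4, 9, 2, 12],
    "mid_expr": [90, 120, 75, 130],
    "high_expr": [5000, 5400, 4700, 5200],
}

disp = {gene: moment_dispersion(c) for gene, c in replicates.items()}
for gene, c in replicates.items():
    print(f"{gene}: mean={mean(c):.1f} dispersion={disp[gene]:.4f}")
```

In a typical dispersion plot the gene-wise estimates are plotted against mean expression, and dispersion tends to fall as the mean rises; the toy numbers above were chosen to mimic that pattern.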
@i.sudbery, thanks for your answer. I have a second question: how do you know from an MA plot that the relationships hold, or that the dispersion plots look reasonable? For the former, I guess one would expect no genes with significant LFC at low counts (anything else apart from this?). For the dispersion plots... is it how well the data fit the trend? Or how much the data get squeezed?
In general, RNA-seq MA plots look like a funnel, with high LFC variance at low counts and low LFC variance at high counts, and you would expect the "stem" of this funnel to fall in a straight line at an LFC of 0. You would also expect that genes need a higher LFC at lower counts to be called DE. For the dispersion plots, you would generally be looking for the trend line to fit the data. See section 3.7 of the DESeq2 guide.
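The funnel shape can be checked numerically as well as by eye. The sketch below simulates two samples with no true differential expression and computes M (log fold change) and A (mean log expression) for each gene. The noise model is a crude assumption (Gaussian with variance equal to the mean, standing in for count noise), so the exact numbers mean little; only the pattern matters.

```python
import math
import random
from statistics import mean, pstdev

rng = random.Random(0)

def noisy_count(mu):
    """Crude stand-in for count noise: Gaussian with variance equal to the mean."""
    return max(0, round(rng.gauss(mu, math.sqrt(mu))))

# Simulate 2000 genes with NO true differential expression (assumption),
# spanning a wide range of expression levels.
points = []
for _ in range(2000):
    mu = 2 ** rng.uniform(2, 14)  # true mean between 4 and about 16000
    count_a, count_b = noisy_count(mu), noisy_count(mu)
    log_a, log_b = math.log2(count_a + 1), math.log2(count_b + 1)
    m = log_b - log_a          # M: log2 fold change
    a = 0.5 * (log_a + log_b)  # A: mean log2 expression
    points.append((a, m))

low = [m for a, m in points if a < 5]
high = [m for a, m in points if a > 11]
centre = mean(m for _, m in points)

print("mean of M over all genes:", round(centre, 3))
print("SD of M at low A:", round(pstdev(low), 3))
print("SD of M at high A:", round(pstdev(high), 3))
```

If the overall centre of M sits clearly away from 0, or the cloud bends away from the zero line at one end of the expression range, that would suggest normalization has not worked as expected.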
You can try Gene Set Enrichment Analysis (GSEA) after doing differential expression.
There are a lot of signatures to look at.
You can also manually curate gene signatures to look at if you are expecting certain trends (e.g. a pathway going up/down based on knockouts).