Hi,
I have a few questions about how best to derive more intuitive biological meaning from a differential expression analysis of RNA-seq data.
So far I have been using two approaches:
(1) Perform differential expression analysis with limma, then input the DEGs that meet a given significance threshold (e.g. q-value <= 0.05) into Ingenuity Pathway Analysis, or do a standard GO analysis with some other package in R (other alternatives include the DAVID and AmiGO web servers).
(2) Use limma's camera function to find gene sets (derived from the MSigDB Hallmark and C2 collections) that are highly ranked for differential expression, rather than individual genes as in approach (1). Alternatives to camera include GSEA, QuSAGE etc.
Both approaches are motivated by the fact that a huge list of differentially expressed genes is hard to make sense of by itself in most circumstances, but I'm finding it hard to decide which of the two is better to use.
Approach (1) is limited by the fairly arbitrary significance cut-off used to define "significant" DEGs for further functional analysis, whereas with approach (2) you have to trust that the gene sets you use actually reflect the biology you think they do.
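To make the cut-off concern concrete, here is a minimal sketch of the kind of over-representation test that I understand most GO-style tools are built on (a one-sided hypergeometric test). All the numbers are made up for illustration: a universe of 10,000 tested genes, one gene set of 100 genes, and two hypothetical DEG lists obtained from the same data at a looser and a stricter cut-off.

```python
from math import comb

def hypergeom_upper_tail(overlap, set_size, n_deg, universe):
    """P(X >= overlap) when n_deg genes are drawn without replacement
    from a universe that contains set_size genes belonging to the set."""
    max_k = min(set_size, n_deg)
    total = comb(universe, n_deg)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, n_deg - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total

universe = 10_000   # hypothetical number of genes tested
set_size = 100      # hypothetical size of one gene set

# Hypothetical DEG lists from the same data at two cut-offs.
p_loose = hypergeom_upper_tail(overlap=8, set_size=set_size, n_deg=300, universe=universe)
p_strict = hypergeom_upper_tail(overlap=2, set_size=set_size, n_deg=100, universe=universe)

print(f"looser cut-off   (300 DEGs, 8 in set): p = {p_loose:.4f}")
print(f"stricter cut-off (100 DEGs, 2 in set): p = {p_strict:.4f}")
```

With these invented counts the same gene set looks enriched under the looser cut-off and not under the stricter one, which is exactly the sensitivity to the threshold that worries me.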
I realise it's probably the case that there isn't a "best", but I'm hoping this post might start a discussion on how (and when, and why) bioinformaticians like to go about deriving functional meaning from their differential expression results. Is there any harm in just using both approaches and then using your best judgement to make sense of what the results as a whole mean?
Some other more specific questions:
Say I use approach (2) to identify an MSigDB gene set that is highly ranked in terms of differential expression. If the gene set as a whole is called as differentially expressed, but the individual genes in that gene set are not (using approach (1)), what does this mean? In my experience this is quite often the case. Is the idea that the change in expression of each individual gene is too small for it to be called as differentially expressed, but the coordination of many small changes in the same direction leads to the gene set as a whole being called? Do gene set methods therefore have an easier time finding differential expression? I realise this is all very general and answers might vary depending on the specific method.
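Here is a toy sketch of the intuition I'm asking about. It is not camera: it uses a plain z-test on the mean statistic of the set and assumes the genes are independent, which camera explicitly does not assume. The gene counts and the size of the shift are hypothetical, and the z-scores are placed at evenly spaced quantiles rather than sampled, so the result is deterministic.

```python
import math
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

n_background = 5000   # hypothetical genes outside the set, behaving like the null
set_size = 40         # hypothetical gene set size
shift = 1.0           # hypothetical small, coordinated shift in the set's z-scores

# Deterministic stand-ins for per-gene statistics: evenly spaced quantiles.
background_z = [NormalDist(0.0, 1.0).inv_cdf((i + 0.5) / n_background)
                for i in range(n_background)]
set_z = [NormalDist(shift, 1.0).inv_cdf((i + 0.5) / set_size)
         for i in range(set_size)]

# Gene-level view, as in approach (1): BH-adjust all per-gene p-values.
qvals = bh_adjust([two_sided_p(z) for z in set_z + background_z])
n_sig_in_set = sum(q <= 0.05 for q in qvals[:set_size])

# Set-level view: mean z of the set, scaled by sqrt(n) (independence assumed).
set_stat = (sum(set_z) / set_size) * math.sqrt(set_size)
set_p = two_sided_p(set_stat)

print(f"set genes with q <= 0.05: {n_sig_in_set} of {set_size}")
print(f"set-level z = {set_stat:.2f}, p = {set_p:.2e}")
```

In this toy setting no single gene in the set survives multiple-testing correction, yet the set-level statistic is far from zero because the small shifts add up. Real genes in a pathway are correlated, so the independence assumption here overstates the set-level significance.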
Thank you, and apologies for what are probably very beginner questions.
PS: I included RNA-seq in the title, but this post probably applies to microarray data as well, for any future readers.
Thank you for your very informative response.