Question

Functional analysis of RNASeq differentially expressed genes

5.0 years ago

Colari19 ▴ 90

Hi,

I have a few questions about how best to derive more intuitive biological meaning from a differential expression analysis of RNASeq data.

So far I have been using two approaches:

(1) Perform differential expression analysis with Limma, then input DEGs that meet a given significance threshold (e.g. q-value <= 0.05) into Ingenuity Pathway Analysis, or do a standard GO analysis with some other package in R (other alternatives include the DAVID and AmiGO web servers).

(2) Use the Limma CAMERA function to find gene sets (derived from MSigDB Hallmark and C2 collections) that are highly ranked for differential expression, rather than individual genes like in approach (1). Alternatives to CAMERA include GSEA, QuSAGE etc.

These two approaches are obviously motivated by the fact that a huge list of differentially expressed genes is hard to make sense of by itself in most circumstances, but I'm finding it hard to decide which of the two approaches is the best to use.

Approach (1) is obviously limited by the fairly arbitrary significance cut-off used to define "significant" DEGs for further functional analysis, whereas with approach (2) you have to trust that the gene sets you use actually reflect the biology you think they do.

I realise it's probably the case that there isn't a "best", but I'm hoping this post might start a discussion on how (and when, and why) bioinformaticians like to go about deriving functional meaning from their differential expression results. Is there any harm in just using both approaches and then using your best judgement to make sense of what the results as a whole mean?

Some other more specific questions:

Say I use approach (2) to identify an MSigDB gene set that is highly ranked in terms of differential expression. If the gene set as a whole is called as differentially expressed, but the individual genes in that gene set are not (using approach(1)), what does this mean? In my experience this is quite often the case. Is the idea here that the changes in expression of each individual gene are too small for them to be called as differentially expressed, but the level of coordination leads to the gene set being called as differentially expressed? Do gene set methods therefore have an easier time of finding differential expression? I realise this is all very general and answers might vary depending on the specific method.

Thank you, and apologies for what are probably very beginner questions.

PS: I included RNASeq in the title but this post probably applies to microarray as well for any future readers.

RNA-Seq Limma CAMERA IPA GSEA • 2.2k views

Updated 5.0 years ago by h.mon 35k • written 5.0 years ago by Colari19 ▴ 90

score 6 · Accepted Answer · 2019-11-13

There are two aspects to gene set enrichment analysis: the databases defining the sets, and the analysis algorithm. The (now a bit old) paper "A Comparison of Gene Set Analysis Methods in Terms of Sensitivity, Prioritization and Specificity" reviews the analysis methods in its introduction. In that paper's terminology, CAMERA and GSEA are Functional Class Scoring (FCS) methods. IPA uses an Over-Representation Analysis (ORA) test, most probably a hypergeometric test, but I don't have IPA to be certain.
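To make the ORA idea concrete, here is a minimal sketch of a one-sided hypergeometric test in plain Python. It is only an illustration of the generic test, not IPA's actual implementation (which I cannot check); the function name and the gene counts in the example are made up.

```python
from math import comb


def hypergeom_upper_tail(n_universe, n_in_set, n_degs, n_overlap):
    """P(overlap >= n_overlap) when n_degs genes are drawn without
    replacement from n_universe genes, n_in_set of which are in the set."""
    total = comb(n_universe, n_degs)
    max_overlap = min(n_in_set, n_degs)
    # Sum the hypergeometric probabilities from the observed overlap upwards.
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_degs - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 10,000 tested genes, a 100-gene set, 200 DEGs,
# 10 of which fall in the set (about 2 would be expected by chance).
p_value = hypergeom_upper_tail(10_000, 100, 200, 10)
print(f"ORA p-value: {p_value:.2e}")
```

Note that the result depends directly on which genes pass the DEG cut-off and on the choice of gene universe, which is the limitation of approach (1) raised in the question.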

Any gene set can be analysed by either of these two classes of methods. To be more explicit, CAMERA can be used with any kind of set, including GO terms. However, GO has a complicating feature: its categories are hierarchically organized. I like the methods implemented in topGO for dealing with this; another option is to reduce the resulting significant sets with some semantic similarity method.

IPA's advantage is its (proprietary) curated sets. On the analysis side, it uses a fairly standard approach. If IPA gene sets can be exported, then these sets could be analysed using CAMERA, for example, but I don't know whether the IPA license allows this use.

Is there any harm in just using both approaches and then using your best judgement to make sense of what the results as a whole mean?

Potentially no harm, especially if the several analyses complement each other. But be careful not to cherry-pick the results that suit your needs and ignore those that are "unpleasant".

If the gene set as a whole is called as differentially expressed, but the individual genes in that gene set are not (using approach(1)), what does this mean?

It is not "the gene set as a whole" that is differentially expressed. What happens is that the genes in the set have significantly higher scores than the genes not in the set. It could be that all genes in the set are changing in the same direction (so "the gene set as a whole" is differentially expressed), but it could also be that some genes are changing in the same direction while others are not changing at all; even so, the score of the gene set is higher than the score of the genes outside it.
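The "genes in the set versus genes not in the set" comparison can be sketched with a simple rank-sum test on per-gene scores. This is a simplified stand-in, not CAMERA itself: CAMERA additionally adjusts for inter-gene correlation, whereas this sketch assumes genes are independent, uses a normal approximation, and skips the tie correction. The function names and the scores are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def competitive_rank_sum(scores, in_set):
    """Return (z, one-sided p) for 'in-set genes score higher than the rest'."""
    ranks = average_ranks(scores)
    n1 = sum(in_set)
    n2 = len(scores) - n1
    rank_sum = sum(r for r, member in zip(ranks, in_set) if member)
    u = rank_sum - n1 * (n1 + 1) / 2          # Mann-Whitney U for the set
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    return z, 0.5 * erfc(z / sqrt(2))


# Hypothetical per-gene scores (e.g. absolute t-statistics) for 40 genes;
# the 8 genes in the set happen to be the 8 highest-scoring ones.
scores = [0.1 * i for i in range(1, 41)]
in_set = [i > 32 for i in range(1, 41)]
z, p = competitive_rank_sum(scores, in_set)
print(f"z = {z:.2f}, one-sided p = {p:.2e}")
```

Nothing here requires any single gene to pass a significance threshold: only the position of the set's genes in the overall ranking matters.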

Is the idea here that the changes in expression of each individual gene are too small for them to be called as differentially expressed, but the level of coordination leads to the gene set being called as differentially expressed? Do gene set methods therefore have an easier time of finding differential expression?

Yes, that is exactly the case.
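A small worked example of why coordination helps, using Stouffer's method to combine per-gene z-scores. This is an illustration under a strong assumption (independent genes), not what CAMERA or GSEA compute; correlated genes would make the combined result less extreme, which is exactly what CAMERA tries to account for. The z-scores below are invented.

```python
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided normal p-value for a z-score."""
    return erfc(abs(z) / sqrt(2))


def stouffer_z(z_scores):
    """Combine z-scores, assuming they are independent."""
    return sum(z_scores) / sqrt(len(z_scores))


# Hypothetical set of 25 genes, each shifted by a modest z = 1.0
# in the same direction.
gene_z = [1.0] * 25

per_gene_p = two_sided_p(gene_z[0])  # far from significant on its own
set_z = stouffer_z(gene_z)           # 25 / sqrt(25) = 5
set_p = two_sided_p(set_z)

print(f"per-gene p = {per_gene_p:.3f}")
print(f"set-level z = {set_z:.1f}, p = {set_p:.2e}")
```

No gene here would survive a q-value cut-off, yet the consistent shift across 25 genes gives a very small set-level p-value.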