Seeking a tool for enrichment of a small list of ranked genes
1
0
4 weeks ago
Aspire ▴ 370

Tools like enrichR can perform enrichment on a small, unranked list of genes.

For gene lists which are the output of an experiment it is possible to supply not only the name of a gene, but also its p-value or another statistic to rank the genes by.

Is there a tool that can perform enrichment on a small, ranked list of genes (one too small to be suitable for GSEA)? That would make it possible to use more information as input, and thus get more precise results.

enrichment • 766 views
0

What's preventing you from performing over-representation analysis (ORA) instead of GSEA? Even with GSEA, you could adjust the minimum set size parameter down to single-digit values and it would still work, I believe.
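For reference, a minimal sketch of what ORA computes for one gene set: a one-sided hypergeometric test on the overlap between the query list and the set. The function name `ora_pvalue` and the `background_size` argument are illustrative assumptions, not part of enrichR or any other tool mentioned here, and the sketch assumes every gene in the set belongs to the background.

```python
from math import comb


def ora_pvalue(gene_list, gene_set, background_size):
    """One-sided hypergeometric p-value: P(overlap >= observed overlap).

    Assumes all genes in gene_set are part of the background population
    (hypothetical helper, for illustration only).
    """
    query = set(gene_list)
    members = set(gene_set)
    n = len(query)            # size of the query list
    K = len(members)          # size of the gene set
    k = len(query & members)  # observed overlap
    # Sum the hypergeometric probabilities for overlaps of k or more genes.
    tail = sum(
        comb(K, i) * comb(background_size - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return tail / comb(background_size, n)
```

Note that the ranking of the genes never enters this calculation; only membership in the list does.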

0

I am hoping to get more power than with standard ORA. I think that the input for GSEA needs to be several thousand genes (not ~50 genes, as in my case).

0

"I am hoping to get more power than with standard ORA"

You're very much limited at smaller list sizes. Because things like permutation testing are not possible, you have fewer ways to estimate whether the genes in your list are truly over-represented/enriched compared with the whole population you are sampling from.

How small is your list, and do you want a more powerful analysis because you aren't getting meaningful results from a typical ORA approach?

0

The ORA approach does give sensible results (with around 50 genes). However, I want to try to be more specific (get a better sense of which cell type is characterized by changes in these 50 genes).

Besides, I just wonder in general whether such an approach (taking a ranked list) exists. A typical ORA example does not take into account information that is often available (ranking the genes by a statistic). So, improving it could be of interest.

0

If you wanted to get weird with it, you could do something like 80% resampling of your 50-gene list, run ORA on each resample, collect all the results, and count how often each gene set is enriched. This stuff runs quickly, so it will probably take about as long as the original Java GSEA app takes to run. However, with a lot of things like GSEA, ORA, etc., I often find that your choice of reference gene set collection can significantly influence your outcome, so it's often worth dialing that in properly too.
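A rough sketch of that resampling idea, assuming a plain hypergeometric test as the ORA step. All names here (`ora_pvalue`, `resampled_ora`, `fraction`, `n_iter`, `alpha`) are made up for illustration, and the raw p-value cutoff without multiple-testing correction is a simplifying assumption rather than a recommendation.

```python
import random
from math import comb


def ora_pvalue(query, gene_set, background_size):
    """One-sided hypergeometric p-value for the overlap of two sets."""
    n, K = len(query), len(gene_set)
    k = len(query & gene_set)
    tail = sum(
        comb(K, i) * comb(background_size - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return tail / comb(background_size, n)


def resampled_ora(genes, gene_sets, background_size,
                  fraction=0.8, n_iter=200, alpha=0.05, seed=0):
    """Return, per gene set, the fraction of resamples in which it is enriched.

    gene_sets maps a set name to an iterable of gene names (hypothetical
    input format). Each iteration draws `fraction` of the gene list
    without replacement and runs ORA against every gene set.
    """
    rng = random.Random(seed)      # fixed seed for reproducibility
    genes = sorted(set(genes))     # deduplicate, stable order for sampling
    sample_size = max(1, round(fraction * len(genes)))
    hits = {name: 0 for name in gene_sets}
    for _ in range(n_iter):
        subset = set(rng.sample(genes, sample_size))
        for name, members in gene_sets.items():
            if ora_pvalue(subset, set(members), background_size) < alpha:
                hits[name] += 1
    return {name: count / n_iter for name, count in hits.items()}
```

Gene sets that come up in nearly every resample are less likely to depend on a handful of genes in the list; sets that appear only occasionally deserve more scepticism.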

0

What exactly is the information you're using for ranking the genes?

0

The Wald test statistic from the DESeq2 differential expression test.

2
4 weeks ago

When you have a small list of genes, it makes less sense to use statistics.

In that case you should directly interpret the genes for what they are. No statistics are needed.

For example, a "list" containing a single gene that causes a specific phenotype would never be produced as "enriched" in any analysis.

I may have mentioned this in a related post; I wrote this tool to visualize functional annotations for short (or large) lists of genes:

  • GeneScape: A Python package for gene ontology visualization

https://joss.theoj.org/papers/10.21105/joss.06624

1

Technically, I am interested in cell type enrichment, which is not available with GeneScape.

However, your comment that direct interpretation is better than statistics for a small number of genes is spot on. This led me to think that I might introduce the results with a visualization.

Something like this:

[Image: clustergram from Enrichr]

This is a clustergram from Enrichr. In this specific case (this is example data, where the classification is more strongly pronounced than in my own), it is very easy to see that there are two main clusters of cell types which are enriched in the data, and the visualization imho is better than statistics. What do you think?

1

I may be in the minority, and hence not representative of a reviewer's opinion. In general I put little stock in statistical enrichment scores because I don't believe the methodologies apply: our knowledge is incomplete and we don't know what the background or null model should be. How could we detect any effect size then?

Thus I prefer visualization similar to what you produce.

But as cynical as this may sound - I am a realist - you should produce some p-values as well, even if it is just for the sake of having p-values, because many reviewers will want some sort of p-value to be present.

0

I don't think you necessarily have to be so selective in your preference. I could easily produce a convincing visualisation for an enrichment that I have absolutely no faith realistically represents the biology of the experiment I am looking at. I could equally use some hacky enrichment methodology and quote you some p-values, or even be cheeky and quote unadjusted p-values in the main text. These things all need to be taken together, with a healthy dose of scrutiny.

Putting a reviewer's hat on, I always want to see a general summary of enrichment as some kind of overview figure of pathways, but also some examples of significant individual enrichment plots showing genes (even if in the supplementary material), with statistics quoted throughout. I also want to see that you at least attempted a relevant method, even if I have no guarantee you didn't do something funky.

You can place different weights on the different pieces of evidence, but when it comes to working through an analysis correctly, all the pieces are important and necessary.
