Well, gene enrichment (or 'gene-set enrichment analysis'; GSEA) is one of those things on which everyone has their own opinion. I've met everyone from people who don't even want to hear about it to those who apparently idolise it. The careful way that you've written your question tells me that you're somewhere between these two extremes.
The first thing to consider is that gene enrichment is an in silico analysis, but many of the enrichment terms are based on curated datasets. For the Gene Ontology terms, for example, each and every term has an assigned evidence code, which can be taken into account when interpreting a particular enrichment. Take a look at my answer in the thread 'Go annotation reliability ?'.
Should I only select significant genes for my enrichment analyses,
pathway analyses? Why, why not?
The general idea of gene enrichment is that you have identified a group of genes as being statistically significantly associated with a particular condition, and that you want to learn more about the potential functions, processes, pathways, et cetera, that may be altered as a result. Thus, it does not make much sense to perform the enrichment on non-significant genes.
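To make the idea concrete, this kind of enrichment on a list of significant genes is usually framed as an over-representation test. Below is a minimal sketch in plain Python, assuming a one-sided hypergeometric test; the function name and the counts in the usage note are hypothetical and do not come from any particular tool.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_term, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_universe: all genes tested in the experiment (the background)
    n_in_term:  background genes annotated to the term
    n_selected: genes called significant
    n_overlap:  significant genes that are annotated to the term
    """
    max_k = min(n_in_term, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the probability of every overlap at least as large as the one observed.
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_selected - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total
```

For example, `hypergeom_enrichment_p(20000, 100, 200, 10)` asks how surprising it is to see 10 of 200 significant genes in a term that holds 100 of 20000 background genes. The choice of background matters a great deal: it should be the genes that you actually tested, not the whole genome.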
Edit (11th January 2019): some programs can take all genes in your dataset, rather than just the significant ones, perform enrichment, and then determine the degree/level of enrichment by utilising the p-values and fold-changes. These methods are more powerful, I feel.
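As a rough illustration of how such all-gene methods work, the sketch below ranks every gene by a signed significance score and then walks down the ranking, keeping a running sum that rises on genes in the set and falls otherwise. It is a simplified, unweighted Kolmogorov-Smirnov-style score and not the exact algorithm of any published program; the ranking metric (sign of the fold-change times -log10 of the p-value) and both function names are assumptions made for illustration.

```python
import math


def rank_genes(stats):
    """Order genes from most up- to most down-regulated.

    stats maps gene -> (log2_fold_change, p_value). The ranking metric is an
    assumption: sign(log2FC) * -log10(p).
    """
    def score(item):
        lfc, p = item[1]
        return math.copysign(-math.log10(max(p, 1e-300)), lfc)

    return [gene for gene, _ in sorted(stats.items(), key=score, reverse=True)]


def running_sum_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score, between -1 and 1."""
    members = set(gene_set)
    n_hits = sum(1 for g in ranked_genes if g in members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g in ranked_genes:
        # Step up on set members, down on everything else.
        running += 1.0 / n_hits if g in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score near +1 means that the set sits at the top of the ranking (up-regulated), and a score near -1 means that it sits at the bottom. Real tools also weight each step by the ranking metric and assess significance by permutation, both of which are left out here.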
I have found several tutorials on DESeq/2, but I am not finding any
one that gives a clean and comprehensive view on how to further
process the data for downstream enrichment and visualization?
You will never find a 'clean and comprehensive' tutorial - everyone has their own take on it. DESeq2 is excellent at conducting analyses of [primarily] RNA-seq data but it's not a gene enrichment program.
What is the difference between doing GO enrichment by CC vs. BP vs MF?
- CC, cellular component
- BP, biological process
- MF, molecular function
Think of these as sub-classifications. Each of these contains thousands of gene enrichment terms that are organised in a hierarchical fashion. Most people will be interested in just BP and MF.
What is the difference between GO vs KEGG?
These are different organisations/groups.
- The Gene Ontology (GO) Consortium is based in the USA and is funded by the NHGRI. The consortium has been in existence for almost 20 years and its aim is to define natural/healthy biological processes, molecular functions, and cellular components (as per the sub-classifications mentioned above). Their gene enrichment categories and terms are based on either in silico or confirmed laboratory evidence (as per the evidence codes that I mentioned above).
- The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a consortium based in Japan. It has been in existence slightly longer than GO and is most recognised for the curation of pathways in human and other species. KEGG covers a lot of things other than pathways, though. Also, KEGG covers both normal/healthy and disease-related pathways.
NB - it's important to remember that some GO terms relate to pathways too.
I am working with a non-model organism: in that case, is it best to do
these analyses by matching the geneID/name of my organism to the ortholog
geneID/name of a model organism? This may or may not be a good idea
because certain pathways might differ between organisms, but
what is the proposed solution?
If you use an enrichment tool like DAVID, your species of interest is most likely included. In addition, with DAVID, you can do enrichment on both GO and KEGG (and other databases) at the same time. On DAVID's main page, go to Functional Annotation and there you'll see a text box where you can input your genes.
My advice to you is to do the enrichment but to be cautious about the interpretation of the results. It is quite easy to 'cherry pick' the enrichment terms that you want to see, i.e., those that fit your hypothesis(es). If you get lucky and everything comes up for which you had hoped, I would still exercise caution. Don't get too excited by gene enrichment.
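One way to guard against cherry-picking is to fix a multiple-testing threshold before looking at which terms pass. Below is a minimal sketch of the standard Benjamini-Hochberg adjustment in plain Python; the function name and the example p-values in the usage note are made up for illustration, and I am assuming that the 'Benjamini' value reported by enrichment tools corresponds to this procedure.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` returns one adjusted value per term, and you would then keep only the terms below a cut-off (say 0.05) that you chose in advance.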
In terms of filtering enriched terms, if you use DAVID, you can filter enrichment terms based on a Benjamini P value. In terms of displaying gene enrichments, I would recommend simple displays like these:
I think Kevin is giving you great advice here!
I would add that the underlying assumption in any pathway enrichment analysis is that the genes in the pathway are independent variables. In other words, enrichment analysis essentially only "counts" the number of DEGs on any given pathway.
Depending on your experimental design, this may be an appropriate approach, but in my experience, people are usually looking for a more comprehensive analysis approach but are unaware of one.
You might consider using SPIA (Signaling Pathway Impact Analysis) if you want to use an R/Bioconductor-based approach, or possibly ROntoTools. These tools take a topology-based approach, which looks at each DEG's role, position, and interactions to identify pathways that are perturbed rather than simply enriched.
If you want to use a web-based version of these tools (without any command-line use) you can try it for free in iPathwayGuide from Advaita Bioinformatics.
That is a very good point, andrew
Thank you so much, Kevin, for covering these points so comprehensively. It has cleared up a lot of my doubts. If I have more questions, I will let you know.
Hi Kevin, can you tell me what tool was used to generate these plots, and link to a description of them so that I can read more?
Thanks,
Hi friend, the top plot is just an Enhanced OncoPrint, made using the ComplexHeatmap package. The plot on the left is mutation data, whilst on the right it's gene enrichment based on the genes in which the mutations are found.
The bottom plot is my own but based on the functions provided by ComplexHeatmap. It's essentially the same enrichment plot as on the top-right, but I've added a lot of annotation and have split the heatmap based on up- and down-regulated genes.
I would encourage you to devote a single day to learning ComplexHeatmap, as you will then never go back to using the other heatmap functions. It is highly flexible and the possibilities are endless. If you run into difficulty, post a question here and I should pick it up.