Hi all,
I am new to bioinformatics and am currently learning how to use GSEA.
Background: I analyzed my RNA-Seq results using DESeq2, and am now learning to perform GSEA. For my project, in broad terms, I have samples from sick patients and healthy patients. My plan was to perform GSEA to identify enriched gene sets in the sick patients, and then perform Leading Edge analysis to view genes that are present across many of the enriched gene sets. I am particularly interested in gene sets/genes having to do with immune responses. The MSigDB collection I used was GO Biological Process (BP).
As I am working with this program for the first time, I am stumped about two things:
The GSEA analysis came back with "888 gene sets are significantly enriched at nominal pvalue < 1%" and "1815 gene sets are significant at FDR < 25%", so I am a little overwhelmed by the volume of data. In this type of analysis, is it sufficient to look at and work with, for example, the top 50 enriched gene sets and continue my Leading Edge analysis with those? My guess is no, and that this would lead me to miss out on potentially interesting results.
Looking at my top 20 enriched gene sets, for example, there are a number of gene sets that pertain to my experiment and what I am interested in, such as those having to do with pattern recognition receptor signaling, TLR signaling, antigen processing, etc. In addition, there are also highly enriched gene sets such as golgi vesicle transport, ER to golgi mediated transport, vesicle targeting which I am less interested in as they have less to do with immune responses. Is there a method to filter my GO results for ones having to do with immune response, and perform Leading Edge analysis on that filtered subset?
I guess what it boils down to is - I am overwhelmed with the # of enriched gene sets and volume of data, and am unsure of where to go next in my analysis! Ideally, I would like to narrow down my list of DEGs to a few genes that I could explore further for their role in disease pathophysiology.
I would appreciate any help/suggestions/advice! I hope my question was clear - I am still new to bioinformatics and am not always certain about the terminology and stuff :)
Hi!
In this case there are some strategies to solve your issues.
First, if you want to study gene sets related to immune response, you can create your own gmt file containing only the gene sets associated with the immune response. That will reduce the number of gene sets you have to go through.
Second, I suggest running your analysis with an R package such as fgsea. Running GSEA with this package will help you perform the leading edge analysis, following some advice from this post.
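To make the two ideas concrete, here is a minimal Python sketch, not a definitive implementation. It assumes the standard GMT layout (one gene set per line: name, description, then genes, all tab-separated). The keyword list, the function names `filter_gmt` and `leading_edge_counts`, and the toy gene sets and leading edges are all made up for illustration and are not real MSigDB memberships.

```python
from collections import Counter

# Hypothetical keywords; adjust them to the GO terms you care about.
IMMUNE_KEYWORDS = (
    "IMMUNE",
    "TOLL_LIKE_RECEPTOR",
    "PATTERN_RECOGNITION",
    "ANTIGEN_PROCESSING",
)


def filter_gmt(lines, keywords=IMMUNE_KEYWORDS):
    """Keep GMT lines whose gene set name contains any keyword."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # The gene set name is the first tab-separated field.
        name = line.split("\t", 1)[0].upper()
        if any(keyword in name for keyword in keywords):
            kept.append(line)
    return kept


def leading_edge_counts(leading_edges):
    """Count in how many gene sets each leading-edge gene appears."""
    counts = Counter()
    for genes in leading_edges.values():
        counts.update(set(genes))  # set() so a gene counts once per set
    return counts


# Toy GMT content (illustrative only, not real gene set memberships).
toy_gmt = [
    "GOBP_TOLL_LIKE_RECEPTOR_SIGNALING_PATHWAY\tna\tTLR4\tMYD88\tIRAK4\n",
    "GOBP_GOLGI_VESICLE_TRANSPORT\tna\tSEC23A\tCOPB1\n",
    "GOBP_ANTIGEN_PROCESSING_AND_PRESENTATION\tna\tHLA-A\tB2M\tTAP1\n",
]
immune_sets = filter_gmt(toy_gmt)

# Toy leading edges per enriched gene set (illustrative only).
toy_leading_edges = {
    "SET_A": ["TLR4", "MYD88"],
    "SET_B": ["MYD88", "IRF7"],
    "SET_C": ["MYD88", "TLR4", "MYD88"],
}
counts = leading_edge_counts(toy_leading_edges)

# Genes recurring across the most gene sets come first.
top_genes = counts.most_common(2)
print(len(immune_sets), top_genes)
```

With a real file you would pass the open file handle to `filter_gmt` and write the kept lines (joined with newlines) to a new `.gmt` file. Keep in mind that matching on the gene set name is crude, so check the filtered list by eye before running GSEA on it.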
Also, I have a question: what is your input data for GSEA? Genes passing an abundance filter, or genes obtained from differential expression analysis?
Best regards