Hi all
I am analyzing bulk RNA-seq data with GSEA and MSigDB to identify significantly enriched pathways.
I am interested in which signaling pathways are enriched, so I am planning on using "C2: curated gene sets", its subcollection "CP: Canonical pathways", or another subset "KEGG_MEDICUS subset of CP".
However, these collections contain many pathways that I am not interested in.
Adding pathways increases the number of hypotheses tested, and the multiple-testing correction needed to control Type I error then penalizes the relevant tests unnecessarily. Q1) Is it acceptable to subset these collections/subcollections further, keeping only the pathways I am interested in testing?
Or do I need to stick with the entire collection provided by the Broad Institute?
In "C7: immunologic signature gene sets", there are gene sets from certain studies that I would like to test for enrichment in our experiment. Q2) Can I test just those sets individually in GSEA, or do I need to feed in the entire C7 collection?
Thanks!
Q1: Yes, I personally think meaningful subsetting will help reduce the multiple-testing burden.
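To make the multiple-testing point concrete, here is a minimal sketch in plain Python. It applies a Benjamini-Hochberg adjustment to made-up nominal p-values, once alongside 97 uninteresting sets and once for the three sets of interest alone. Note the assumptions: the p-values are hypothetical, and GSEA itself estimates FDR from permutation-based null distributions rather than with this exact procedure, so this only illustrates the general effect.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical nominal p-values (not from any real analysis).
pathways_of_interest = [0.001, 0.004, 0.02]
other_pathways = [0.5] * 97  # made-up sets we do not care about

# Adjusted values for the three sets of interest, tested with and without the rest.
full = benjamini_hochberg(pathways_of_interest + other_pathways)[:3]
subset = benjamini_hochberg(pathways_of_interest)

for p, q_full, q_subset in zip(pathways_of_interest, full, subset):
    print(f"p={p:.3f}  adjusted (100 sets)={q_full:.3f}  adjusted (3 sets)={q_subset:.3f}")
```

The same nominal p-values get smaller adjusted values in the three-set run. This only holds up if the subset is chosen before looking at the results, based on the biological question.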
Q2: Whatever floats your boat. Personally, I think these enrichment analyses are a mess anyway, both statistically and because the collections are not really standardized (for example, REACTOME terms sometimes contain genes that cannot be mapped because of typos) and the included genes are (heavily) redundant between pathways.
After all, these analyses may suggest something, but that always needs to be confirmed by other analyses or experiments. I would never see term enrichment of any kind as "proof" of anything.
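Both issues mentioned above (unmappable genes and redundancy between sets) are easy to check before running anything. A rough sketch in plain Python, assuming gene sets in the tab-separated GMT layout (set name, description, then gene symbols). The set names, the toy memberships, the deliberate typo, and the `measured_genes` set are all made up for illustration.

```python
def parse_gmt(text):
    """Parse GMT text: one set per line, tab-separated name, description, genes."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def jaccard(a, b):
    """Jaccard index: shared genes divided by all genes in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy collection; "BRAFF" is a deliberate typo for BRAF.
toy_gmt = (
    "TOY_PATHWAY_A\tna\tTP53\tMDM2\tCDKN1A\tATM\n"
    "TOY_PATHWAY_B\tna\tTP53\tMDM2\tCDKN1A\tCHEK2\n"
    "TOY_PATHWAY_C\tna\tEGFR\tKRAS\tBRAFF\n"
)
# Hypothetical set of gene symbols present in the expression data.
measured_genes = {"TP53", "MDM2", "CDKN1A", "ATM", "CHEK2", "EGFR", "KRAS", "BRAF"}

gene_sets = parse_gmt(toy_gmt)

# Genes listed in a set that cannot be matched to the expression data.
unmapped = {name: sorted(genes - measured_genes) for name, genes in gene_sets.items()}

# Pairwise overlap between sets, as a rough measure of redundancy.
names = sorted(gene_sets)
overlaps = {
    (a, b): jaccard(gene_sets[a], gene_sets[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

for name in names:
    print(name, "unmapped:", unmapped[name])
for pair, score in overlaps.items():
    print(pair, f"Jaccard={score:.2f}")
```

Sets with a high Jaccard index will tend to show up as enriched together, so they should not be read as independent evidence.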
The following is definitely worth a read if you are doing gene set enrichment analyses:
Urgent need for consistent standards in functional enrichment analysis
Thank you both for your suggestions. I will keep those points in mind and will also try to follow the best practices outlined in the linked paper.