6.6 years ago
Joe Kherery
Hello everyone,
I have a list of differentially expressed genes that contains several named genes along with some LOC and LINC entries. I intend to run functional enrichment on this list. Is it prudent to keep the LOC and LINC entries? I noticed that if I remove them, I get more enriched pathways. Is it correct to remove them, or does that bias my analysis?
Regards
I'd normally remove any gene that has no pathway annotations at all before running functional similarity analysis.
Presented like this, this seems the wrong thing to do. If this is OK, then why not remove genes with annotations you don't like? I hope that this is at least mentioned when you report the results because, on the face of it, this is fishing for significance. Removing genes from a list has to be well motivated and taken into account when interpreting the results of the analysis.
You may be interested in reading these papers:
- Multiple sources of bias confound functional enrichment analysis of global -omics data
- Annotation Enrichment Analysis: An Alternative Method for Evaluating the Functional Properties of Gene Sets
- Using predictive specificity to determine when gene set analysis is biologically meaningful
Dear Jean-Karim Heriche,
I am a bit confused now. Can I not remove the LOC, LINC and MIR entries from my list before running functional enrichment?
They do not have GO annotations, and keeping them can stop some canonical pathways from reaching a significant p-value.
I do remove genes with annotations I don't like, from GO at least. If the only evidence code is IEA or IEP or similar: ditched. I'm afraid I don't agree with you on this issue, and I don't see it as p-hacking to remove genes that are unannotated across all gene sets, although I can see how this may reduce hypergeometric p-values (if not GSEA ones). Provided you are transparent about the source of your annotations, the filtering of your gene sets, etc., your functional similarity mining is perfectly defensible. However, it's very rare that these approaches provide any notable biological insight into a project.
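To make the point about hypergeometric p-values concrete, here is a minimal sketch in base R. All the numbers are made up for illustration, and enrichment_p is a hypothetical helper, not a function from any package. It compares one pathway's p-value when unannotated genes stay in the query, when they are dropped from the query only, and when they are dropped from both query and background.

```r
# Upper-tail hypergeometric p-value for a single pathway (hypothetical helper).
# overlap:      query genes that fall in the pathway
# pathway_size: pathway genes present in the background
# background:   total number of genes in the background
# query_size:   number of genes in the query list
enrichment_p <- function(overlap, pathway_size, background, query_size) {
  # P(X >= overlap) when drawing query_size genes from the background
  phyper(overlap - 1, pathway_size, background - pathway_size,
         query_size, lower.tail = FALSE)
}

# Made-up scenario: 500 DE genes, 200 of which are unannotated LOC/LINC entries.
p_full <- enrichment_p(8, 100, 20000, 500)

# Same overlap, but the 200 unannotated genes are dropped from the query only.
p_filtered <- enrichment_p(8, 100, 20000, 300)

# Unannotated genes dropped from query AND background
# (assumes 8000 of the 20000 background genes are unannotated).
p_both <- enrichment_p(8, 100, 12000, 300)

print(c(full = p_full, query_only = p_filtered, both = p_both))
```

With the background held fixed, shrinking the query while keeping the same overlap can only lower the p-value, which is one way more pathways can cross a significance threshold. Filtering the background in the same way changes that, so it is worth reporting which of the two was done.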
Dear russhh,
Do you do it manually, one by one in UniProt?
No. I use gene set definitions from GO or Reactome programmatically. How are you performing your GSEA or Fisher tests?
I use MSigDB or enrich via web, sometimes I use panther db too.
Can you give me an example of how to filter my list of genes?
What language do you use?
Dear russhh, I only use R.
The approach depends on the data structure used.
The gene sets I use for fgsea come from Reactome. Suppose the genes in my experiment are stored in the vector my_genes (as Entrez IDs). I'd obtain Reactome annotations using
genesets <- fgsea::reactomePathways(my_genes)
Here genesets is a list of vectors of Entrez IDs. Then I'd obtain the set of Reactome-annotated genes:
universe <- purrr::reduce(genesets, union)
(note that, by construction, universe is a subset of my_genes).
Then I'd subset my experimental data so that I only consider those genes that have at least one annotation and are present in my experiment (this depends on how you've stored your experimental data, but can typically be done using my_query %in% universe type syntax).
I can't really help any further without some sense of how your dataset is organised.
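The steps above can be sketched end to end in base R. This is only an illustration under stated assumptions: the genesets list is a small hand-made stand-in for what fgsea::reactomePathways(my_genes) would return, de_results and its columns are hypothetical, and Reduce(union, ...) is used as the base R counterpart of purrr::reduce(genesets, union).

```r
# Entrez IDs measured in the experiment (toy values, as character strings).
my_genes <- c("1", "2", "3", "4", "5", "6")

# Hypothetical stand-in for fgsea::reactomePathways(my_genes):
# a named list of character vectors of Entrez IDs.
genesets <- list(
  pathway_a = c("1", "2", "3"),
  pathway_b = c("3", "5")
)

# All genes that appear in at least one gene set.
universe <- Reduce(union, genesets)

# Hypothetical differential expression table, one row per gene.
de_results <- data.frame(
  entrez_id = my_genes,
  log2fc    = c(1.2, -0.8, 2.1, 0.3, -1.5, 0.9),
  stringsAsFactors = FALSE
)

# Keep only the rows whose gene has at least one annotation.
annotated_results <- de_results[de_results$entrez_id %in% universe, ]

# Record what was dropped so the filtering can be reported.
dropped_ids <- setdiff(de_results$entrez_id, universe)

print(annotated_results)
print(dropped_ids)
```

Keeping dropped_ids around makes it easy to state in the write-up how many genes were removed for lack of annotation, which addresses the transparency concern raised earlier in the thread.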