This is a follow-up/crosspost of my question on the Bioconductor list. I have two related questions:
On Ensembl, if a gene is annotated with a specific GO term (e.g. GO:0005634, nucleus), how does Ensembl decide whether the parent GO terms (e.g. the parents of GO:0005634) are also included?
Before a GO enrichment analysis such as GSEA, shouldn't the gene annotations be augmented to include all the parent terms?
This is an example to illustrate my first question. Ensembl/BioMart tells me that gene ENSG00000281813 has the term GO:0005634, nucleus, from the cellular component ontology. However, the parents of GO:0005634 are not included in the annotation, at least not all of them. For example, the topmost parent, GO:0005575 (cellular component), is not there:
library(biomaRt)
mart <- useEnsembl("ensembl", "hsapiens_gene_ensembl", version=107)
gos <- getBM(filters=c('ensembl_gene_id'), attributes=c('ensembl_gene_id', 'go_id', 'name_1006', 'namespace_1003'), values=list(c('ENSG00000281813')), mart=mart)
gos
ensembl_gene_id go_id name_1006 namespace_1003
1 ENSG00000281813
2 ENSG00000281813 GO:0005634 nucleus cellular_component
3 ENSG00000281813 GO:0046872 metal ion binding molecular_function
4 ENSG00000281813 GO:0016740 transferase activity molecular_function
5 ENSG00000281813 GO:0006355 regulation of transcription, DNA-templated biological_process
6 ENSG00000281813 GO:0003677 DNA binding molecular_function
7 ENSG00000281813 GO:0006325 chromatin organization biological_process
8 ENSG00000281813 GO:0016746 acyltransferase activity molecular_function
9 ENSG00000281813 GO:0006334 nucleosome assembly biological_process
10 ENSG00000281813 GO:0000786 nucleosome cellular_component
11 ENSG00000281813 GO:0043966 histone H3 acetylation biological_process
12 ENSG00000281813 GO:0000123 histone acetyltransferase complex cellular_component
13 ENSG00000281813 GO:0045893 positive regulation of transcription, DNA-templated biological_process
14 ENSG00000281813 GO:0004402 histone acetyltransferase activity molecular_function
15 ENSG00000281813 GO:0016573 histone acetylation biological_process
16 ENSG00000281813 GO:0005515 protein binding molecular_function
17 ENSG00000281813 GO:0042393 histone binding molecular_function
18 ENSG00000281813 GO:0045892 negative regulation of transcription, DNA-templated biological_process
19 ENSG00000281813 GO:0061629 RNA polymerase II-specific DNA-binding transcription factor binding molecular_function
20 ENSG00000281813 GO:0005654 nucleoplasm cellular_component
21 ENSG00000281813 GO:0045944 positive regulation of transcription by RNA polymerase II biological_process
22 ENSG00000281813 GO:0003712 transcription coregulator activity molecular_function
23 ENSG00000281813 GO:0070776 MOZ/MORF histone acetyltransferase complex cellular_component
24 ENSG00000281813 GO:0050793 regulation of developmental process biological_process
25 ENSG00000281813 GO:1903706 regulation of hemopoiesis biological_process
26 ENSG00000281813 GO:0016407 acetyltransferase activity molecular_function
One would think that Ensembl includes only the most specific terms, since the parents are automatically implied. However, this is not the case. For example, ENSG00000276595 does include the topmost term GO:0005575 (marked with **** below) but also some of its offspring:
gos <- getBM(filters=c('ensembl_gene_id'), attributes=c('ensembl_gene_id', 'go_id', 'name_1006', 'namespace_1003'), values=list(c('ENSG00000276595')), mart=mart)
gos
ensembl_gene_id go_id name_1006 namespace_1003
1 ENSG00000276595 GO:0016020 membrane cellular_component
2 ENSG00000276595 GO:0016021 integral component of membrane cellular_component
3 ENSG00000276595 GO:0005783 endoplasmic reticulum cellular_component
4 ENSG00000276595 GO:0005515 protein binding molecular_function
5 ENSG00000276595 GO:0003674 molecular_function molecular_function
6 ENSG00000276595 GO:0005575 cellular_component cellular_component ****
7 ENSG00000276595 GO:0097225 sperm midpiece cellular_component
Is there a reason for this seemingly inconsistent behaviour?
Regarding my second question, I believe that data straight from Ensembl/BioMart is not suitable for GSEA as implemented in e.g. fgsea, since each gene's annotation should first be augmented to include all parent terms of the terms it already has. Am I right?
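To make concrete what I mean by "augmenting", here is a minimal sketch in base R. The `parents` map is a tiny, simplified, hand-written stand-in for the GO graph (an assumption for illustration only; real edges would have to come from an ontology source, for example the ancestor maps in the GO.db package), and `get_ancestors` and `augment_annotations` are hypothetical helper names, not functions from biomaRt or fgsea.

```r
# Toy child -> parents map. This is a simplified, illustrative subset,
# not the real GO graph.
parents <- list(
  "GO:0005634" = c("GO:0043231"),   # nucleus
  "GO:0043231" = c("GO:0005575"),   # intermediate term (simplified)
  "GO:0005575" = character(0)       # cellular_component (root)
)

# Collect all ancestors of one term by walking the parent map breadth-first.
get_ancestors <- function(term, parents) {
  seen <- character(0)
  queue <- parents[[term]]          # NULL if the term is not in the map
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% seen)) {
      seen <- c(seen, current)
      queue <- c(queue, parents[[current]])
    }
  }
  seen
}

# Add every ancestor of every annotated term to each gene's term list.
augment_annotations <- function(gene2go, parents) {
  lapply(gene2go, function(terms) {
    ancestors <- unlist(lapply(terms, get_ancestors, parents = parents))
    unique(c(terms, ancestors))
  })
}

# Direct annotation only, as it might come back from BioMart.
gene2go <- list(ENSG00000281813 = c("GO:0005634"))
augmented <- augment_annotations(gene2go, parents)

# Invert to term -> genes, i.e. a named list of gene sets.
term2gene <- split(rep(names(augmented), lengths(augmented)),
                   unlist(augmented, use.names = FALSE))
```

With this toy map, the gene ends up in the gene set of the root term GO:0005575 as well as in that of GO:0005634. A named list like `term2gene` is the shape that fgsea expects for its `pathways` argument, so this is roughly the preprocessing step I suspect is needed.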
Thanks Ben, but I'm not sure this answers my (first) question, or if it does, it just moves it somewhere outside Ensembl. I'm not asking where the GO terms for a given gene come from. Instead, I'm wondering why some genes are annotated with a specific term AND some of the ancestors of that term, while other genes are annotated only with the specific terms. In my opinion, it would be more user-friendly if a gene annotated with a specific term were also annotated with all the ancestor terms. If I'm not mistaken (and this is my second question), data retrieved from Ensembl/BioMart is not suitable for fgsea as is, since genes are not fully annotated with parent terms.