Here's one way to do this using biomaRt, based on Michael Dondrup 's information that you need to use GO ID GO:0003723
.
library(biomaRt)
## I'm using the uswest mirror as the main Ensembl site is very slow today
human <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl", mirror = "uswest")
results <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol","ensembl_transcript_id","transcript_biotype"),
filters = c("go"),
values = list("GO:0003723"),
mart = human)
dim(results)
#> [1] 5671 4
head(results)
#> ensembl_gene_id hgnc_symbol ensembl_transcript_id transcript_biotype
#> 1 ENSG00000262860 LSM14A ENST00000575811 protein_coding
#> 2 ENSG00000275051 PIWIL1 ENST00000613226 protein_coding
#> 3 ENSG00000275051 PIWIL1 ENST00000632888 protein_coding
#> 4 ENSG00000262860 LSM14A ENST00000570462 protein_coding
#> 5 ENSG00000278229 RPS17 ENST00000617731 protein_coding
#> 6 ENSG00000262156 APOBEC3A ENST00000623492 protein_coding
That finds all transcripts of genes that are directly annotated with the GO ID. You might also be interested in genes that are annoted with a child of "RNA Binding" in the GO heirachy. For that we can use the filter go_parent_term
instead. You can see this returns more results and that all of our first hits can be found in this second set of results (as you'd expect).
results2 <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol","ensembl_transcript_id","transcript_biotype"),
filters = c("go_parent_term"),
values = list("GO:0003723"),
mart = human)
dim(results2)
#> [1] 8077 4
table(results$ensembl_gene_id %in% results2$ensembl_gene_id)
#>
#> TRUE
#> 5671
Many thanks. It worked fine. I might need a slight modification. If I wanted to include only rows with unique ENSG how should I change the attributes of
getBM
? I eventually tried "ensembl_gene_id = unique" but it did not work.I think it depends on exactly how much information you want. The information you get back from Ensembl tends to be "unique rows". So if in the example above, if a gene has two transcripts it would appear twice. Similarly if it is annotated with two biotypes it will also appear twice.
If you're really just interested in the Ensembl ID for genes annotated with that GO term, you can ask for just the
ensembl_gene_id
attribute e.g.What bothers me a bit now is the difference in gene counts between the web-interface (3467) and biomaRt (1725). Possibly the web-interface counts the genes differently?
This is because of there's two possible filters both of which look like "Search by GO ID".
The 1725 results come from picking "GO ID(s)":
The 3467 hits are the result of choosing "GO Term Accession":
It's the equivalent to the filters
go
vsgo_parent_term
in the biomaRt examples above and is super confusing!This difference is that the second version finds genes like ENSG00000210049 which is annotated with GO:0030533 (triplet codon-amino acid adaptor activity) which is a child of GO:0003723 (RNA Binding). This first doesn't return a hit like that.