How Do I Do Simple Go Term Lookup Given A Gene (Or Mrna) Identifier?

Entering edit mode

15.2 years ago

Ryan Thompson ★ 3.7k

Suppose I have a gene ID such as NM_030621, which happens to be one of the human mRNA splice variants of DICER1. How can I programmatically get a list of GO terms with which this ID is associated? Going through protein or gene IDs as intermediaries is acceptable.

I want to do this for a whole list of such IDs, all corresponding to RefSeq mRNA sequences. I don't want to do enrichment analysis or anything, just get a list of GO terms for each mRNA.

Preferred languages are Perl and R.

gene annotation database • 38k views

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Ryan Thompson ★ 3.7k

Entering edit mode

Thanks for all the answers, everyone. I ended up using biomaRt in R.

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Ryan Thompson ★ 3.7k

Entering edit mode

Nice question, awesome answers, just added my suggestion using GOOSE.

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Khader Shameer 18k

Entering edit mode

Thank you for this post ! I can't believe that something so elemental would be so complicated.

ADD REPLY • link 6.5 years ago by lagartija ▴ 160

Entering edit mode

15.2 years ago

Neilfws 49k

Using R, you can do this with the biomaRt package. Here is a quick example:

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
results <- getBM(attributes = c("refseq_dna", "go_molecular_function_id",
           "go_molecular_function__dm_name_1006"), filters = "refseq_dna",
           values = c("NM_030621"), mart = mart)

In the getBM() function, we specify which attributes to return (RefSeq ID, GO MF ID, GO MF Name), the filters to use for the query (RefSeq ID) and the filter values (NM_030621). You could also pass a list of values, e.g. from a data frame column as mydf$column.

Result:

        refseq_dna  go_molecular_function_id go_molecular_function__dm_name_1006
1       NM_030621                 GO:0005515                     protein binding
2       NM_030621                 GO:0008026     ATP-dependent helicase activity
3       NM_030621                 GO:0000166                  nucleotide binding
4       NM_030621                 GO:0000287               magnesium ion binding
5       NM_030621                 GO:0004386                   helicase activity
6       NM_030621                 GO:0004519               endonuclease activity
7       NM_030621                 GO:0016787                  hydrolase activity
8       NM_030621                 GO:0030145               manganese ion binding
9       NM_030621                 GO:0004525           ribonuclease III activity
10      NM_030621                 GO:0003725         double-stranded RNA binding
11      NM_030621                 GO:0005524                         ATP binding
12      NM_030621                 GO:0003723                         RNA binding
13      NM_030621                 GO:0003676                nucleic acid binding
14      NM_030621                 GO:0003677                         DNA binding

You could also use the BioMart website, where you can upload a list of IDs, specify attributes and filters and download results as delimited text files.

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.2 years ago by Neilfws 49k

Entering edit mode

Thanks, that worked great. Biomart looks incredibly useful.

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Ryan Thompson ★ 3.7k

Entering edit mode

Neil, do you known why there are more GO identifiers on ensembl for this protein?

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Pierre Lindenbaum 166k

Entering edit mode

I think the Ensembl page shows all three GO ontologies (BP, CC and MF) - the example above is only MF.

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Neilfws 49k

Entering edit mode

When I try this, I find that 'refseq_dna', 'go_molecular_function_id', and 'go_molecular_function__dm_name_1006' are not valid attributes. Has something changed?

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.8 years ago by Greg P ▴ 70

Entering edit mode

But, in this way, would not you get limited to the RefSeq ID's which have Ensemble information in the database?

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.5 years ago by Prakki Rama ★ 2.7k

Entering edit mode

What if your organism is not in ensembl list anymore, like deinococcus radiodurans?

ADD REPLY • link 8.6 years ago by xioli2013 ▴ 10

Entering edit mode

How to do this for only BP goterms?

ADD REPLY • link 6.2 years ago by arya64898 • 0

Entering edit mode

15.2 years ago

Pierre Lindenbaum 166k

My XSLT solution:

The stylesheet:


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >

<xsl:output method="text" encoding="ISO-8859-1"/>

<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="Dbtag[Dbtag_db='GeneID']">
  <xsl:for-each select="Dbtag_tag/Object-id/Object-id_id">
     <xsl:variable name="genid" select="."/>
    <xsl:variable name="url" select="concat('&lt;a href=" http:="" eutils.ncbi.nlm.nih.gov="" entrez="" eutils="" efetch.fcgi?db="gene&amp;amp;retmode=xml&amp;amp;id=',$genid)" "="" rel="nofollow">http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=',$genid)" />
    <xsl:apply-templates select="document($url,//Other-source)" mode="go"/>
  </xsl:for-each>
</xsl:template>

<xsl:template match="Other-source" mode="go">
 <xsl:if test="Other-source_src/Dbtag/Dbtag_db='GO'">
 <xsl:value-of select="concat('GO:',Other-source_src/Dbtag/Dbtag_tag/Object-id/Object-id_id)"/>
<xsl:text>        </xsl:text>
<xsl:value-of select="Other-source_anchor"/>
<xsl:text>
</xsl:text>
 </xsl:if>
</xsl:template>

<xsl:template match="text()">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()" mode="go">
</xsl:template>

</xsl:stylesheet>

Apply this stylesheet with xsltproc for NM_030621:

xsltproc --novalid stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_030621&rettype=xml"

Result:

GO:5524         ATP binding
GO:8026         ATP-dependent helicase activity
GO:3725         double-stranded RNA binding
GO:4519         endonuclease activity
GO:4386         helicase activity
GO:16787        hydrolase activity
GO:46872        metal ion binding
GO:166          nucleotide binding
GO:5515         protein binding
GO:4525         ribonuclease III activity
GO:6396         RNA processing
GO:1525         angiogenesis
GO:48754        branching morphogenesis of a tube
GO:35116        embryonic hindlimb morphogenesis
GO:31047        gene silencing by RNA
GO:30324        lung development
GO:31054        pre-microRNA processing
GO:30422        production of siRNA involved in RNA interference
GO:19827        stem cell maintenance
GO:30423        targeting of mRNA for destruction involved in RNA interference
GO:16442        RNA-induced silencing complex
GO:5737         cytoplasm
GO:5622         intracellular

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.2 years ago by Pierre Lindenbaum 166k

Entering edit mode

Interesting solution!

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Istvan Albert 102k

Entering edit mode

Pierre could you share your experience/comments on when to go the xslt route for such tasks?

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Marcin Cieslik ▴ 520

Entering edit mode

... hum, just a kind of feeling... :-) I first think of XSLT when I know that I can get the source of data as XML, when this source is not too large (must fit in memory), when speed is not an issue and I don't need a complicated processing (e.g. tokenizing, aggregation of data (mean/uniq/max/etc...) )

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Pierre Lindenbaum 166k

Entering edit mode

10.8 years ago

adam.maikai ▴ 530

Using MyGene.info's new R/Bioconductor package mygene, this can be done simply with the query() function.

library(mygene)
res<-query('NM_030621', fields='go', species='human')$hits
lapply(res, as.list)

This returns a data.frame for CC, MF, and BP, matching score, and the gene id.

	$go
	$go$CC
	$go$CC[[1]]
	term id evidence pubmed
	1 endoplasmic reticulum-Golgi intermediate compartment GO:0005793 IEA NA
	2 cytosol GO:0005829 TAS NA
	3 RISC complex GO:0016442 IDA 15973356
	4 axon GO:0030424 IEA NA
	5 dendrite GO:0030425 IEA NA
	6 growth cone GO:0030426 IEA NA

	$go$MF
	$go$MF[[1]]
	term pubmed id evidence
	1 double-stranded RNA binding 12411504 GO:0003725 IDA
	2 helicase activity NA GO:0004386 IEA
	3 ribonuclease III activity NA GO:0004525 IDA
	4 protein binding 12526743 GO:0005515 IPI
	5 ATP binding NA GO:0005524 IEA
	6 miRNA binding NA GO:0035198 IEA
	7 metal ion binding NA GO:0046872 IEA

	$go$BP
	$go$BP[[1]]
	term id evidence pubmed
	1 negative regulation of transcription from RNA polymerase II promoter GO:0000122 ISS NA
	2 angiogenesis GO:0001525 IEA NA
	3 post-embryonic development GO:0009791 IEA NA
	4 zygote asymmetric cell division GO:0010070 IEA NA
	5 gene expression GO:0010467 TAS NA
	6 negative regulation of Schwann cell proliferation GO:0010626 ISS NA
	7 positive regulation of gene expression GO:0010628 IEA NA
	8 positive regulation of Schwann cell differentiation GO:0014040 ISS NA
	9 stem cell maintenance GO:0019827 IEA NA
	10 spinal cord motor neuron differentiation GO:0021522 IEA NA
	11 nerve development GO:0021675 ISS NA
	12 olfactory bulb interneuron differentiation GO:0021889 IEA NA
	13 cerebral cortex development GO:0021987 IEA NA
	14 lung development GO:0030324 IEA NA
	15 production of siRNA involved in RNA interference GO:0030422 IDA 15973356
	16 targeting of mRNA for destruction involved in RNA interference GO:0030423 IMP 15973356
	17 pre-miRNA processing GO:0031054 IDA 15973356
	18 hair follicle morphogenesis GO:0031069 IEA NA
	19 positive regulation of myelination GO:0031643 ISS NA
	20 peripheral nervous system myelin formation GO:0032290 ISS NA
	21 embryonic hindlimb morphogenesis GO:0035116 IEA NA
	22 production of miRNAs involved in gene silencing by miRNA GO:0035196 ISS NA
	23 multicellular organism growth GO:0035264 IEA NA
	24 regulation of viral genome replication GO:0045069 IEA NA
	25 regulation of neuron differentiation GO:0045664 IEA NA
	26 mRNA stabilization GO:0048255 IEA NA
	27 spleen development GO:0048536 IEA NA
	28 reproductive structure development GO:0048608 IEA NA
	29 regulation of oligodendrocyte differentiation GO:0048713 IEA NA
	30 branching morphogenesis of an epithelial tube GO:0048754 IEA NA
	31 neuron projection morphogenesis GO:0048812 ISS NA
	32 spindle assembly GO:0051225 IEA NA
	33 defense response to virus GO:0051607 IEA NA
	34 regulation of cell cycle GO:0051726 IEA NA
	35 cardiac muscle cell development GO:0055013 IEA NA
	36 inner ear receptor cell development GO:0060119 IEA NA
	37 intestinal epithelial cell development GO:0060576 IEA NA
	38 regulation of enamel mineralization GO:0070173 IEA NA
	39 hair follicle cell proliferation GO:0071335 IEA NA
	40 RNA phosphodiester bond hydrolysis GO:0090501 IDA NA
	41 RNA phosphodiester bond hydrolysis, endonucleolytic GO:0090502 IDA NA
	42 RNA phosphodiester bond hydrolysis, endonucleolytic GO:0090502 IEA NA
	43 positive regulation of miRNA metabolic process GO:2000630 IEA NA


	$`_score`
	$`_score`[[1]]
	[1] 0.8782682

	$`_id`
	$`_id`[[1]]
	[1] "23405"

view raw biostars-116956.R hosted with ❤ by GitHub

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 10.8 years ago by adam.maikai ▴ 530

Entering edit mode

15.2 years ago

Jandot ▴ 370

Hey Ryan,

Just had a very quick look at this, so am not presenting the whole solution here. However, using the ruby-ensembl-api and given a

[?]

Result:

[?]

GO:0005488
GO:0005515
GO:0007275
GO:0005622
GO:0007155

[?]

This uses the Ensembl stable ID instead of the RefSeq accession, but you should be easily able to link one to the other.

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Jandot ▴ 370

Entering edit mode

15.2 years ago

Dario Corrada ▴ 70

For my work I've used an R package called CORNA 1 and I find very useful functions for a batch retrieves. Here there is an example code chunk:

# This script retrieve a list of dataframes for crosslinks among RefSeqID, Ensembl IDs (transcript e gene), GO IDs (cellular component, molecular function e biological process)

library(CORNA);

biomart = "ensembl"; # BioMart DB (for a complete list of avalaible databases use "listMarts" function [biomaRt])
dataset = "mmusculus_gene_ensembl"; # dataset
# for a complete list of avalaible datasets use a chunck like this:
# library(biomaRt)
# mart <- useMart('ensembl')
# listDatasets(mart)
org="mmu"; # organism, e.g. mmu -> Mus Musculus

# crosslinks between RefSeq and Ensembl Transcript
refseq2ensembl_tran <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_transcript_id"),
                                        col.new=c("refseq", "ensembl_transcript"));

# crosslinks between RefSeq and Ensembl Gene
refseq2ensembl_gene <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_gene_id"),
                                        col.new=c("refseq", "ensembl_gene"));

# crosslinks between Ensembl Transcript and GObp
tran2gobp  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_biological_process_id"),
                                col.new=c("ensembl_transcript", "gobp"));

# crosslinks between Ensembl Transcript and GOcc
tran2gocc  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_cellular_component_id"),
                                col.new=c("ensembl_transcript", "gocc"));

# crosslinks between Ensembl Transcript and GOmf
tran2gomf  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_molecular_function_id"),
                                col.new=c("ensembl_transcript", "gomf"));

# links between GO id and term
corna.go2term <- GO2df.fun(url = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz");

# pathway information from KEGG
corna.gene2path <- KEGG2df.fun(org = org);

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.2 years ago by Dario Corrada ▴ 70

Entering edit mode

Thank you very much for this link, because I just happen to be doing GO-term enrichment analysis on miRNA targets.

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Ryan Thompson ★ 3.7k

Entering edit mode

15.2 years ago

Andrew Su 5.0k

Seems like a good task for BioMart's MartService. Check out their API. You specify your query using their Query XML syntax, and a good way to generate a template is to do the query using the GUI and then export the corresponding XML.

Check out also this previous question, where QuickGO and goatools are mentioned...

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 15.2 years ago by Andrew Su 5.0k

Entering edit mode

15.2 years ago

Pierre Lindenbaum 166k

I was also interested in using the mysql server for ensembl for this query. After doing some little reverse engineering with the SQL schema, I wrote the following SQL query: query:

use homo_sapiens_core_48_36j;

select  distinct
    GENE_ID.stable_id as "ensembl.gene",
    RNA_ID.stable_id as "ensembl.transcript",
    PROT_ID.stable_id as "ensembl.translation",
    GO.acc as "go.acc",
    GO.name as "go.name",
    GOXREF.linkage_type as "evidence"
from

    ensembl_go_54.term as GO,
    external_db as EXTDB0,
    external_db as EXTDB1,
    object_xref as OX0,
    object_xref as OX1,
    xref as XREF0,
    xref as XREF1,
    transcript as RNA,
    transcript_stable_id as RNA_ID,
    gene as GENE,
    gene_stable_id as GENE_ID,
    translation as PROT,
    translation_stable_id as PROT_ID,
    go_xref as GOXREF
where
    XREF0.dbprimary_acc="NM_030621" and
    XREF0.external_db_id=EXTDB0.external_db_id and
    EXTDB0.db_name="RefSeq_dna" and
    OX0.xref_id=XREF0.xref_id and
    RNA.gene_id=GENE.gene_id and
    GENE.gene_id= GENE_ID.gene_id and
    RNA.transcript_id=OX0.ensembl_id and
    RNA_ID.transcript_id=RNA.transcript_id and
    PROT.transcript_id = RNA.transcript_id and
    OX1.ensembl_id=PROT.translation_id and
    PROT.translation_id=PROT_ID.translation_id and
    OX1.ensembl_object_type='Translation' and
    OX1.xref_id=XREF1.xref_id and
    GOXREF.object_xref_id=OX1.object_xref_id and 
    XREF1.external_db_id=EXTDB1.external_db_id and
    EXTDB1.db_name="GO" and
    GO.acc=XREF1.dbprimary_acc

order by GO.acc;

Execute:

mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306 < query.sql

Result:

ensembl.gene    ensembl.transcript      ensembl.translation     go.acc  go.name evidence
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0000166      nucleotide binding      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0001525      angiogenesis    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003676      nucleic acid binding    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003677      DNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003723      RNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003725      double-stranded RNA binding     IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004386      helicase activity       IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004519      endonuclease activity   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004525      ribonuclease III activity       IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005515      protein binding IPI
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005524      ATP binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005622      intracellular   NAS
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0006396      RNA processing  IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0008026      ATP-dependent helicase activity IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0016787      hydrolase activity      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0019827      stem cell maintenance   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030324      lung development        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030422      RNA interference, production of siRNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030423      RNA interference, targeting of mRNA for destruction     IEP
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0031047      gene silencing by RNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035116      embryonic hindlimb morphogenesis        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035196      gene silencing by miRNA, production of miRNAs   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0048754      branching morphogenesis of a tube       IEA

Note: my previous solution as well as Neil's one don't list all the GO terms visible here. This SQL looks probably better even if the result is still different with the web page (e.g.: I can see GO:0016442 on the web page, but not in my result ). The ensembl schema is complex, and I might have missed some links (left join ?) between the tables.

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.2 years ago by Pierre Lindenbaum 166k

Entering edit mode

ok, it should work better with homo_sapiens_core_58_37c Thanks @joachimbaran

https://twitter.com/joachimbaran/status/14981387076

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Pierre Lindenbaum 166k

Entering edit mode

15.2 years ago

Khader Shameer 18k

You can get a quick look of annotation using GOOSE (GO Online SQL Environment) GOOSE Manual and a long list of sample queries here. For a perl based solution I will try with Onto-perl. Solution for your exact question is here. Loonger list of examples here - a valuable reference that provided most of the anticipated use cases in GO and solution using Onto-Perl.

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 15.2 years ago by Khader Shameer 18k