Question

How Do I Do Simple Go Term Lookup Given A Gene (Or Mrna) Identifier?

21

Entering edit mode

14.5 years ago

Ryan Thompson ★ 3.6k

Suppose I have a gene ID such as NM_030621, which happens to be one of the human mRNA splice variants of DICER1. How can I programmatically get a list of GO terms with which this ID is associated? Going through protein or gene IDs as intermediaries is acceptable.

I want to do this for a whole list of such IDs, all corresponding to RefSeq mRNA sequences. I don't want to do enrichment analysis or anything, just get a list of GO terms for each mRNA.

Preferred languages are Perl and R.

gene annotation database • 37k views

ADD COMMENT • link updated 3.7 years ago by Ram 44k • written 14.5 years ago by Ryan Thompson ★ 3.6k

2

Entering edit mode

Thanks for all the answers, everyone. I ended up using biomaRt in R.

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 14.5 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

Nice question, awesome answers, just added my suggestion using GOOSE.

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 14.5 years ago by Khader Shameer 18k

0

Entering edit mode

Thank you for this post ! I can't believe that something so elemental would be so complicated.

ADD REPLY • link 5.9 years ago by lagartija ▴ 160

Ram · Answer 1 · 2010-05-27

20

Entering edit mode

14.5 years ago

Neilfws 49k

Using R, you can do this with the biomaRt package. Here is a quick example:

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
results <- getBM(attributes = c("refseq_dna", "go_molecular_function_id",
           "go_molecular_function__dm_name_1006"), filters = "refseq_dna",
           values = c("NM_030621"), mart = mart)

In the getBM() function, we specify which attributes to return (RefSeq ID, GO MF ID, GO MF Name), the filters to use for the query (RefSeq ID) and the filter values (NM_030621). You could also pass a list of values, e.g. from a data frame column as mydf$column.

Result:

        refseq_dna  go_molecular_function_id go_molecular_function__dm_name_1006
1       NM_030621                 GO:0005515                     protein binding
2       NM_030621                 GO:0008026     ATP-dependent helicase activity
3       NM_030621                 GO:0000166                  nucleotide binding
4       NM_030621                 GO:0000287               magnesium ion binding
5       NM_030621                 GO:0004386                   helicase activity
6       NM_030621                 GO:0004519               endonuclease activity
7       NM_030621                 GO:0016787                  hydrolase activity
8       NM_030621                 GO:0030145               manganese ion binding
9       NM_030621                 GO:0004525           ribonuclease III activity
10      NM_030621                 GO:0003725         double-stranded RNA binding
11      NM_030621                 GO:0005524                         ATP binding
12      NM_030621                 GO:0003723                         RNA binding
13      NM_030621                 GO:0003676                nucleic acid binding
14      NM_030621                 GO:0003677                         DNA binding

You could also use the BioMart website, where you can upload a list of IDs, specify attributes and filters and download results as delimited text files.

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 14.5 years ago by Neilfws 49k

0

Entering edit mode

Thanks, that worked great. Biomart looks incredibly useful.

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 14.5 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

Neil, do you known why there are more GO identifiers on ensembl for this protein?

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 14.5 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I think the Ensembl page shows all three GO ontologies (BP, CC and MF) - the example above is only MF.

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 14.5 years ago by Neilfws 49k

0

Entering edit mode

When I try this, I find that 'refseq_dna', 'go_molecular_function_id', and 'go_molecular_function__dm_name_1006' are not valid attributes. Has something changed?

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.1 years ago by Greg P ▴ 70

0

Entering edit mode

But, in this way, would not you get limited to the RefSeq ID's which have Ensemble information in the database?

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 9.8 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

What if your organism is not in ensembl list anymore, like deinococcus radiodurans?

ADD REPLY • link 7.9 years ago by xioli2013 ▴ 10

0

Entering edit mode

How to do this for only BP goterms?

ADD REPLY • link 5.6 years ago by arya64898 • 0

Ram · Answer 2 · 2010-05-27

My XSLT solution:

The stylesheet:


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >

<xsl:output method="text" encoding="ISO-8859-1"/>

<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="Dbtag[Dbtag_db='GeneID']">
  <xsl:for-each select="Dbtag_tag/Object-id/Object-id_id">
     <xsl:variable name="genid" select="."/>
    <xsl:variable name="url" select="concat('&lt;a href=" http:="" eutils.ncbi.nlm.nih.gov="" entrez="" eutils="" efetch.fcgi?db="gene&amp;amp;retmode=xml&amp;amp;id=',$genid)" "="" rel="nofollow">http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=',$genid)" />
    <xsl:apply-templates select="document($url,//Other-source)" mode="go"/>
  </xsl:for-each>
</xsl:template>

<xsl:template match="Other-source" mode="go">
 <xsl:if test="Other-source_src/Dbtag/Dbtag_db='GO'">
 <xsl:value-of select="concat('GO:',Other-source_src/Dbtag/Dbtag_tag/Object-id/Object-id_id)"/>
<xsl:text>        </xsl:text>
<xsl:value-of select="Other-source_anchor"/>
<xsl:text>
</xsl:text>
 </xsl:if>
</xsl:template>

<xsl:template match="text()">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()" mode="go">
</xsl:template>

</xsl:stylesheet>

Apply this stylesheet with xsltproc for NM_030621:

xsltproc --novalid stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_030621&rettype=xml"

Result:

GO:5524         ATP binding
GO:8026         ATP-dependent helicase activity
GO:3725         double-stranded RNA binding
GO:4519         endonuclease activity
GO:4386         helicase activity
GO:16787        hydrolase activity
GO:46872        metal ion binding
GO:166          nucleotide binding
GO:5515         protein binding
GO:4525         ribonuclease III activity
GO:6396         RNA processing
GO:1525         angiogenesis
GO:48754        branching morphogenesis of a tube
GO:35116        embryonic hindlimb morphogenesis
GO:31047        gene silencing by RNA
GO:30324        lung development
GO:31054        pre-microRNA processing
GO:30422        production of siRNA involved in RNA interference
GO:19827        stem cell maintenance
GO:30423        targeting of mRNA for destruction involved in RNA interference
GO:16442        RNA-induced silencing complex
GO:5737         cytoplasm
GO:5622         intracellular

Ram · Answer 3 · 2014-10-27

7

Entering edit mode

10.1 years ago

adam.maikai ▴ 530

Using MyGene.info's new R/Bioconductor package mygene, this can be done simply with the query() function.

library(mygene)
res<-query('NM_030621', fields='go', species='human')$hits
lapply(res, as.list)

This returns a data.frame for CC, MF, and BP, matching score, and the gene id.

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 10.1 years ago by adam.maikai ▴ 530

Ram · Answer 4 · 2010-05-27

5

Entering edit mode

14.5 years ago

Jandot ▴ 370

Hey Ryan,

Just had a very quick look at this, so am not presenting the whole solution here. However, using the ruby-ensembl-api and given a

[?]

Result:

[?]

GO:0005488
GO:0005515
GO:0007275
GO:0005622
GO:0007155

[?]

This uses the Ensembl stable ID instead of the RefSeq accession, but you should be easily able to link one to the other.

ADD COMMENT • link updated 3.7 years ago by Ram 44k • written 14.5 years ago by Jandot ▴ 370

Ram · Answer 5 · 2010-05-27

For my work I've used an R package called CORNA 1 and I find very useful functions for a batch retrieves. Here there is an example code chunk:

# This script retrieve a list of dataframes for crosslinks among RefSeqID, Ensembl IDs (transcript e gene), GO IDs (cellular component, molecular function e biological process)

library(CORNA);

biomart = "ensembl"; # BioMart DB (for a complete list of avalaible databases use "listMarts" function [biomaRt])
dataset = "mmusculus_gene_ensembl"; # dataset
# for a complete list of avalaible datasets use a chunck like this:
# library(biomaRt)
# mart <- useMart('ensembl')
# listDatasets(mart)
org="mmu"; # organism, e.g. mmu -> Mus Musculus

# crosslinks between RefSeq and Ensembl Transcript
refseq2ensembl_tran <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_transcript_id"),
                                        col.new=c("refseq", "ensembl_transcript"));

# crosslinks between RefSeq and Ensembl Gene
refseq2ensembl_gene <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_gene_id"),
                                        col.new=c("refseq", "ensembl_gene"));

# crosslinks between Ensembl Transcript and GObp
tran2gobp  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_biological_process_id"),
                                col.new=c("ensembl_transcript", "gobp"));

# crosslinks between Ensembl Transcript and GOcc
tran2gocc  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_cellular_component_id"),
                                col.new=c("ensembl_transcript", "gocc"));

# crosslinks between Ensembl Transcript and GOmf
tran2gomf  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_molecular_function_id"),
                                col.new=c("ensembl_transcript", "gomf"));

# links between GO id and term
corna.go2term <- GO2df.fun(url = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz");

# pathway information from KEGG
corna.gene2path <- KEGG2df.fun(org = org);

Ram · Answer 6 · 2010-05-27

3

Entering edit mode

14.5 years ago

Andrew Su 4.9k

Seems like a good task for BioMart's MartService. Check out their API. You specify your query using their Query XML syntax, and a good way to generate a template is to do the query using the GUI and then export the corresponding XML.

Check out also this previous question, where QuickGO and goatools are mentioned...

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 14.5 years ago by Andrew Su 4.9k

Ram · Answer 7 · 2010-05-29

I was also interested in using the mysql server for ensembl for this query. After doing some little reverse engineering with the SQL schema, I wrote the following SQL query: query:

use homo_sapiens_core_48_36j;

select  distinct
    GENE_ID.stable_id as "ensembl.gene",
    RNA_ID.stable_id as "ensembl.transcript",
    PROT_ID.stable_id as "ensembl.translation",
    GO.acc as "go.acc",
    GO.name as "go.name",
    GOXREF.linkage_type as "evidence"
from

    ensembl_go_54.term as GO,
    external_db as EXTDB0,
    external_db as EXTDB1,
    object_xref as OX0,
    object_xref as OX1,
    xref as XREF0,
    xref as XREF1,
    transcript as RNA,
    transcript_stable_id as RNA_ID,
    gene as GENE,
    gene_stable_id as GENE_ID,
    translation as PROT,
    translation_stable_id as PROT_ID,
    go_xref as GOXREF
where
    XREF0.dbprimary_acc="NM_030621" and
    XREF0.external_db_id=EXTDB0.external_db_id and
    EXTDB0.db_name="RefSeq_dna" and
    OX0.xref_id=XREF0.xref_id and
    RNA.gene_id=GENE.gene_id and
    GENE.gene_id= GENE_ID.gene_id and
    RNA.transcript_id=OX0.ensembl_id and
    RNA_ID.transcript_id=RNA.transcript_id and
    PROT.transcript_id = RNA.transcript_id and
    OX1.ensembl_id=PROT.translation_id and
    PROT.translation_id=PROT_ID.translation_id and
    OX1.ensembl_object_type='Translation' and
    OX1.xref_id=XREF1.xref_id and
    GOXREF.object_xref_id=OX1.object_xref_id and 
    XREF1.external_db_id=EXTDB1.external_db_id and
    EXTDB1.db_name="GO" and
    GO.acc=XREF1.dbprimary_acc

order by GO.acc;

Execute:

mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306 < query.sql

Result:

ensembl.gene    ensembl.transcript      ensembl.translation     go.acc  go.name evidence
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0000166      nucleotide binding      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0001525      angiogenesis    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003676      nucleic acid binding    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003677      DNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003723      RNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003725      double-stranded RNA binding     IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004386      helicase activity       IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004519      endonuclease activity   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004525      ribonuclease III activity       IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005515      protein binding IPI
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005524      ATP binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005622      intracellular   NAS
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0006396      RNA processing  IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0008026      ATP-dependent helicase activity IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0016787      hydrolase activity      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0019827      stem cell maintenance   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030324      lung development        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030422      RNA interference, production of siRNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030423      RNA interference, targeting of mRNA for destruction     IEP
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0031047      gene silencing by RNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035116      embryonic hindlimb morphogenesis        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035196      gene silencing by miRNA, production of miRNAs   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0048754      branching morphogenesis of a tube       IEA

Note: my previous solution as well as Neil's one don't list all the GO terms visible here. This SQL looks probably better even if the result is still different with the web page (e.g.: I can see GO:0016442 on the web page, but not in my result ). The ensembl schema is complex, and I might have missed some links (left join ?) between the tables.

Ram · Answer 8 · 2010-05-29

0

Entering edit mode

14.5 years ago

Khader Shameer 18k

You can get a quick look of annotation using GOOSE (GO Online SQL Environment) GOOSE Manual and a long list of sample queries here. For a perl based solution I will try with Onto-perl. Solution for your exact question is here. Loonger list of examples here - a valuable reference that provided most of the anticipated use cases in GO and solution using Onto-Perl.

ADD COMMENT • link updated 3.7 years ago by Ram 44k • written 14.5 years ago by Khader Shameer 18k