How Do I Do Simple Go Term Lookup Given A Gene (Or Mrna) Identifier?
8
21
Entering edit mode
14.5 years ago
Ryan Thompson ★ 3.6k

Suppose I have a gene ID such as NM_030621, which happens to be one of the human mRNA splice variants of DICER1. How can I programmatically get a list of GO terms with which this ID is associated? Going through protein or gene IDs as intermediaries is acceptable.

I want to do this for a whole list of such IDs, all corresponding to RefSeq mRNA sequences. I don't want to do enrichment analysis or anything, just get a list of GO terms for each mRNA.

Preferred languages are Perl and R.

gene annotation database • 37k views
ADD COMMENT
2
Entering edit mode

Thanks for all the answers, everyone. I ended up using biomaRt in R.

ADD REPLY
0
Entering edit mode

Nice question, awesome answers, just added my suggestion using GOOSE.

ADD REPLY
0
Entering edit mode

Thank you for this post ! I can't believe that something so elemental would be so complicated.

ADD REPLY
20
Entering edit mode
14.5 years ago
Neilfws 49k

Using R, you can do this with the biomaRt package. Here is a quick example:

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
results <- getBM(attributes = c("refseq_dna", "go_molecular_function_id",
           "go_molecular_function__dm_name_1006"), filters = "refseq_dna",
           values = c("NM_030621"), mart = mart)

In the getBM() function, we specify which attributes to return (RefSeq ID, GO MF ID, GO MF Name), the filters to use for the query (RefSeq ID) and the filter values (NM_030621). You could also pass a list of values, e.g. from a data frame column as mydf$column.

Result:

        refseq_dna  go_molecular_function_id go_molecular_function__dm_name_1006
1       NM_030621                 GO:0005515                     protein binding
2       NM_030621                 GO:0008026     ATP-dependent helicase activity
3       NM_030621                 GO:0000166                  nucleotide binding
4       NM_030621                 GO:0000287               magnesium ion binding
5       NM_030621                 GO:0004386                   helicase activity
6       NM_030621                 GO:0004519               endonuclease activity
7       NM_030621                 GO:0016787                  hydrolase activity
8       NM_030621                 GO:0030145               manganese ion binding
9       NM_030621                 GO:0004525           ribonuclease III activity
10      NM_030621                 GO:0003725         double-stranded RNA binding
11      NM_030621                 GO:0005524                         ATP binding
12      NM_030621                 GO:0003723                         RNA binding
13      NM_030621                 GO:0003676                nucleic acid binding
14      NM_030621                 GO:0003677                         DNA binding

You could also use the BioMart website, where you can upload a list of IDs, specify attributes and filters and download results as delimited text files.

ADD COMMENT
0
Entering edit mode

Thanks, that worked great. Biomart looks incredibly useful.

ADD REPLY
0
Entering edit mode

Neil, do you known why there are more GO identifiers on ensembl for this protein?

ADD REPLY
0
Entering edit mode

I think the Ensembl page shows all three GO ontologies (BP, CC and MF) - the example above is only MF.

ADD REPLY
0
Entering edit mode

When I try this, I find that 'refseq_dna', 'go_molecular_function_id', and 'go_molecular_function__dm_name_1006' are not valid attributes. Has something changed?

ADD REPLY
0
Entering edit mode

But, in this way, would not you get limited to the RefSeq ID's which have Ensemble information in the database?

ADD REPLY
0
Entering edit mode

What if your organism is not in ensembl list anymore, like deinococcus radiodurans?

ADD REPLY
0
Entering edit mode

How to do this for only BP goterms?

ADD REPLY
7
Entering edit mode
14.5 years ago

My XSLT solution:

The stylesheet:


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >

<xsl:output method="text" encoding="ISO-8859-1"/>

<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="Dbtag[Dbtag_db='GeneID']">
  <xsl:for-each select="Dbtag_tag/Object-id/Object-id_id">
     <xsl:variable name="genid" select="."/>
    <xsl:variable name="url" select="concat('&lt;a href=" http:="" eutils.ncbi.nlm.nih.gov="" entrez="" eutils="" efetch.fcgi?db="gene&amp;amp;retmode=xml&amp;amp;id=',$genid)" "="" rel="nofollow">http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=',$genid)" />
    <xsl:apply-templates select="document($url,//Other-source)" mode="go"/>
  </xsl:for-each>
</xsl:template>

<xsl:template match="Other-source" mode="go">
 <xsl:if test="Other-source_src/Dbtag/Dbtag_db='GO'">
 <xsl:value-of select="concat('GO:',Other-source_src/Dbtag/Dbtag_tag/Object-id/Object-id_id)"/>
<xsl:text>        </xsl:text>
<xsl:value-of select="Other-source_anchor"/>
<xsl:text>
</xsl:text>
 </xsl:if>
</xsl:template>

<xsl:template match="text()">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()" mode="go">
</xsl:template>

</xsl:stylesheet>

Apply this stylesheet with xsltproc for NM_030621:

xsltproc --novalid stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_030621&rettype=xml"

Result:

GO:5524         ATP binding
GO:8026         ATP-dependent helicase activity
GO:3725         double-stranded RNA binding
GO:4519         endonuclease activity
GO:4386         helicase activity
GO:16787        hydrolase activity
GO:46872        metal ion binding
GO:166          nucleotide binding
GO:5515         protein binding
GO:4525         ribonuclease III activity
GO:6396         RNA processing
GO:1525         angiogenesis
GO:48754        branching morphogenesis of a tube
GO:35116        embryonic hindlimb morphogenesis
GO:31047        gene silencing by RNA
GO:30324        lung development
GO:31054        pre-microRNA processing
GO:30422        production of siRNA involved in RNA interference
GO:19827        stem cell maintenance
GO:30423        targeting of mRNA for destruction involved in RNA interference
GO:16442        RNA-induced silencing complex
GO:5737         cytoplasm
GO:5622         intracellular
ADD COMMENT
0
Entering edit mode

Interesting solution!

ADD REPLY
0
Entering edit mode

Pierre could you share your experience/comments on when to go the xslt route for such tasks?

ADD REPLY
0
Entering edit mode

... hum, just a kind of feeling... :-) I first think of XSLT when I know that I can get the source of data as XML, when this source is not too large (must fit in memory), when speed is not an issue and I don't need a complicated processing (e.g. tokenizing, aggregation of data (mean/uniq/max/etc...) )

ADD REPLY
7
Entering edit mode
10.1 years ago
adam.maikai ▴ 530

Using MyGene.info's new R/Bioconductor package mygene, this can be done simply with the query() function.

library(mygene)
res<-query('NM_030621', fields='go', species='human')$hits
lapply(res, as.list)

This returns a data.frame for CC, MF, and BP, matching score, and the gene id.

ADD COMMENT
5
Entering edit mode
14.5 years ago
Jandot ▴ 370

Hey Ryan,

Just had a very quick look at this, so am not presenting the whole solution here. However, using the ruby-ensembl-api and given a

[?]

Result:

[?]

GO:0005488
GO:0005515
GO:0007275
GO:0005622
GO:0007155

[?]

This uses the Ensembl stable ID instead of the RefSeq accession, but you should be easily able to link one to the other.

ADD COMMENT
4
Entering edit mode
14.5 years ago

For my work I've used an R package called CORNA 1 and I find very useful functions for a batch retrieves. Here there is an example code chunk:

# This script retrieve a list of dataframes for crosslinks among RefSeqID, Ensembl IDs (transcript e gene), GO IDs (cellular component, molecular function e biological process)

library(CORNA);

biomart = "ensembl"; # BioMart DB (for a complete list of avalaible databases use "listMarts" function [biomaRt])
dataset = "mmusculus_gene_ensembl"; # dataset
# for a complete list of avalaible datasets use a chunck like this:
# library(biomaRt)
# mart <- useMart('ensembl')
# listDatasets(mart)
org="mmu"; # organism, e.g. mmu -> Mus Musculus

# crosslinks between RefSeq and Ensembl Transcript
refseq2ensembl_tran <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_transcript_id"),
                                        col.new=c("refseq", "ensembl_transcript"));

# crosslinks between RefSeq and Ensembl Gene
refseq2ensembl_gene <-  BioMart2df.fun( biomart=biomart, dataset=dataset,
                                        col.old=c("refseq_dna", "ensembl_gene_id"),
                                        col.new=c("refseq", "ensembl_gene"));

# crosslinks between Ensembl Transcript and GObp
tran2gobp  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_biological_process_id"),
                                col.new=c("ensembl_transcript", "gobp"));

# crosslinks between Ensembl Transcript and GOcc
tran2gocc  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_cellular_component_id"),
                                col.new=c("ensembl_transcript", "gocc"));

# crosslinks between Ensembl Transcript and GOmf
tran2gomf  <- BioMart2df.fun(   biomart=biomart, dataset=dataset,
                                col.old=c("ensembl_transcript_id", "go_molecular_function_id"),
                                col.new=c("ensembl_transcript", "gomf"));

# links between GO id and term
corna.go2term <- GO2df.fun(url = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz");

# pathway information from KEGG
corna.gene2path <- KEGG2df.fun(org = org);
ADD COMMENT
0
Entering edit mode

Thank you very much for this link, because I just happen to be doing GO-term enrichment analysis on miRNA targets.

ADD REPLY
3
Entering edit mode
14.5 years ago
Andrew Su 4.9k

Seems like a good task for BioMart's MartService. Check out their API. You specify your query using their Query XML syntax, and a good way to generate a template is to do the query using the GUI and then export the corresponding XML.

Check out also this previous question, where QuickGO and goatools are mentioned...

ADD COMMENT
2
Entering edit mode
14.5 years ago

I was also interested in using the mysql server for ensembl for this query. After doing some little reverse engineering with the SQL schema, I wrote the following SQL query: query:

use homo_sapiens_core_48_36j;

select  distinct
    GENE_ID.stable_id as "ensembl.gene",
    RNA_ID.stable_id as "ensembl.transcript",
    PROT_ID.stable_id as "ensembl.translation",
    GO.acc as "go.acc",
    GO.name as "go.name",
    GOXREF.linkage_type as "evidence"
from

    ensembl_go_54.term as GO,
    external_db as EXTDB0,
    external_db as EXTDB1,
    object_xref as OX0,
    object_xref as OX1,
    xref as XREF0,
    xref as XREF1,
    transcript as RNA,
    transcript_stable_id as RNA_ID,
    gene as GENE,
    gene_stable_id as GENE_ID,
    translation as PROT,
    translation_stable_id as PROT_ID,
    go_xref as GOXREF
where
    XREF0.dbprimary_acc="NM_030621" and
    XREF0.external_db_id=EXTDB0.external_db_id and
    EXTDB0.db_name="RefSeq_dna" and
    OX0.xref_id=XREF0.xref_id and
    RNA.gene_id=GENE.gene_id and
    GENE.gene_id= GENE_ID.gene_id and
    RNA.transcript_id=OX0.ensembl_id and
    RNA_ID.transcript_id=RNA.transcript_id and
    PROT.transcript_id = RNA.transcript_id and
    OX1.ensembl_id=PROT.translation_id and
    PROT.translation_id=PROT_ID.translation_id and
    OX1.ensembl_object_type='Translation' and
    OX1.xref_id=XREF1.xref_id and
    GOXREF.object_xref_id=OX1.object_xref_id and 
    XREF1.external_db_id=EXTDB1.external_db_id and
    EXTDB1.db_name="GO" and
    GO.acc=XREF1.dbprimary_acc

order by GO.acc;

Execute:

mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306 < query.sql

Result:

ensembl.gene    ensembl.transcript      ensembl.translation     go.acc  go.name evidence
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0000166      nucleotide binding      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0001525      angiogenesis    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003676      nucleic acid binding    IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003677      DNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003723      RNA binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0003725      double-stranded RNA binding     IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004386      helicase activity       IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004519      endonuclease activity   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0004525      ribonuclease III activity       IDA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005515      protein binding IPI
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005524      ATP binding     IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0005622      intracellular   NAS
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0006396      RNA processing  IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0008026      ATP-dependent helicase activity IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0016787      hydrolase activity      IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0019827      stem cell maintenance   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030324      lung development        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030422      RNA interference, production of siRNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0030423      RNA interference, targeting of mRNA for destruction     IEP
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0031047      gene silencing by RNA   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035116      embryonic hindlimb morphogenesis        IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0035196      gene silencing by miRNA, production of miRNAs   IEA
ENSG00000100697 ENST00000393063 ENSP00000376783 GO:0048754      branching morphogenesis of a tube       IEA

Note: my previous solution as well as Neil's one don't list all the GO terms visible here. This SQL looks probably better even if the result is still different with the web page (e.g.: I can see GO:0016442 on the web page, but not in my result ). The ensembl schema is complex, and I might have missed some links (left join ?) between the tables.

ADD COMMENT
0
Entering edit mode

ok, it should work better with homo_sapiens_core_58_37c Thanks @joachimbaran

https://twitter.com/joachimbaran/status/14981387076

ADD REPLY
0
Entering edit mode
14.5 years ago

You can get a quick look of annotation using GOOSE (GO Online SQL Environment) GOOSE Manual and a long list of sample queries here. For a perl based solution I will try with Onto-perl. Solution for your exact question is here. Loonger list of examples here - a valuable reference that provided most of the anticipated use cases in GO and solution using Onto-Perl.

ADD COMMENT

Login before adding your answer.

Traffic: 2610 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6