Is there a table that can be downloaded from FTP or accessed programmatically that links the Ensembl IDs for a given genome (like 'hg18' or 'mm9') to their GO terms, i.e. IDs of the form "GO:..."? Is there a UCSC table that does this? I did not see any such table in: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/
You can create a table fairly easily using R and the biomaRt package. The code below builds a table from Ensembl, which you could export or write to disk, and also converts the result to a list-like format, which is a convenient R data structure:
library(biomaRt)

# select mart and data set
bm <- useMart("ensembl")
bm <- useDataset("mmusculus_gene_ensembl", mart=bm)

# Get ensembl gene ids and GO terms
EG2GO <- getBM(mart=bm, attributes=c('ensembl_gene_id','external_gene_id','go_id'))

# examine result
head(EG2GO,15)

# Remove blank entries
EG2GO <- EG2GO[EG2GO$go_id!='',]

# convert from table format to list format
geneID2GO <- by(EG2GO$go_id,
                EG2GO$ensembl_gene_id,
                function(x) as.character(x))

# examine result
head(geneID2GO)

# terms can be accessed using gene ids in various ways
> geneID2GO$ENSMUSG00000098488
[1] "GO:0009395" "GO:0008152" "GO:0005829" "GO:0030659" "GO:0004620"
[6] "GO:0004623" "GO:0046872" "GO:0005515"
> geneID2GO[['ENSMUSG00000098488']]
[1] "GO:0009395" "GO:0008152" "GO:0005829" "GO:0030659" "GO:0004620"
[6] "GO:0004623" "GO:0046872" "GO:0005515"
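The "export or write to disk" step mentioned above could look roughly like the sketch below. It does not query Ensembl: the small EG2GO data frame is a toy stand-in for the getBM() result, the gene ID ENSMUSG_HYPOTHETICAL1 and its rows are made up for illustration, and the output path is just a temporary file. It also uses split() as one possible alternative to by() for the list conversion.

```r
# Toy stand-in for the table returned by getBM(); only the first gene ID and
# its two GO terms come from the example above, the rest is hypothetical.
EG2GO <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000098488", "ENSMUSG00000098488",
                      "ENSMUSG_HYPOTHETICAL1", "ENSMUSG_HYPOTHETICAL1"),
  go_id = c("GO:0009395", "GO:0008152", "GO:0005515", ""),
  stringsAsFactors = FALSE
)

# Remove rows without a GO term, as in the code above
EG2GO <- EG2GO[EG2GO$go_id != "", ]

# Alternative list conversion: a plain named list of character vectors
geneID2GO <- split(EG2GO$go_id, EG2GO$ensembl_gene_id)

# Write the long-format table to a tab-separated file (temporary path here)
out_file <- tempfile(fileext = ".tsv")
write.table(EG2GO, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The file can be read back with read.delim(out_file). In real use you would replace the toy data frame with the getBM() result and choose your own file name.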
On BioMart, when you return a table with the GO accession number, each gene is only associated with a single GO term. Shouldn't there be many GO terms for most genes? Which one does BioMart choose?
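One way to check this in R, as a sketch: the table built in the answer above is in long format, with one row per gene/GO-term pair, so a gene with several terms appears on several rows (the example gene above has eight). Counting rows per gene ID shows how many terms each gene has. The data frame below is a toy stand-in for the getBM() result; ENSMUSG_HYPOTHETICAL1 and its single term are made up for illustration.

```r
# Toy stand-in for the getBM() result: the eight GO terms listed above for
# ENSMUSG00000098488, plus one hypothetical gene with a single term.
EG2GO <- data.frame(
  ensembl_gene_id = c(rep("ENSMUSG00000098488", 8), "ENSMUSG_HYPOTHETICAL1"),
  go_id = c("GO:0009395", "GO:0008152", "GO:0005829", "GO:0030659",
            "GO:0004620", "GO:0004623", "GO:0046872", "GO:0005515",
            "GO:0005515"),
  stringsAsFactors = FALSE
)

# Number of rows (GO terms) per gene ID
terms_per_gene <- table(EG2GO$ensembl_gene_id)
print(terms_per_gene)

# Gene IDs that carry more than one GO term
multi <- names(terms_per_gene)[terms_per_gene > 1]
```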
The UCSC Table Browser should have this, though it may require a little digging to get all the relevant info together (it's not exactly user friendly unless you understand SQL). I typically go with BioMart myself, which may have the UCSC IDs as well.
Or, if you prefer, using pointy-clicky BioMart. See the help video here.
As I say every few months: the answer to almost every "how to convert ID X to ID Y" question is BioMart, or UCSC tables.