Is there a table that can be downloaded from FTP or accessed programmatically that links the Ensembl IDs for a given genome (like 'hg18' or 'mm9') to their GO terms, i.e. IDs of the form "GO:..."? Is there a UCSC table that does this? I did not see any such table in: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/
You can create a table fairly easily using R and the biomaRt package. The code below builds a table from Ensembl, which you could export or write to disk, and also puts the result in a list-like format, which is a convenient R data structure:
library(biomaRt)
# select mart and data set
bm <- useMart("ensembl")
bm <- useDataset("mmusculus_gene_ensembl", mart=bm)
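# Get Ensembl gene ids, gene names and GO terms
# (recent Ensembl releases name this attribute 'external_gene_name'; older ones used 'external_gene_id')
EG2GO <- getBM(mart=bm, attributes=c('ensembl_gene_id','external_gene_name','go_id'))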
# examine result
head(EG2GO,15)
# Remove blank entries
EG2GO <- EG2GO[EG2GO$go_id != '',]
# convert from table format to list format
geneID2GO <- by(EG2GO$go_id,
                EG2GO$ensembl_gene_id,
                function(x) as.character(x))
# examine result
head(geneID2GO)
# terms can be accessed using gene ids in various ways
> geneID2GO$ENSMUSG00000098488
[1] "GO:0009395" "GO:0008152" "GO:0005829" "GO:0030659" "GO:0004620"
[6] "GO:0004623" "GO:0046872" "GO:0005515"
> geneID2GO[['ENSMUSG00000098488']]
[1] "GO:0009395" "GO:0008152" "GO:0005829" "GO:0030659" "GO:0004620"
[6] "GO:0004623" "GO:0046872" "GO:0005515"
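The table-to-list step can also be done with base R's `split()`, which returns a plain named list rather than a `by` object, and the table itself can be written to disk with `write.table()`. Here is a minimal sketch using a small hand-made data frame in place of the real `getBM()` result. The gene ids (`GENE_A` etc.) and their GO pairings are made up for illustration, and the output path is just a temporary file:

```r
# Toy stand-in for the getBM() result; gene ids and pairings are illustrative only
EG2GO <- data.frame(
  ensembl_gene_id = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  go_id           = c("GO:0005515", "GO:0046872", "GO:0008152", ""),
  stringsAsFactors = FALSE
)

# Remove blank entries (genes with no GO annotation)
EG2GO <- EG2GO[EG2GO$go_id != "", ]

# split() gives a plain named list: gene id -> character vector of GO ids
geneID2GO <- split(EG2GO$go_id, EG2GO$ensembl_gene_id)

# Export the table as tab-delimited text (temporary file used as a placeholder path)
out_file <- tempfile(fileext = ".tsv")
write.table(EG2GO, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The result is accessed the same way as before, e.g. `geneID2GO[["GENE_A"]]`.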
On BioMart, when you return a table with the GO accession number, each gene is only associated with a single GO term. Shouldn't there be many GO terms for most genes? Which one does BioMart choose?
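One way to check this on your own result is to count rows per gene. The list output above shows eight GO terms for a single gene, which suggests the table holds one row per gene/GO-term pair, so a gene with several annotations appears on several rows. A small sketch of that check, assuming a table with the same two columns; the data frame here is made up (`GENE_A`, `GENE_B` are placeholder ids):

```r
# Toy table in the same shape as the getBM() result; contents are illustrative only
EG2GO <- data.frame(
  ensembl_gene_id = c("GENE_A", "GENE_A", "GENE_A", "GENE_B"),
  go_id           = c("GO:0005515", "GO:0046872", "GO:0008152", "GO:0005829"),
  stringsAsFactors = FALSE
)

# Number of rows (GO terms) per gene id
terms_per_gene <- table(EG2GO$ensembl_gene_id)
print(terms_per_gene)

# Genes that appear on more than one row, i.e. have more than one GO term
multi <- names(terms_per_gene)[terms_per_gene > 1]
```

If every count comes back as 1 on a real export, it is worth checking whether rows were de-duplicated by gene id somewhere along the way.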
The UCSC Table Browser should have this, though it may take a little digging to pull all the relevant info together (it's not exactly user-friendly unless you understand SQL). I typically go with BioMart myself, which may have the UCSC IDs as well.
Or, if you prefer, using pointy-clicky BioMart. See the help video here.
As I say every few months: the answer to almost every "how to convert ID X to ID Y" question is BioMart, or UCSC tables.