Biomart R package - help writing biomart results into existing file.
2
3
Entering edit mode
10.4 years ago
jsgorams ▴ 30

Just starting to learn R so bear with me. We've conducted an RNA-seq experiment and are working on getting gene ontology information on the sequenced genes. I've gotten the biomart R package installed and have been able to successfully request GO terms given a list of ensembl gene IDs. The issue is that I'd like the results to be placed in a file listing each gene ID in a column and then the GO terms in rows corresponding to each gene they describe. For example:

Gene ID:     GO term:
AGAP000002  Proteolysis, oxidation-reduction, protein-binding, etc
AGAP006543  chitin biosynthesis, intracellular, etc

When I just export the biomart results, it lists each GO number separately and doesn't show which gene they correspond to. Is there an easy way to get this exported correctly? If I need to include anymore information to assist you in answering, just let me know. Thanks.

next-gen R RNA-Seq • 4.9k views
ADD COMMENT
5
Entering edit mode
10.4 years ago
David Fredman ★ 1.1k

Here's a solution using the Bioconductor GO.db package, and Hadley Wickham's dplyr package

1. get GO ids for your genes from Ensmart (from my understanding, this you have already done). This query gets the complete list of genes for a given species (zebrafish in this example) - you may want to filter it.

library(biomaRt)
ensmart = useMart('ensembl', 'drerio_gene_ensembl')
dat = getBM(attributes=c('go_id', 'ensembl_gene_id'), mart = ensmart)

2. Get corresponding GO names for each GO term and concatenate them by gene

library(GO.db)
library(dplyr)

report <- filter(dat, !go_id == "") %>% # remove empty go_ids
  mutate(name = Term(go_id)) %>%        # get corresponding GO term names from GO.db by go_id
  group_by(ensembl_gene_id) %>%         # sort by gene id
  summarize(go_descriptions=do.call(paste, c(as.list(name), sep=","))) #concatenate GO names to one field

write.table(report,file="geneId_goDescription.txt")

Alternatively, if you are using a mart that holds the go names in the database as well as the accessions, eg the plant mart

ens_plant_mart = useMart('ENSEMBL_MART_PLANT', 'osativa_eg_gene')
attrs = c("ensembl_gene_id", "go_accession", "go_name_1006", "go_linkage_type")
results = getBM(attributes = attrs,  mart = ens_plant_mart)

... then you don't need the GO.db package, and can modify the second part (aggregation) as follows (this also concatenates a second field, the accession number)

report = filter(results, !go_accession == "") %>%
         group_by(ensembl_gene_id) %>%
         summarize(go_accessions = do.call(paste, </code>c(as.list(go_accession), sep=",")),
                   go_names = do.call(paste, c(as.list(go_name_1006), sep=",")))
ADD COMMENT
0
Entering edit mode

I think this is on the right track but I can't quite get it to work. I used the biomaRt package to retrieve the GO terms:

results <- getBM(
                attributes = c("ensembl_gene_id", "go_accession", "go_name_1006", "go_linkage_type"),
                filters="ensembl_gene_id",
                values=geneID[1],
                mart=mart)

This worked fine and it resulted in this format:

head(results)
  ensembl_gene_id go_accession                     go_name_1006 go_linkage_type
1      AGAP004677   GO:0016491          oxidoreductase activity                
2      AGAP004677   GO:0044281 small molecule metabolic process                
3      AGAP004677   GO:0009058             biosynthetic process                
4      AGAP004677   GO:0003674               molecular_function                
5      AGAP004677   GO:0055114      oxidation-reduction process             IEA
6      AGAP004677   GO:0003824               catalytic activity             IEA

Since there are multiple GO terms for each gene ID, I'd like to group gene IDs together and so there's only one row for each gene ID with the corresponding GO terms place in a single cell. So the above result would then be:

ensembl_gene_id      go_accession        go_name_1006
                     GO:0016491,         oxidoreductase activity, 
                     GO:0044281,         small molecule metabolic process, 
AGAP004677           GO:0009058,         biosynthetic process, 
                     GO:0003674,         molecular_function, 
                     GO:0055114,         oxidation-reduction process, catalytic activity
                     GO:0003824

How should I go about doing this?

ADD REPLY
0
Entering edit mode

If you are using a mart that holds the go names in the db as well as the accessions, eg the plant mart, and additionally want to output the accession numbers concatenated (not included in the original question) you can skip using the GO.db package, and modify the second part (aggregation) as follows

report = filter(results, !go_accession == "") %>%
         group_by(ensembl_gene_id) %>%
         summarize(go_accessions = do.call(paste, c(as.list(go_accession), sep=",")),
                   go_names = do.call(paste, c(as.list(go_name_1006), sep=",")))

Note that dplyr outputs results using a modified data.frame object that hides variables not fitting on the screen (your long concatenated strings). If you export it to a textfile, and it'll all be there. If you want to view all of it, you can cast it to a regular data.frame object by adding an additional %>% as.data.frame() at the end of the pipe.

I'll modify my answer above to include this solution.

ADD REPLY
2
Entering edit mode
10.4 years ago
lkmklsmn ▴ 980

You should tell biomaRt to also return the geneIDs

e.g.

getBM(attributes=c('GOterm','geneID'),filters='geneID',value=values,mart=ensembl)
ADD COMMENT
0
Entering edit mode

Yeah I did that as well. So I have the results, I just need to get them in the right layout for output. Do you have any input?

ADD REPLY
0
Entering edit mode

Ok. Now I understand. I think you should use the split() function in R.

In your example you would do:

asplit<-split(1:nrow(results),results[,1])

This will give you a list of indices for each gene. Now you can iterate over this list to combine all entries for the same gene.

goAccession<-lapply(asplit,function(x) results[x,2])
goName<-lapply(asplit,function(x) results[x,3])

I haven't run the code, but I think you could almost copy and paste this and it should be close to what you want (at least from my understanding).

Hope this helps

ADD REPLY

Login before adding your answer.

Traffic: 1846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6