Does anyone have experience with the goseq package in R?
I am trying to run a GO enrichment test on a Drosophila RNA-seq data set, but unfortunately I have run into some problems with this package.
First, the package contains almost no Drosophila gene IDs.
Second, the package accepts only Entrez IDs.
I have the FlyBase IDs (FBgnXXXXXXX) and would like to convert them to Entrez IDs as easily as possible.
Do you have any ideas as to how to do it?
Are there any better options or R packages for running a GO enrichment test?
Please be aware that while goseq is very useful for expression-based analysis where you want to normalize for transcript length, it does not do the over-representation analysis itself. Also see this question: Bioconductor Goseq - Overrepresented P-Values
The vignette says:
goseq will work with any method for determining differential expression and as such differential expression analysis is outside the scope of this document, but in order to facilitate ease of use, we will make use of the edgeR package to calculate differentially expressed (DE) genes in all the case studies in this document.
So it normally is edgeR that does the actual enrichment analysis.
For the mapping you could also use BridgeDB with the Drosophila database available from the PathVisio download page, or with any of the supported mapping services. BridgeDB can be run as a local web service that you can call from R. Alternatively you can use BatchMapper, which is a BridgeDB-based standalone tool. I am not sure whether that would solve your duplication problems.
Everything you need to install BridgeDB should be here. If it is not, please file a bug report or mail the developers list.
I do think that goseq does the enrichment analysis on its own.
"This package provides methods for performing Gene Ontology analysis of RNA-seq data, taking length bias into account" [Quote]
It doesn't do differential expression analysis, but if I understand it correctly, the package was created for enrichment calculations.
To run BridgeDB I need the lib files, but I can't find them on the BridgeDB web site. Do you have a clue where they are, or whether I still need them?
You could try the GO term enrichment analysis with the GeneAnswers package, and use the Bioconductor annotation packages for the ID mapping. The GeneAnswers documentation has improved since the package came out. And once you get the hang of the annotation packages, they seem fairly straightforward. I don't know how current they are, but I find them useful.
Here is a code snippet illustrating both:
library("GeneAnswers")
library("org.Dm.eg.db")
library("GO.db")
# get named vector of entrez ids
fb.entrez <- unlist(as.list(org.Dm.egFLYBASE2EG))
# for a data frame x with flybase ids (column 1) and data values (column 2)
# match the flybase names against the vector of entrez ids
iv <- match(x[,1], names(fb.entrez))
# add a column for entrez ids
x <- cbind(x, rep(NA, nrow(x)))
# fill it in by mapping the entrez ids onto the matching flybase ids
x[,3] <- fb.entrez[iv]
# now you can do some GO analysis
# for an index vector "myTopHits" of your top data
topset <- x[myTopHits,3]
# remove entries that had no matching entrez id (NA)
topset <- topset[!is.na(topset)]
# Get BP enrichment
foo <- geneAnswersBuilder(topset, 'org.Dm.eg.db', categoryType='GO.BP', testType='hyperG')
go.bp <- foo@enrichmentInfo
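For anyone wondering what a hypergeometric test (the testType='hyperG' option above) amounts to for a single GO term, here is a minimal base R sketch. It is not the GeneAnswers implementation, and all four counts are made-up toy numbers, so treat it as an illustration of the idea only.

```r
# Hypothetical toy counts, not taken from any real data set.
universe_size <- 1000  # genes with an Entrez ID and at least one GO annotation
term_size     <- 50    # genes in the universe annotated to the GO term
hits          <- 40    # number of top genes, e.g. length(topset)
hits_in_term  <- 8     # top genes that are annotated to the GO term

# Probability of seeing hits_in_term or more annotated genes when drawing
# `hits` genes at random without replacement from the universe.
p_enrich <- phyper(hits_in_term - 1, term_size, universe_size - term_size,
                   hits, lower.tail = FALSE)

# The same question asked as a one-sided Fisher exact test on the 2x2 table
# (rows: in term / not in term, columns: top gene / not top gene).
tab <- matrix(c(hits_in_term,
                hits - hits_in_term,
                term_size - hits_in_term,
                universe_size - term_size - (hits - hits_in_term)),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(phyper = p_enrich, fisher = p_fisher))
```

In a real analysis you would repeat this for every GO term and correct the p-values for multiple testing, for example with p.adjust().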
Thanks for the reply.
Yes, I am familiar with BioMart, but with the newest version of the biomaRt R package I have found duplicated IDs, including IDs that are no longer in use at Entrez (NCBI).
I get a lot of duplicates, which then need to be removed.
It would be nice to have another option to do such an analysis without the need to convert data this way and that way.
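Duplicates like the ones described above can be cleaned up in base R once the mapping is in a two-column table. The sketch below assumes a data frame with hypothetical column names flybase and entrez, and the IDs in it are made up for illustration. It drops exact repeats and sets aside FlyBase IDs that still map to more than one Entrez ID, so those can be checked by hand.

```r
# Hypothetical mapping table as it might come back from a mapping service;
# the IDs below are invented for illustration.
map <- data.frame(
  flybase = c("FBgn0000001", "FBgn0000001", "FBgn0000002",
              "FBgn0000003", "FBgn0000003"),
  entrez  = c("1001", "1001", "1002", "1003", "1004"),
  stringsAsFactors = FALSE
)

# Drop rows that are exact repeats of an earlier row.
map <- unique(map)

# Count how many distinct Entrez IDs each FlyBase ID still maps to.
n_targets <- table(map$flybase)

# Split into one-to-one mappings and mappings that need manual review.
unambiguous <- map[map$flybase %in% names(n_targets)[n_targets == 1], ]
ambiguous   <- map[map$flybase %in% names(n_targets)[n_targets > 1], ]

print(unambiguous)
print(ambiguous)
```

Whether to discard the ambiguous IDs or pick one target per gene is a judgment call that depends on the data set.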
I agree, it's annoying when tools require specific IDs. However, at least BioMart makes obtaining them relatively easy. I'm finding R biomaRt rather flaky at the moment, so I'm sticking with the website.
I tried to update my answer so it covers your comments. The part about the vignette was already in the other question.