I am afraid I am doing something wrong with topGO. If I perform an enrichment analysis specifying the genes of interest (and the corresponding GO slim terms that I got by biomaRt), in the results I also get GO terms that are not associated to the gene of interest. What is going on?
Here an example. I use a very small file including only one gene (you can copy on a text editor, save it as mygenes.txt, maybe convert spaces to tab, and then read it in R using read.table).
ensembl_gene_id chromosome_name start_position end_position
VIT_200s1458g00010 Un 39225125 39226050
goslim_goa_accession
GO:0005575;GO:0016740;GO:0016301;GO:0000166;GO:0005488;GO:0003674;GO:0005623;GO:0005886;GO:0016020;GO:0003824;GO:0006464;GO:0008150;GO:0008152;GO:0009987;GO:0019538
I paste the code I used for analysis
#Read the only gene and select it as the interesting gene
mygenes<-read.table("mygenes.txt",header=T,stringsAsFactors=F)
#Create geneList object to be specified as the allGenes parameter for creating a topGO object
myInterestingGenes<-mygenes$ensembl_gene_id
finto <- as.integer(mygenes$ensembl_gene_id %in% myInterestingGenes)
names(finto) <- mygenes$ensembl_gene_id
#Create the gene2GO object
mygene2GO<-as.list(strsplit(mygenes$goslim_goa_accession,";"))
mygene2GO<-setNames(mygene2GO,mygenes$ensembl_gene_id)
#For selecting genes to be used in analysis I defined a function called topgosel (it is an overkilling for the example, but useful for the real analysis)
topgosel<-function(x)
{
y<-x>0
return(y)
}
#Finally create the topGO object
mytopgo<-new("topGOdata",ontology="BP",allGenes=finto,annot = annFUN.gene2GO, geneSel= topgosel , gene2GO = mygene2GO)
sampleGOdata<-mytopgo
#Run fisher test
resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher")
#Extract Fisher results
allRes <- GenTable(sampleGOdata, classicFisher = resultFisher,topNodes = min(resultFisher@geneData[4],100))
Now, the gene used for testing and its GO is the following:
> mygene2GO
$VIT_200s1458g00010
[1] "GO:0005575" "GO:0016740" "GO:0016301" "GO:0000166" "GO:0005488"
[6] "GO:0003674" "GO:0005623" "GO:0005886" "GO:0016020" "GO:0003824"
[11] "GO:0006464" "GO:0008150" "GO:0008152" "GO:0009987" "GO:0019538"
While in the results I get these GO terms
> allRes$GO.ID
[1] "GO:0006464" "GO:0008150" "GO:0008152" "GO:0009987" "GO:0019538"
[6] "GO:0036211" "GO:0043170" "GO:0043412" "GO:0044237" "GO:0044238"
[11] "GO:0044260" "GO:0044267" "GO:0071704"
As you can see, there are some GO terms that are present only in the results, and some that are present only in the input file. This sounds strange to me: it's like topGO decides its own ontology for the gene under investigation.
I am sure I am missing (or messing) something, but what?
Can anyone help me? Do you think it's appropriate if I post the same question to the bioconductor support forum?
I could reproduce your error, but I don't know what causes it. Are these unexpected GO terms connected to the ones actually associated to your gene? Maybe their significance could come from the GO network structure, but that would be strange. It looks like a bug to me, you should definitely ask about it in bioconductor.