Question

topGO is performing enrichment test on GO terms that are not present in my dataset

1

Entering edit mode

6.9 years ago

Fabio Marroni ★ 3.0k

I am afraid I am doing something wrong with topGO. If I perform an enrichment analysis specifying the genes of interest (and the corresponding GO slim terms that I got by biomaRt), in the results I also get GO terms that are not associated to the gene of interest. What is going on?

Here an example. I use a very small file including only one gene (you can copy on a text editor, save it as mygenes.txt, maybe convert spaces to tab, and then read it in R using read.table).

         ensembl_gene_id chromosome_name start_position end_position
VIT_200s1458g00010              Un       39225125     39226050
                                                                                                                                                  goslim_goa_accession
GO:0005575;GO:0016740;GO:0016301;GO:0000166;GO:0005488;GO:0003674;GO:0005623;GO:0005886;GO:0016020;GO:0003824;GO:0006464;GO:0008150;GO:0008152;GO:0009987;GO:0019538

I paste the code I used for analysis

#Read the only gene and select it as the interesting gene    
mygenes<-read.table("mygenes.txt",header=T,stringsAsFactors=F)
#Create geneList object to be specified as the allGenes parameter for creating a topGO object


myInterestingGenes<-mygenes$ensembl_gene_id
finto <- as.integer(mygenes$ensembl_gene_id %in% myInterestingGenes)
names(finto) <- mygenes$ensembl_gene_id
#Create the gene2GO object
mygene2GO<-as.list(strsplit(mygenes$goslim_goa_accession,";"))
mygene2GO<-setNames(mygene2GO,mygenes$ensembl_gene_id)


#For selecting genes to be used in analysis I defined a function called topgosel (it is an overkilling for the example, but useful for the real analysis)
topgosel<-function(x)
{
y<-x>0
return(y)
}

#Finally create the topGO object
mytopgo<-new("topGOdata",ontology="BP",allGenes=finto,annot = annFUN.gene2GO, geneSel= topgosel , gene2GO = mygene2GO)
sampleGOdata<-mytopgo
#Run fisher test
resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher")
#Extract Fisher results
allRes <- GenTable(sampleGOdata, classicFisher = resultFisher,topNodes = min(resultFisher@geneData[4],100))

Now, the gene used for testing and its GO is the following:

> mygene2GO
    $VIT_200s1458g00010
 [1] "GO:0005575" "GO:0016740" "GO:0016301" "GO:0000166" "GO:0005488"
 [6] "GO:0003674" "GO:0005623" "GO:0005886" "GO:0016020" "GO:0003824"
[11] "GO:0006464" "GO:0008150" "GO:0008152" "GO:0009987" "GO:0019538"

While in the results I get these GO terms

> allRes$GO.ID
 [1] "GO:0006464" "GO:0008150" "GO:0008152" "GO:0009987" "GO:0019538"
 [6] "GO:0036211" "GO:0043170" "GO:0043412" "GO:0044237" "GO:0044238"
[11] "GO:0044260" "GO:0044267" "GO:0071704"

As you can see, there are some GO terms that are present only in the results, and some that are present only in the input file. This sounds strange to me: it's like topGO decides its own ontology for the gene under investigation.

I am sure I am missing (or messing) something, but what?

Can anyone help me? Do you think it's appropriate if I post the same question to the bioconductor support forum?

topGO gene ontology bioconductor R • 4.7k views

ADD COMMENT • link updated 6.8 years ago by Pablo Markaide ▴ 20 • written 6.9 years ago by Fabio Marroni ★ 3.0k

1

Entering edit mode

I could reproduce your error, but I don't know what causes it. Are these unexpected GO terms connected to the ones actually associated to your gene? Maybe their significance could come from the GO network structure, but that would be strange. It looks like a bug to me, you should definitely ask about it in bioconductor.

ADD REPLY • link 6.9 years ago by Martombo ★ 3.1k

score 2 · Answer 1 · 2018-02-01

Hi Fabio,

I'm not an expert in topGO, but digging in my old notes (when I was preparing the code for topGO batch analysis), I have comments on how topGO ADDs some parent nodes if they are missing on your sets. In my previous test with small datasets, the addition was bigger compared to more comprehensive gene sets.

If you play with NodeSize during new GOdata object generation, you can see that the number of GO actually is bigger than my universe depending on that parameter:

    #Number below represent unique GOs

    length(unique(unlist(geneID2GO))) #my gene universe
    1936
    length(usedGO(GOdata)) #with Nodesize=5
    788
    length(usedGO(GOdata)) #with Nodesize=1
    2534

Maybe I'm completely wrong but I'm with @Martombo that you should check the network structure to test if this added GOs you mention are in fact part of the GO hierarchy