Question

How Do I Tell Topgo How Scores Work?

4

10.9 years ago

Jeremy Leipzig 22k

resultKS <- runTest(sampleGOdata, algorithm = "classic", statistic = "ks")

The ks test is supposed to use gene "scores" to calculate enrichment. It is vaguely implied these scores are p-values (where smaller is better) - but how does topGO know to treat them as such and why would it make this assumption? How could I tell it explicitly to use ascending or descending scores?

• 5.6k views

ADD COMMENT • link updated 7.2 years ago by Fabio Marroni ★ 3.0k • written 10.9 years ago by Jeremy Leipzig 22k

score 5 · Answer 1 · 2014-02-05

5

10.9 years ago

Jeremy Leipzig 22k

OK, I dug into the source code and found the scoreOrder argument.

"increasing" is for scores where smaller is better: p-values, ranks where 1st is best, etc.

resultKS <- runTest(GOdata, algorithm = "classic", statistic = "ks", scoreOrder = "increasing")
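To see why the direction matters, here is a minimal base-R sketch using stats::ks.test. It is only an illustration of the idea, not topGO's internal code, and the p-values are made up.

```r
# Hypothetical p-values: five genes annotated to a GO term, twenty that are not.
in_term  <- c(0.001, 0.002, 0.005, 0.01, 0.02)
out_term <- seq(0.05, 1, by = 0.05)

# alternative = "greater": the CDF of x lies above that of y,
# i.e. the values in x tend to be smaller than the values in y.
p_small_is_best <- ks.test(in_term, out_term, alternative = "greater")$p.value

# Same genes on a -log10 scale, where larger now means more significant.
# The same one-sided test no longer detects the enrichment...
p_flipped <- ks.test(-log10(in_term), -log10(out_term),
                     alternative = "greater")$p.value

# ...unless the direction of the test is flipped as well.
p_fixed <- ks.test(-log10(in_term), -log10(out_term),
                   alternative = "less")$p.value

print(c(p_small_is_best = p_small_is_best,
        p_flipped = p_flipped,
        p_fixed = p_fixed))
```

The same gene set looks strongly enriched or not enriched at all depending on whether the test direction matches the direction of the scores, which is the kind of choice scoreOrder makes explicit.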

score 2 · Answer 2 · 2017-11-02

I know that the question is very old and probably out of date, but I would like to share my findings with you, since I had the very same doubts as you two had!

I didn't use p-values to rank, but FPKM (this doesn't matter). I wrote one function to select only expressed genes (FPKM > 0.1) and another to select only genes that are NOT expressed, and I ran Fisher and KS tests with each of them.

You can see the results in the (horrible) image topGO

As you can see, the results for KS do not change, because KS, as implemented in topGO, just divides the universe into two sets defined by the threshold function and then tests for differences in the ranks of the two sets. So it really doesn't matter whether you select the significant genes or the non-significant ones (or the expressed ones, as in my case).

In contrast, since Fisher's exact test tests for enrichment among the "selected" genes (I guess we could say it is one-tailed), the Fisher results depend on which group has been marked as "selected".
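The dependence of Fisher's test on the "selected" group can be sketched in base R with stats::fisher.test. The counts below are invented for illustration and this is not how topGO builds its tables internally.

```r
# Hypothetical universe of 20 genes: 10 annotated to a GO term, 10 not.
in_term <- rep(c(TRUE, FALSE), each = 10)

# Hypothetical selection: 8 of the 10 term genes and 2 of the others.
selected <- c(rep(TRUE, 8), rep(FALSE, 2),
              rep(TRUE, 2), rep(FALSE, 8))

# One-sided test for over-representation of term genes among selected genes.
p_selected <- fisher.test(table(in_term, selected),
                          alternative = "greater")$p.value

# Same test after inverting which group counts as "selected".
p_inverted <- fisher.test(table(in_term, !selected),
                          alternative = "greater")$p.value

print(c(p_selected = p_selected, p_inverted = p_inverted))
```

With the selection inverted, the term goes from over-represented to depleted among the "selected" genes, so the one-sided p-value moves from small to close to 1.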

score 1 · Answer 3 · 2014-02-05

1

10.9 years ago

Lluís R. ★ 1.2k

As far as I know, sampleGOdata comes from:

sampleGOdata <- new("topGOdata",
                    description = "Simple session",
                    ontology = "BP",
                    allGenes = genes.list,    # named vector of gene scores
                    geneSel = topDiffGenes,   # function that selects genes from the list above by a cutoff; I used logFC, not p-values
                    nodeSize = 10,            # prune GO terms with fewer than 10 annotated genes
                    annot = annFUN.gene2GO,   # annotation function; change it to match the type of array
                    gene2GO = geneID2GO)      # gene-to-GO mapping, e.g. from readMappings(file)

If I remember correctly, the scores are kept in sampleGOdata, and topGO uses them to calculate the importance of each GO term a gene belongs to.

Changing the algorithm changes how the scores are used, but whatever they are and whatever order they are in, they will be used (this is where nodeSize comes into play) to calculate the importance of each GO term, so I don't see the point of specifying ascending scores. But maybe I didn't understand the vignette well.

But I agree the vignette could be improved. One point related to the question that is not clear to me is whether the getSigGroups function and runTest are exactly the same, or what the differences between them are...

1

So the vignette implies that if you run the Fisher test, it just wants a function (geneSel) that tells it whether a gene is in or out; that part is clear. If you run the KS test, it will use the values in allGenes (a named vector). These values should denote whether a gene is more or less differentially expressed, for example. What those values can be (p-values, ranks, scores, counts) and which direction they must run is what I am asking.
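As a rough sketch of the two inputs described here, the snippet below builds a named score vector and a selection function in base R. The gene names, the p-values, and the 0.01 cutoff are all made up for illustration.

```r
# Hypothetical named vector of per-gene p-values (the kind of object
# passed as allGenes).
all_genes <- c(gene1 = 0.0004, gene2 = 0.03, gene3 = 0.2, gene4 = 0.008)

# Hypothetical selection function (the kind of object passed as geneSel):
# it takes the score vector and returns TRUE for genes that are "in".
top_diff_genes <- function(all_score) {
  all_score < 0.01
}

selected <- top_diff_genes(all_genes)
selected_names <- names(all_genes)[selected]
print(selected_names)
```

The Fisher test only needs the TRUE/FALSE split, while a score-based test like KS needs the numeric values themselves, which is why their direction becomes a question.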

ADD REPLY • link 10.9 years ago by Jeremy Leipzig 22k

0

I thought the direction didn't matter and that topGO automatically treated the lowest value as the best one. But according to your answer, the order does affect the test, at least for p-values. So now I wonder what happened with my data and the logFC I used...

ADD REPLY • link 10.9 years ago by Lluís R. ★ 1.2k

1

If you ran the Fisher test it would be OK, since your geneSel function determines which genes are in or out and that is the only criterion. If you ran the KS test, it would either throw an error because of negative values or simply consider the most negative genes as most relevant. That is probably not what you intended.
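One way around the signed-logFC problem is to convert the values into an unsigned score before running a score-based test. This is a base-R sketch with invented values, and the pairing with scoreOrder = "decreasing" in the comments is an assumption about the counterpart of the "increasing" setting mentioned in the thread.

```r
# Hypothetical signed log fold changes.
logFC <- c(geneA = -3.2, geneB = 0.1, geneC = 4.1, geneD = -0.4)

# Sorting the raw values puts the most negative gene first, which would
# ignore the strongly up-regulated geneC.
raw_first <- names(sort(logFC))[1]

# Option 1: absolute values, where larger means more relevant
# (assumed to pair with a "decreasing" score order).
abs_scores <- abs(logFC)

# Option 2: ranks of the absolute values, where rank 1 is the largest
# change (smaller is better, like p-values, so "increasing").
rank_scores <- rank(-abs_scores)
rank_first <- names(which.min(rank_scores))

print(rank_scores)
```

Either form removes the sign, so up- and down-regulated genes with large changes both count as relevant.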

ADD REPLY • link 10.9 years ago by Jeremy Leipzig 22k