R: Topgo Usage Question
3
13
Entering edit mode
14.3 years ago
Mike Dewar ★ 1.6k

I'm struggling using topGO to do some GO enrichment. Apologies in advance for the trivial nature of this question: I must be misunderstanding something.

I have 200 probesets, and I'd like to see if they're significantly represented in any GO terms. The names of the probesets are in a character vector called interesting_genes. I have an ExpressionSet object called exprset which contains all the data from the microarray experiment I'm working on.

I'd like to use the R package topGO. From one of Khader's answers to a previous post, I know that I should be able to use this package to perform enrichment without having to pass in data of any kind (indeed, most of the web-tools I can use for this only require a list of genes, and to select a background set). It's just I'm having trouble persuading the API to let me do what I want, and I'm having trouble interpreting the errors that result. Here is my code:

genes = factor(as.integer(rownames(exprs(exprset)) %in% interesting_genes))
names(genes) <- rownames(exprs(exprset))

all_genes = factor(rep(0,nrow(exprset))
names(all_genes) <- rownames(exprs(exprset))
levels(all_genes) <- levels(genes)

GOdata <- new("topGOdata", description = "getGO", ontology = "BP", 
    allGenes = all_genes, geneSel = genes, nodeSize = 10, 
    GO2gene = list(mogene10sttranscriptclusterGO2PROBE), annot = annFUN.GO2genes
)

Note the awful construction of all_genes - this must be wrong. But topGO requires a "named object of type numeric of factor", and subsequently demands two factor levels, even though all that really makes sense to pass, in this simple approach to enrichment, is a list of names.

Currently the error I'm getting is

Error in 
if is.na(index) || index < 0 || index > length(nd)) stop(paste("selected vertex",  : 
  missing value where TRUE/FALSE needed

but I don't really know how to interpret this. I'm sure it has something to do with the way I'm forming the object. So my question is:

How do I form a topGO object without using any expression data or scoring information? Specifically, what am I misunderstanding here?

r microarray gene bioconductor • 19k views
ADD COMMENT
0
Entering edit mode

just after I asked this question I realised something about how to pass in the annotations. So have edited the question a bit to cure that particular tidbit of naivete. Still, stuck, though.

ADD REPLY
0
Entering edit mode

Can you post few ids from your list ?

ADD REPLY
0
Entering edit mode

@Khader - yep! "10351026" "10463751" "10537347" "10602176" "10426648" "10462752" ...

ADD REPLY
0
Entering edit mode

Despite the two helpful answers below, I still can't get this guy to work. It's starting to feel more like a Bioconductor list question now, though.

ADD REPLY
0
Entering edit mode

Despite the two helpful answers from Brad and lGautier below, I still can't get this guy to work. It's starting to feel more like a Bioconductor list question now, though.

ADD REPLY
0
Entering edit mode

If you can map your probe ids to genes, then you can easily do the analysis with out expression data or scoring information : see this link for a working code - http://bit.ly/bk6Ylp (posted by Chuangye earlier)

ADD REPLY
0
Entering edit mode

@Khader - thanks! That provided the clues I needed. It turns out that I was maybe making life a bit more complicated than necessary. Shall I put what I did in the end as answer, or just delete the question? I guess this turned out to be a bit more of a Bioconductor mailing list question...

ADD REPLY
0
Entering edit mode

Please don't delete the question. As you already posted a final solution, let this remain as a source for future reference on enrichment with out p-values.

ADD REPLY
13
Entering edit mode
14.3 years ago
Mike Dewar ★ 1.6k

By cobbling together both responses, and the link suggested by Khader, it turns out that this is all I need:

library(topGO)
library(mogene10sttranscriptcluster.db)
# exprset is an Expression Set object
# interesting_genes are the probesets I'd like to perform GO Enrichment on
# first pull out all the probesets
all_genes <- rownames(exprs(exprset))
# then make a factor that is 1 if the probeset is "interesting" and 0 otherwise
geneList <- factor(as.integer (all_genes %in% interesting_genes))
# name the factor with the probeset names
names (geneList) <- allGenes
# form the GOdata object
GOdata <-new ("topGOdata", 
    ontology = "BP", 
    allGenes = geneList, 
    nodeSize = 5, 
    # annot, tells topGO to map from GO terms to "genes"
    annot = annFUN.GO2genes,
    # so annot then calls something to perform this mapping GO2genes, 
    # which is this from the mogene... library
    GO2genes = as.list(mogene10sttranscriptclusterGO2PROBE)
)

So, things are bit simpler than I thought, and the main confusion seemed to be about how to specify allGenes and geneSel. I guess this is a very basic, and maybe not that standard, use of topGO!

ADD COMMENT
3
Entering edit mode

I didn't want to complain, but I think maybe the topGO documentation is a little confusing. Following the example mentioned by Khader, I've constructed allGenes as a factor, indicating which genes I'm interested in, and it seems to work. The subsequent Fisher test seems to be giving me sensible results, and hence I'm loathe to mess with it. Especially as I thought this was going to be a 10 minute job, three days ago. But you're right - the responsible thing to do would be to check with the bioconductor mailing list...

ADD REPLY
1
Entering edit mode

Thanks for your question + answer Mike :-) I was having the same problems as you were having, solved now :-)

ADD REPLY
0
Entering edit mode

Not sure of what your are getting back. The parameter "allGenes" seems to have to be a named numerical vector (try running in R example("topGOdata-class") ). At the same time the man page mentions that "allGenes" should be a vector of strings :/. I'd bump that to the bioconductor mailing list.

ADD REPLY
7
Entering edit mode
14.3 years ago

allGenes should be a named vector where the names are the IDs and the values are p-values, or some other value to help determine the selected and non-selected genes:

> vec <- c(1, 0, 1)
> names(vec) <- c("id1", "id2", "id3")
> vec
id1 id2 id3 
  1   0   1

Then geneSel is a function that uses a p-value to decide if a name is selected or not. For instance, if you wanted values of 0 to be selected, and other not:

> selector <- function(theScore) {
+ return (theScore == 0)}
> selector(0)
[1] TRUE
> selector(1)
[1] FALSE

A while back, I wrote up some Python, Rpy2 and R code using topGO that goes into a more complete start to end example.

GOstats uses universe and test lists of IDs and is worth looking at as well.

ADD COMMENT
0
Entering edit mode

Thanks! I hadn't realised that geneSel should be a function.

ADD REPLY
2
Entering edit mode
14.3 years ago

Giving standalone examples is generally better as it lets one ensure that:

  • the problem is reproducible
  • lowers the overhead for others to reproduce it

(and to make readers feel like they are on the bioconductor mailing-list, giving the sessionInfo() should be mandatory).

The parameter geneSel expands to geneSelectionFunction and should be a function.

Beside that, the construction of all_genes does indeed seem a little daring:

library(Biobase)
data(sample.ExpressionSet)

# having all to zero might cause problems afterwards btw
all_genes <- rep(0, nrow(sample.ExpressionSet))
names(all_genes) <- featureNames(sample.ExpressionSet)
ADD COMMENT
0
Entering edit mode

Thanks for the response. Yeah, I agree with the reproducible problem thing, if this were a programming question, which I guess is how I've written it up. I should have probably asked this as a "what do these bits of the topGO API mean" question but the above seemed clearer at the time. And if the answer is something to do with my sessionInfo then I don't want to be using the package! I'm nervous of any function that will allow my sessionInfo to mess it up. I shall make geneSel into a function - I hadn't picked that up from the documentation at all!

ADD REPLY
0
Entering edit mode

Also that I used length(rownames(exprs(exprset))) instead of nrow(exprset) is slightly embarrassing. Am going to edit my OP and hope no-one reads these comments... The variable all_genes does seem to have to be a factor though, for some reason. Or, at least, that's the error I'm getting right now.

ADD REPLY
0
Entering edit mode

I take this back - about all_genes having to be a factor. Though it's unhappy when all_genes is all zeros (or all ones), if I pass it some real values it's fine.

ADD REPLY
0
Entering edit mode

I take this back - about all_genes having to be a factor. Though it's unhappy when all_genes is all zeros (or all ones), if I pass it some real values it's fine. And when I say fine, I don't mean that it starts working, it just doesn't complain about factors.

ADD REPLY

Login before adding your answer.

Traffic: 1645 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6