Question

Hypergeometric Test R/Bioconductor Result Categorization

2

Entering edit mode

14.0 years ago

Thaman ★ 3.3k

Hi,

This problem is in R/Bioconductor, topGO and GOstats packages. I want to perform HyperGeometric test for OVER representation against GO and KEGG. I have go two text files locally available: Back.txt (background with only entrez id and more then 2000 ID's- Illumina HT12 v3. microarray) and **genes.txt** (differently expressed genes- Illumina HT12 v3. microarray). The result should be in R data.frame with following fileds (GO_term_id, Go_term_name, p_value and number of associated genes from my genes list(Genes.txt file)) and visualizing in the Gograph or KEGG pathway.

I have perform hyperGTest and got the result but I am not sure whether the generated result is correct or not as per my OVER represented need against GO and KEGG. How can I check my result with other databases for consistency? I have tried with DAVID but couldn't figure out.

I have following code

 library(topGO)

 library(GOstats)

 universe=read.table("Back.txt", sep=",")  # Background files where only entrez id's are listed without heading column

 tbl <- read.table ("genes.txt", sep=",")  # selected genes with following header Probes_id,entrez_gene_id,symbols,P.Value and F.C

 selected=<-tbl$V2  # Selecting only second column of tbl vector where entrez_gene_id is present


 param<- new ("GOHyperGParams", geneIds = selected,

 universeGeneIds=universe, annotation="org.Hs.eg.db",

 ontology="BP",pvalueCutoff=0.1, conditional=FALSE,testDirection="over")

 hyp<-hyperGTest(param)

 summary(hyp)

For the trial I wrote summary(hyp) which gives me GOBPID, Pvalue, OddsRatio,ExpCount,Count, Size and Term.

But I want to get result in data.frame with following fields consisting GO_term_id, Go_term_name, p_value and number of associated genes from my genes list(Genes.txt file)

I have provided the genes.txt sample link for convinence and Back.txt contains only entrez_id

*Note- This is a follow up question.

r bioconductor kegg gene enrichment • 8.3k views

ADD COMMENT • link updated 13.9 years ago by Brad Chapman 9.7k • written 14.0 years ago by Thaman ★ 3.3k

1

Entering edit mode

Your question is unclear. You appear to be doing your test against the Biological Process GO field in the annotation db. So why would this return the KEGG pathway in any subsequent data frame? If you want to extract the information from the annotation database to integrate with the data, then you need to learn to look stuff up in the annotation package. The data summary(hyp) returns all the relevant information for your query, just as analysing it in DAVID would.

ADD REPLY • link 14.0 years ago by User 59 13k

1

Entering edit mode

Hi Thaman, there are only few possible reason for not getting an answer: nobody knows, nobody cares, or nobody understood the question. Mainly here its a mixture, caused by point # three. People have asked you repeatedly for a reproducible example. Something that I could paste in my R session and run it. Make the sample data downloadable (all your inputs required + genes.txt) and even ppl not totally familiar with the package might help you out. I guess I could help but if you keep ignoring comments..., so please stop whining and re-opening and put an reproducible example

ADD REPLY • link 14.0 years ago by Michael 55k

0

Entering edit mode

My problem with this question is that it seems to be asking about 3 different things. It's very unclear what you are trying to achieve, particularly without sample data.

ADD REPLY • link 14.0 years ago by Neilfws 49k

0

Entering edit mode

@ Daniel- You are right about the confusion, but I have mentioned this in my previous question. I have two options either to do it from GO or KEGG so that's why result also changes according to annotation packages I have been using. I tried to look GOstats and topGO but I din't find extracting specific information in result sections like my requirement (GO_term_id, Go_term_name, p_value and number of associated genes from my genes list(Genes.txt file)).

@ Micheal and newilfws- Okei I will edit questions again and make it more precise with sample data.

ADD REPLY • link 14.0 years ago by Thaman ★ 3.3k

0

Entering edit mode

@ Daniel- You are right about the confusion, but I have mentioned this in my previous question. I have two options either to do it from GO or KEGG so that's why result also changes according to annotation packages I have been using. I tried to look GOstats and topGO but I din't find extracting specific information in result sections like my requirement (GO_term_id, Go_term_name, p_value and number of associated genes from my genes list(Genes.txt file)).

ADD REPLY • link 14.0 years ago by Thaman ★ 3.3k

0

Entering edit mode

@ Micheal and newilfws- Okei I will edit questions again and make it more precise with sample data.

ADD REPLY • link 14.0 years ago by Thaman ★ 3.3k

0

Entering edit mode

@ Daniel, Michael, neilfws- I have edited my questions in the most precise way. Hope now it's clear

ADD REPLY • link 14.0 years ago by Thaman ★ 3.3k

0

Entering edit mode

It's highly appreciated :)

ADD REPLY • link 14.0 years ago by Michael 55k

0

Entering edit mode

@ Michael, so will you guide me how to solve my probelm now?

ADD REPLY • link 14.0 years ago by Thaman ★ 3.3k

0

Entering edit mode

If Brad hadn't been faster, I would, let us know if that solves your problem.

ADD REPLY • link 14.0 years ago by Michael 55k

Ram · Answer 1 · 2010-11-09

Here is the code, with comments, to produce the output table with gene names. The manipulations are broken down into step by step to try and make them as clear as possible. The two key bits are:

Use the human annotation table to extract Entrez gene IDs linked to your GO identifiers.
Subset your list of selected genes based on that list of Entrez IDs.

This is applied to each row of the table, then pushed back into the final data table for output:

library(GOstats)

#universe <- read.table("Back.txt")
inGenes <- read.table ("genes.txt", sep="\t", header=TRUE)
selected <- unique(inGenes$Entrez_Gene_ID)

param <- new("GOHyperGParams", geneIds=selected,
             #universe=universe,
             annotation="org.Hs.eg.db", ontology="BP",pvalueCutoff=0.1,
             conditional=FALSE, testDirection="over")
hyp <- hyperGTest(param)
sumTable <- summary(hyp)
# subset the output table to get the columns of interest 
# (GO ID, GO description, p-value)
out <- subset(sumTable, select=c(1, 7, 2))
# retrieve input genes associated with each GO identifier
# use the org.Hs.eg data mapping to get GO terms for each ID
goMaps <- lapply(out$GOBPID, function(x) unlist(mget(x, org.Hs.egGO2ALLEGS)))
# subset the selected genes based on those in the mappings
goSelected <- lapply(goMaps, function(x) selected[selected %in% x])
# join together with a semicolon to make up the last column
out$inGenes <- unlist(lapply(goSelected, function(x) paste(x, collapse=";")))
# write the final data table as a tab separated file 
write.table(out, file="go_results.tsv", sep="\t", row.names=FALSE)