Hi.
After statistical analysis (t-test/ANOVA) of microarray data for differential gene expression with respect to a control, I finally have a list with the following information: gene name, p-value, and log fold change.
What I want to do now is perform a Gene Set Enrichment Analysis (GSEA) in R. A book that I'm reading says that the first step is to create the gene sets using Gene Ontology, KEGG, or other databases, and then run statistics (MLP/KS) to figure out which of these sets are enriched.
My question is how to do this very first step of creating the gene sets, given the data I mentioned above (gene name, p-value, log fold change). Is there any package or function in R that can group genes into sets like this?
Thank you.
Typically a GSEA requires a 'background' gene set of all genes expressed in the tissue/cells, and a 'differential' gene set, i.e. the results from your analysis (although ANOVA/t-test for DE sounds sketchy; have you tried limma?). You can select cut-offs for the p-value and the log fold change (0.01 and 2, respectively, are typical). There are plenty of packages for GSEA on Bioconductor.
Hope that helps,
Bruce.
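To make the cut-off approach above concrete, here is a minimal sketch of an over-representation test. It is written in plain Python only so that it runs without any packages; the same logic carries over to R. The function names, the toy gene list, and the choice of a hypergeometric test are assumptions for illustration. The cut-offs 0.01 and 2 are the ones mentioned above.

```python
from math import comb

def split_by_cutoff(results, p_cut=0.01, lfc_cut=2.0):
    """results: iterable of (gene, p_value, log_fold_change).
    Returns (background, differential) as sets of gene names."""
    results = list(results)
    background = {gene for gene, _, _ in results}
    differential = {gene for gene, p, lfc in results
                    if p < p_cut and abs(lfc) >= lfc_cut}
    return background, differential

def hypergeom_pvalue(background, differential, gene_set):
    """Upper-tail hypergeometric probability P(X >= k) of seeing at least
    k gene-set members among the differential genes by chance."""
    members = gene_set & background          # only genes that were measured
    N, K, n = len(background), len(members), len(differential)
    k = len(members & differential)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Toy data, made up for illustration only.
toy_results = [
    ("g1", 0.001, 2.5), ("g2", 0.004, -3.1), ("g3", 0.002, 2.2),
    ("g4", 0.200, 0.4), ("g5", 0.030, 2.4), ("g6", 0.005, 1.1),
    ("g7", 0.500, -0.2), ("g8", 0.700, 0.1), ("g9", 0.090, -1.5),
    ("g10", 0.400, 0.3),
]
toy_set = {"g1", "g2", "g4"}                 # hypothetical gene set

background, differential = split_by_cutoff(toy_results)
p_enriched = hypergeom_pvalue(background, differential, toy_set)
```

In R, the same tail probability should be obtainable with `phyper(k - 1, K, N - K, n, lower.tail = FALSE)`, using the same meaning for `N`, `K`, `n`, and `k` as in the sketch.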
GSEA does not need a background set; it just needs all the genes analyzed and some statistic associated with each gene, typically log(fold-change).
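A rough sketch of that idea: rank every analyzed gene by its statistic and walk down the list, stepping up at gene-set members and down otherwise. This is a simplified, unweighted (Kolmogorov-Smirnov-like) running sum, not the published GSEA implementation, and it leaves out the permutation step used to assess significance. The function name and toy values are assumptions for illustration.

```python
def enrichment_score(stats, gene_set):
    """stats: dict mapping gene -> statistic (e.g. log fold change).
    Returns the running-sum value with the largest absolute deviation
    from zero; positive means the set sits near the top of the ranking."""
    ranked = sorted(stats, key=stats.get, reverse=True)
    hits = [gene in gene_set for gene in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy statistics, made up for illustration only.
toy_stats = {"a": 3.0, "b": 2.0, "c": 1.0, "d": -1.0, "e": -2.0, "f": -3.0}
es_top = enrichment_score(toy_stats, {"a", "b"})     # set at the top of the list
es_bottom = enrichment_score(toy_stats, {"e", "f"})  # set at the bottom
```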
True, although I still think background sets should be used to limit the sets defining biological processes. If geneX is part of processY, but it isn't found expressed, shouldn't the method be aware of this?
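One way to make the method "aware" of this is to intersect each gene set with the genes that were actually measured before testing, and to drop sets that become too small. A minimal sketch, where the function name, the `min_size` parameter, and the toy sets are assumptions for illustration:

```python
def restrict_to_background(gene_sets, background, min_size=2):
    """Keep only the measured members of each gene set and discard
    sets left with fewer than min_size genes."""
    background = set(background)
    restricted = {}
    for name, genes in gene_sets.items():
        kept = set(genes) & background
        if len(kept) >= min_size:
            restricted[name] = kept
    return restricted

# Toy sets, made up for illustration only.
toy_sets = {
    "processY": {"g1", "g2", "geneX"},   # geneX was not measured
    "processZ": {"g3", "geneQ"},         # only one measured member
}
measured = {"g1", "g2", "g3", "g4"}
usable_sets = restrict_to_background(toy_sets, measured)
```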
Thank you for your reply.
Why does it sound sketchy? I ran Dunnett's test at the ANOVA step and then applied an FDR correction to my p-values. I also have another list whose p-values come from a t-test, also with FDR adjustment. (I couldn't use limma for DE because I didn't have the CEL files.)
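For reference, this is a minimal sketch of what an FDR adjustment does, assuming the Benjamini-Hochberg procedure (the post does not say which FDR method was used). In R, `p.adjust(p, method = "BH")` performs this adjustment; the Python below is only a self-contained illustration, and the function name and toy p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

toy_p = [0.01, 0.04, 0.03, 0.005]   # made-up raw p-values
toy_adjusted = bh_adjust(toy_p)
```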
Anyway.
Here's what is written in the book:
By reading the phrase "examine the set of p-values {pi : i ∈ G} associated with a particular gene set GS to see whether they are, in general smaller in magnitude than the overall set of p-values ", I understand that I first have to somehow create the gene set.
Let's say that I found only two gene sets among my listed genes, and each one plays a role in a different biological process. The next step is to statistically compare these two gene sets against the whole initial list and determine whether either of them is enriched in the treatments.
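The comparison described above can be sketched as a one-sided Kolmogorov-Smirnov statistic with a permutation p-value. This is only an illustration under stated assumptions: the set's p-values are compared against the genes outside the set (the book's wording says "the overall set"), significance comes from random gene sets of the same size, and all names and toy values are made up. R has a built-in `ks.test` for the KS part.

```python
import random
from bisect import bisect_right

def ks_smaller(set_p, other_p):
    """One-sided KS statistic max_x [F_set(x) - F_other(x)], floored at 0.
    It is large when the set's p-values tend to be smaller."""
    a, b = sorted(set_p), sorted(other_p)
    gaps = (bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b)
            for x in a + b)
    return max(0.0, max(gaps))

def set_enrichment(pvals, gene_set, n_perm=1000, seed=0):
    """pvals: dict gene -> p-value. Returns (statistic, permutation p-value)."""
    rng = random.Random(seed)
    genes = sorted(pvals)                       # fixed order for reproducibility
    members = [g for g in genes if g in gene_set]
    if not 0 < len(members) < len(genes):
        raise ValueError("gene set must cover some, but not all, genes")

    def stat(chosen):
        chosen = set(chosen)
        return ks_smaller([pvals[g] for g in chosen],
                          [pvals[g] for g in genes if g not in chosen])

    observed = stat(members)
    exceed = sum(stat(rng.sample(genes, len(members))) >= observed
                 for _ in range(n_perm))
    return observed, (exceed + 1) / (n_perm + 1)

# Toy p-values, made up for illustration only.
toy_pvals = {"g1": 0.001, "g2": 0.002, "g3": 0.003, "g4": 0.2, "g5": 0.3,
             "g6": 0.4, "g7": 0.5, "g8": 0.6, "g9": 0.7, "g10": 0.8}
ks_stat, ks_p = set_enrichment(toy_pvals, {"g1", "g2", "g3"})
```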
Is this thought right or not? If it is, then how can someone create these first gene sets in R? If I'm wrong, please let me know and, if it's easy, post some additional resources that might help on this subject.
See this response from a leader in the field about the t-test/ANOVA/limma issue. It is 7 years old but still relevant.
What book is it that you have? I suggest using online resources, and looking on Bioconductor for an appropriate method that makes sense to you.
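On the gene-set creation step asked about earlier in the thread: conceptually, a gene set is just the list of genes that share an annotation term, so the sets come from inverting a gene-to-term table taken from GO, KEGG, or a similar database. A minimal sketch, assuming such a table is already available as (gene, term) pairs; the function name and the placeholder gene and term labels are made up for illustration. In R, a two-column annotation table can be inverted the same way with `split(annot$gene, annot$term)`, where `annot` is a hypothetical data frame.

```python
from collections import defaultdict

def build_gene_sets(annotation):
    """annotation: iterable of (gene, term) pairs.
    Returns a dict mapping each term to the set of genes annotated with it."""
    gene_sets = defaultdict(set)
    for gene, term in annotation:
        gene_sets[term].add(gene)
    return dict(gene_sets)

# Placeholder annotation, not real GO or KEGG assignments.
toy_annotation = [
    ("g1", "termA"), ("g2", "termA"), ("g2", "termB"),
    ("g3", "termB"), ("g4", "termB"),
]
toy_gene_sets = build_gene_sets(toy_annotation)
```

Note that a gene can belong to several sets (here `g2`), which is normal for GO and KEGG annotations.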
Thank you. I tried to use limma, but I couldn't make it work because, for some reason, I couldn't create proper expression set objects from my data.