Given a set of genes, does anybody have a simple suggestion for clustering it on the basis of GO terms (I'm generally just interested in biological processes)?
I have a very stringently filtered data set and need a preliminary view of the types of biological processes represented in my reduced data set.
It looks like what you want to do is an enrichment analysis for GO terms. In our lab we have developed a tool that lets you run the enrichment analysis in a few easy steps. It is called Gitools and you can find it at http://www.gitools.org.
First you need to download the GO term gene sets (which you can do within the tool) and then run the enrichment with your set of genes against the previously downloaded gene sets (or modules, in Gitools nomenclature).
You can take a look at the tutorials available on the website to get started, and don't hesitate to contact the authors with any questions.
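To make the idea of an enrichment test concrete, here is a minimal sketch in plain Python of a one-sided hypergeometric test per gene set. This is not Gitools code or its API; the function names, the toy gene IDs, and the toy gene set memberships are all made up for illustration.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes when drawing
    n genes from a population of N, of which K carry the annotation."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def enrich(study, population, genesets):
    """Return (term, overlap, p-value) tuples sorted by p-value."""
    population = set(population)
    study = set(study) & population
    results = []
    for term, members in genesets.items():
        members = set(members) & population
        overlap = len(study & members)
        if overlap == 0:
            continue  # nothing to test for this term
        p = hypergeom_pvalue(overlap, len(study), len(members), len(population))
        results.append((term, overlap, p))
    return sorted(results, key=lambda r: r[2])

# Toy data (hypothetical memberships, for illustration only)
population = [f"g{i}" for i in range(1, 21)]              # 20 background genes
genesets = {
    "GO:0006915": ["g1", "g2", "g3", "g4"],               # apoptotic process
    "GO:0007049": ["g5", "g6", "g7", "g8", "g9", "g10"],  # cell cycle
}
study = ["g1", "g2", "g3", "g11"]                         # your filtered gene list

results = enrich(study, population, genesets)
for term, overlap, p in results:
    print(term, overlap, round(p, 4))
```

Note that this sketch does no multiple-testing correction, which you would want when testing many GO terms at once, and the choice of background population matters a lot for the p-values.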
Have you thought about performing GOSlim analyses? Here is what GOSlim stands for: "GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO. They give a broad overview of the ontology content without the detail of the specific fine grained terms."
Some GOSlim sets are already defined (see link above), but you can always define your own set of GO terms to perform the GOSlim analysis.
In this kind of analysis you start with a set of GO terms and a set of selected terms that we'll call the GOSlim set (for example). You then check, by walking the GO graph, whether each of your GO terms is connected to any term in the GOSlim set. In other words, you translate all the GO terms you have initially into a set of selected (normally of interest) GO terms.
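The mapping step described above could be sketched roughly like this in plain Python. The term names and the child-to-parent graph are hypothetical stand-ins; in practice you would load the parent relations from a GO ontology file.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def map_to_slim(terms, slim, parents):
    """For each input term, list the slim terms found among its ancestors."""
    return {t: sorted(ancestors(t, parents) & slim) for t in terms}

# Hypothetical toy graph: child -> list of parents (GO is a DAG, so a term
# can have more than one parent)
parents = {
    "leaf_a": ["mid_a"],
    "leaf_b": ["mid_a", "mid_b"],
    "leaf_c": ["mid_c"],
    "mid_a": ["root"],
    "mid_b": ["root"],
    "mid_c": ["root"],
}
slim = {"mid_a", "mid_b"}  # the selected "terms of interest"

mapping = map_to_slim(["leaf_a", "leaf_b", "leaf_c"], slim, parents)
for term, slim_terms in mapping.items():
    print(term, "->", slim_terms or "no slim term")
```

A term can map to several slim terms or to none at all, so it is worth deciding up front how you want to count those cases.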
DAVID (http://david.abcc.ncifcrf.gov/) can do that: it takes a list of genes and clusters them based on functional GO annotations. There are a lot of other tools there, and you can fine-tune things quite a bit, but that might serve your purposes.
If you actually want to cluster genes based on GO terms, you need to calculate the semantic similarity between all pairs of genes and then cluster them. I know you can do this with GOSim http://goo.gl/YvqlL (an R package) plus one of R's clustering algorithms. The R package GOSemSim http://goo.gl/DXwBS might also be useful, though I have not used it. You also need to decide which semantic similarity metric to use http://goo.gl/fMQYS (not all are implemented in those packages). To interpret the results of the clustering, or to just do enrichment analysis, I recommend the Ontologizer http://goo.gl/6ejVG. It is flexible and lets you specify the ontology, the population set, the study set, and the annotations themselves. As for the enrichment method, I like MGSA http://goo.gl/T1NWl, which is also implemented in the Ontologizer.
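As a rough illustration of the "pairwise similarity, then cluster" workflow, here is a small plain-Python sketch. It does not use GOSim or GOSemSim; as a simple stand-in for a proper semantic similarity measure it uses the Jaccard index on each gene's set of GO terms (assumed to already include ancestor terms), followed by average-linkage agglomerative clustering with a cutoff. The gene names, term names, and threshold are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared terms divided by all terms seen in either gene."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_linkage(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

def cluster(annotations, threshold):
    """Merge the most similar clusters until none reaches the threshold."""
    genes = sorted(annotations)
    sim = {frozenset((g, h)): jaccard(annotations[g], annotations[h])
           for g, h in combinations(genes, 2)}
    clusters = [[g] for g in genes]
    while len(clusters) > 1:
        best_score, (i, j) = max(
            (average_linkage(clusters[i], clusters[j], sim), (i, j))
            for i, j in combinations(range(len(clusters)), 2))
        if best_score < threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Hypothetical annotations: gene -> set of GO terms (ancestors included)
annotations = {
    "geneA": {"t1", "t2", "t3"},
    "geneB": {"t1", "t2", "t4"},
    "geneC": {"t7", "t8"},
    "geneD": {"t7", "t8", "t9"},
    "geneE": {"t5"},
}

clusters = cluster(annotations, threshold=0.4)
print(clusters)
```

Swapping `jaccard` for an information-content-based measure changes only the similarity function; the clustering part stays the same.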
Usually this is done the other way around: you cluster or sub-select genes by some condition, then look for GO enrichment within the groups. You could first try that on your group, perhaps using MEV to do it.
The main problem (and this may already be solved in some publications that I am not aware of) with clustering directly by GO terms is defining a similarity metric that would properly characterize any two GO terms. Intuitively that just does not seem possible over more distant GO terms.
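For what it's worth, one measure that has been proposed for comparing two GO terms is Resnik similarity: the information content of their most informative common ancestor. Below is a minimal plain-Python sketch of that idea under stated assumptions: the toy graph and annotation counts are hypothetical, and the counts are assumed to be cumulative (each term's count includes its descendants).

```python
from math import log

def ancestors(term, parents):
    """All terms reachable by following parent links upward, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term, counts, total):
    """Rarely used terms are more informative: IC = log(total / count)."""
    return log(total / counts[term])

def resnik(t1, t2, parents, counts, total):
    """IC of the most informative ancestor shared by both terms."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((information_content(t, counts, total) for t in common),
               default=0.0)

# Hypothetical toy ontology: child -> parents
parents = {
    "leaf_a": ["mid_a"],
    "leaf_b": ["mid_a"],
    "leaf_c": ["mid_b"],
    "mid_a": ["root"],
    "mid_b": ["root"],
}
# Hypothetical cumulative annotation counts (term plus all its descendants)
counts = {"root": 100, "mid_a": 20, "mid_b": 50,
          "leaf_a": 5, "leaf_b": 10, "leaf_c": 25}
total = counts["root"]

close = resnik("leaf_a", "leaf_b", parents, counts, total)    # share mid_a
distant = resnik("leaf_a", "leaf_c", parents, counts, total)  # share only root
print(close, distant)
```

In this toy case the two distant terms share only the root, so they score zero, which fits the intuition that such measures say little about distant terms.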
If what you want to do is indeed enrichment analysis for GO terms, you might want to check this question. The GO_Elite approach that I mentioned there is more or less the opposite of the GOSlim approach, as it finds the most distant leaves on the GO tree first. The other answers should be of interest as well.
You might want to try SimCT (http://tagc.univ-mrs.fr/SimCT/), which does exactly this: it builds a tree based on similarities between the GO annotations of a set of genes.
updated 5.3 years ago by Ram • written 13.9 years ago by Carl
Tutorials are now found here.