Question

How To Verify That A Subset Of Genes Will Cluster Together

0

11.6 years ago

elb83 ▴ 80

Hi guys, I have a question about clustering: I have a gene expression matrix (rows = genes, about 15,000; columns = samples, about 300) and a list of 30 selected features (genes) that were chosen beforehand and are included among the 15,000 genes. I would like to check whether these a priori selected features are able to cluster the whole gene expression matrix. Can anyone help me, please? I know that there is a huge number of algorithms for this, including hierarchical clustering, but I need a very robust algorithm, since I have to classify human samples for a clinical test!

Thanks a lot for your help!

Best,

e.

clustering gene-expression • 3.2k views

ADD COMMENT • link updated 11.6 years ago by Michael 55k • written 11.6 years ago by elb83 ▴ 80

1

You are using the word clustering, but what you are asking to do is actually classification. In order to perform classification, you'll need to form a "classifier". For a classifier, you'll need a training dataset that has the same characteristics as any test datasets you'll want to classify later. The training dataset must have the outcome of interest known. In your setting, do you know the outcomes for your 300 samples (or a large subset of them)?

ADD REPLY • link 11.6 years ago by Sean Davis 27k

0

Hi! The problem is that I really don't have any information on clinical outcome, or on any other classification or classes in my dataset! I would like to know if I can classify or cluster the patients according to the expression of those genes. If the samples do cluster differently according to the expression of that group of genes, I would also like to know which other genes show the same differential expression as my features (the a priori defined set of genes).

ADD REPLY • link 11.6 years ago by elb83 ▴ 80

0

If you cluster samples using ANY set of genes, you will ALWAYS get clusters. Unfortunately, these clusters do not carry any particular biological meaning unless you know something about your samples. I think you have an idea about what you want to do, which is great. However, it sounds like it needs to be fleshed out. I would encourage you to talk to your collaborators and to seek the counsel of a local bioinformatics expert who can discuss the dataset and the questions you would like to try to answer.
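
To illustrate the first point, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data are simulated Gaussian noise (40 "samples" by 30 "genes", no real structure), and `kmeans_two_groups` is a hypothetical, bare-bones two-group k-means, not a recommended implementation. Even so, the algorithm returns two non-empty groups.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_two_groups(points, n_iter=20, seed=0):
    """Bare-bones k-means with k=2 (Lloyd's algorithm), for illustration only."""
    rng = random.Random(seed)
    centroids = rng.sample(points, 2)  # start from two distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        labels = [min((0, 1), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # update step: each centroid becomes the mean of its members
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Simulated data: pure noise, so there is no true grouping to find.
rng = random.Random(42)
points = [[rng.gauss(0.0, 1.0) for _ in range(30)] for _ in range(40)]

labels = kmeans_two_groups(points)
sizes = [labels.count(0), labels.count(1)]
print("cluster sizes on pure noise:", sizes)
```

The two groups found here reflect nothing but the algorithm being asked for two groups, which is why cluster membership alone says little without outside knowledge about the samples.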

ADD REPLY • link 11.6 years ago by Sean Davis 27k

1

So you want to see if a subset of your gene expression data (picked a priori) forms a good cluster? There are many goodness-of-clustering statistics that measure how heterogeneous or homogeneous a cluster is. A simple one is the within-cluster sum of squared distances. However, these measures are meaningless by themselves, without taking the shape of the data into consideration.

Perhaps you can treat the problem somewhat like an enrichment analysis: how likely is it for a subset with X members to produce a within-cluster sum of squares of Y or lower? I guess you can bootstrap that by taking random subsets with X members from your entire dataset and calculating the within-cluster sum of squares for each.
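
A rough sketch of this random-subset idea in plain Python follows. The function names, the toy data (200 genes by 20 samples, with the first 10 genes sharing a common pattern), and the add-one correction to the empirical p-value are my own assumptions for illustration, not part of the suggestion above. Real expression data would normally be scaled per gene first.

```python
import random

def within_sum_of_squares(rows):
    """Sum of squared distances from each row (gene profile) to the centroid."""
    centroid = [sum(col) / len(rows) for col in zip(*rows)]
    return sum((v - c) ** 2 for row in rows for v, c in zip(row, centroid))

def subset_p_value(matrix, subset_idx, n_draws=1000, seed=0):
    """Empirical p-value: how often does a random gene subset of the same
    size have a within-subset sum of squares as low as the observed one?"""
    rng = random.Random(seed)
    observed = within_sum_of_squares([matrix[i] for i in subset_idx])
    k = len(subset_idx)
    hits = 0
    for _ in range(n_draws):
        idx = rng.sample(range(len(matrix)), k)  # random subset, same size
        if within_sum_of_squares([matrix[i] for i in idx]) <= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_draws + 1)

# Toy data (assumption): 200 genes x 20 samples of noise, where the
# first 10 genes follow one shared expression pattern.
rng = random.Random(1)
pattern = [rng.gauss(0.0, 1.0) for _ in range(20)]
matrix = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(200)]
for i in range(10):
    matrix[i] = [p + rng.gauss(0.0, 0.1) for p in pattern]

observed, p_value = subset_p_value(matrix, list(range(10)), n_draws=500)
print(f"observed WSS = {observed:.2f}, empirical p = {p_value:.4f}")
```

A small p-value here only says the chosen genes are more tightly co-expressed than random gene sets of the same size; it says nothing about whether they separate the samples in a clinically meaningful way.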

ADD REPLY • link 11.6 years ago by Damian Kao 16k

0

Hi Damian! Thanks a lot for your help! I would simply like to know if I can classify the patients according to the expression of that group of genes and, if so, which other genes show an expression pattern similar to that group.

ADD REPLY • link 11.6 years ago by elb83 ▴ 80

0

It is not clear what you mean by "I would like to check if the selected (a priori) features are able to cluster the whole gene expression matrix". I suggest you formulate a more precise biological question.

ADD REPLY • link 11.6 years ago by Michael 55k

0

Michael, I added some explanations in my replies to Sean and Damian. If it is still not clear enough, I can explain the problem again in more detail!

ADD REPLY • link 11.6 years ago by elb83 ▴ 80

score 4 · Answer 1 · 2013-05-25

First of all, don't take this personally; no insult is intended, I am just trying to help you. Unfortunately, I seem to have a subscription to answering questions where the approach does not make much sense. I also completely understand that this is probably not the kind of answer you were hoping for.

I have been trying to work out what you are trying to do, and I finally noticed that there is a more important aspect to the situation you describe than which clustering algorithm might be superior: the ethical dimension. If what you describe in your post is true (and this is indeed a bit hard to believe), you are trying to devise a clinical test on humans or human data, but you do not understand the methods you are about to use, nor do you seem able to express a sensible biological question. In my view, that is simply unethical. So whatever your plans are, STOP! Doing tests on people is not a joke.

Therefore, I urge you to take at least one step back and provide us with all the details of your trial and setup. People here are extremely willing to help and to discuss all aspects of the analysis and any biological questions you might have. Otherwise, chances are that you will get it horribly wrong. In particular, you must be able to formulate a precise hypothesis, which can involve a clinical response variable. I have some guesses about your setting and how to approach your problem; for example, that the only way to go is to cluster patients, not genes. But before relying on wild guesses, I would like you to improve your question. There are many things that are odd about it, and I will name only a few:

Cluster analysis is mainly an exploratory analysis. Using it is, indeed, a confession that tells me "I have no good hypothesis".
The headline of the question is most likely incorrect. I cannot see how this can be relevant for a clinical test, and also it is not consistent with the rest of the question.
There is definitely some information which you are hiding: based on which prior knowledge have the 30 genes been selected? There must be some.
Based on the above, it is very hard to believe that there is no clinical outcome related to the study; a response variable must exist somewhere, otherwise a clinical test makes no sense.
You are using the term 'robust', but robustness means something very specific in statistics (robustness against outliers or against deviations from prior assumptions, e.g. non-normality). However, unless it is known beforehand that there are many outliers in the data, accuracy might be valued more highly than robustness.
Following from this, again, there is no good way to assign an accuracy to clusters. That again requires a dependent variable, so that you can do some sort of regression or classification and assess the performance.
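
To make the last point concrete: once a known response variable exists, performance can be estimated, for instance by cross-validation. The sketch below is only an illustration under stated assumptions: the class labels `y` and the expression values `x` are simulated, and the nearest-centroid classifier with leave-one-out cross-validation is chosen purely for brevity, not as the recommended method for a clinical setting.

```python
import random

def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the class whose mean expression profile is closest to `sample`."""
    centroids = {}
    for label in set(train_y):
        members = [row for row, lab in zip(train_x, train_y) if lab == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(sample, centroids[lab])),
    )

def leave_one_out_accuracy(x, y):
    """Hold out each sample in turn, train on the rest, and count correct calls."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)

# Simulated data (assumption): 40 samples x 30 genes, two known classes,
# with class 1 shifted upwards by 2 units in every gene.
rng = random.Random(0)
y = [0] * 20 + [1] * 20
x = [[rng.gauss(2.0 * label, 1.0) for _ in range(30)] for label in y]

accuracy = leave_one_out_accuracy(x, y)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Without the labels in `y`, no such accuracy can be computed at all, which is the reason a dependent variable is needed.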

In summary, when working with clinical studies, you have to be very conservative in your choice of statistical methods. You should stick to the commonly accepted standards of analysis for statistical testing, regression, and classification. You should not introduce non-standard statistical approaches where well-established analyses are available. New methods can be invented in a methods paper using well-established data sets.

I hope it has become clear to you why it is currently difficult to answer your question in a responsible way. Again I apologize if my words sound harsh.