9.2 years ago
gavinmdouglas
I have a table of concentrations for ~100 metabolites in ~300 plant cultivars. Eventually I plan to run a GWAS to link these phenotypes to sequence data from the same cultivars. I would also like to combine the metabolites into clusters in case that increases power. I have run a PCA and simple hierarchical clustering in R, but I'm wondering whether there is another approach I should be using. If anyone has any recommendations, they would be appreciated! I haven't been able to find a standard approach in the GWAS papers I have looked at.
Thanks,
Gavin
There are many different clustering algorithms, each with its own characteristics, in particular regarding the kinds of structure it is best able to find. Usually, if there is some structure or pattern in the data, most common algorithms will be able to find it. If you don't see any pattern with hierarchical clustering and/or PCA but you expect that there is structure in the data, then that structure probably doesn't conform to the assumptions of these methods.
Note also that the choice of distance/similarity measure is important. For example, Euclidean distance is often useless with vectors of more than ~20 noisy variables because it is subject to distance concentration. Distance concentration is the effect by which, in high-dimensional spaces, the farthest and closest neighbours of a point end up at nearly the same distance or, put another way, the distance measure tends towards a constant. In particular, if your 100 metabolite concentrations are i.i.d., then Euclidean distance will most likely be useless. Whether distance concentration is actually an issue depends on how well separated your clusters are.
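To make the effect concrete, here is a minimal Python sketch (standard library only) that measures the relative contrast, (d_max - d_min) / d_min, of Euclidean distances from one point to all the others. The data are synthetic i.i.d. uniform values, which is an assumption chosen purely for illustration and not a model of real metabolite data. The 300 points and 100 dimensions just mirror the sizes mentioned in the question, and `relative_contrast` is a hypothetical helper name.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (d_max - d_min) / d_min for distances from one point to the rest.

    Points are drawn i.i.d. uniform on [0, 1) in every dimension
    (an illustrative assumption, not real metabolite data).
    """
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]  # Euclidean distances
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast = {d: relative_contrast(300, d, rng) for d in (2, 20, 100)}

for n_dims, value in contrast.items():
    print(f"{n_dims:>3} dimensions: relative contrast = {value:.2f}")
```

With i.i.d. variables the contrast should shrink sharply as the number of dimensions grows, meaning the nearest and farthest neighbours become hard to tell apart. Real metabolite data are usually correlated, so running the same check on your own (scaled) table would show whether the problem applies to it.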
Thanks for your response, I've never heard of distance concentration and will look into it.
Hi Gavin, How did you perform your analysis? I am now in a similar situation and clueless. Any suggestion would be appreciated.
Thanks, Abhishek