Hi,
I hope this is not a stupid question, but I have done hierarchical clustering (Euclidean distance matrix + complete linkage method) on a subset of genes (8000) in my bulk RNA-seq samples (40), and I found that 13 samples do not cluster as expected/predicted. I also ran a PCA and, in line with the hierarchical clustering, those samples sit far apart from the others.
Is there any way (or R package) to identify which genes are responsible for the different clusters without using the PCA (e.g., without relying on the loadings)?
Practical example: in the dendrogram from this site (dendo), what makes the purple samples (7, 13, 16) differ from the red ones, and what makes the red + purple cluster sit on a different branch/arm from the blue and green samples?
I guess there are genes that would make all the samples cluster together and genes that are so different that they make the samples cluster far apart, and this could potentially be observed as macro-differences (red/purple vs green/blue) or micro-differences (red vs purple).
thank you in advance
Camilla
I wouldn't use Euclidean distance in such a high dimensional space because it's most likely subject to distance concentration. If you nonetheless manage to get a good clustering it probably means you have a strong signal contributed by a limited number of genes. You should be able to identify them by looking at which ones contribute the most to the distances between cluster centres, i.e. rank the genes by the squared differences of their means in each cluster. Alternatively, you could train a classifier to predict membership to each cluster (using cluster membership as given labels) and examine the weights associated with each gene.
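As a rough illustration of the first suggestion, here is a minimal sketch in base R that ranks genes by the squared difference of their means between two clusters. The matrix `expr`, the `cluster` labels and the shifted genes are toy values made up for the example; with real data the labels would come from the clustering itself.

```r
# Toy data: 12 samples (rows) x 50 genes (columns); all names are hypothetical
set.seed(1)
expr <- matrix(rnorm(12 * 50), nrow = 12,
               dimnames = list(paste0("s", 1:12), paste0("gene", 1:50)))
cluster <- rep(c(1, 2), each = 6)                        # assumed cluster labels
expr[cluster == 2, 1:3] <- expr[cluster == 2, 1:3] + 5   # genes 1-3 drive the split

# Per-gene mean in each cluster, i.e. the two cluster centres
centre1 <- colMeans(expr[cluster == 1, , drop = FALSE])
centre2 <- colMeans(expr[cluster == 2, , drop = FALSE])

# Each gene's share of the squared Euclidean distance between the centres
contrib <- (centre1 - centre2)^2

# Genes ranked from largest to smallest contribution
ranked <- sort(contrib, decreasing = TRUE)
head(ranked, 5)
```

The contributions sum to the squared Euclidean distance between the two centres, so `ranked / sum(ranked)` can be read as the fraction of that distance attributable to each gene.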
That's exactly what I wanted to do, but I am not able to figure out how to get this information from hclust. Do you know of any link I can look at to understand how to do it? Thank you for your answer!
To get the clusters, you need to cut the tree generated by hclust, for example with the cutree function.
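A minimal sketch of that step on toy data, assuming samples are in rows and genes in columns (the matrix `expr` and the choice of `k = 2` are only for illustration):

```r
# Toy matrix: 10 samples (rows) x 20 genes (columns), two obvious groups
set.seed(42)
expr <- rbind(matrix(rnorm(5 * 20, mean = 0), nrow = 5),
              matrix(rnorm(5 * 20, mean = 4), nrow = 5))
rownames(expr) <- paste0("sample", 1:10)

d  <- dist(expr, method = "euclidean")   # distances between samples (rows)
hc <- hclust(d, method = "complete")     # complete linkage, as in the question

clusters <- cutree(hc, k = 2)            # named integer vector: sample -> cluster id
table(clusters)                          # cluster sizes
split(names(clusters), clusters)         # sample names in each cluster
```

cutree can also cut at a given height with the `h` argument instead of asking for a fixed number of clusters with `k`.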
Hi (again),
I tried the cutree function (cutting the dendrogram into 5 clusters), but I only got this result, with no information about which genes contribute most, apart from my guess that the samples in cluster 1 contribute most:
Where is my mistake? Here is my dataset:
and here is my script:
cutree returns a vector of cluster memberships, one entry per sample; it does not say anything about genes. You then need to extract the data for each cluster yourself, e.g. for cluster 1 with df4[clu.k5==1, ]
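To make that concrete, here is a hedged sketch that reuses the names df4 and clu.k5 from above, but on toy data with samples in rows and genes in columns. The labels are fixed by hand so the example is reproducible (in practice they would come from cutree), and `rank_genes` is a hypothetical helper, not part of any package. It compares either two single clusters or two groups of merged clusters.

```r
# Stand-ins for df4 and clu.k5: 15 samples (rows) x 30 genes (columns), toy values
set.seed(7)
df4 <- as.data.frame(matrix(rnorm(15 * 30), nrow = 15,
                            dimnames = list(paste0("s", 1:15), paste0("g", 1:30))))
df4[1:3, "g1"] <- df4[1:3, "g1"] + 8     # g1 separates cluster 1 from cluster 2
df4[1:6, "g2"] <- df4[1:6, "g2"] + 8     # g2 separates clusters 1-2 from 3-5
clu.k5 <- rep(1:5, each = 3)             # assumed labels; normally cutree(hc, k = 5)

cluster1 <- df4[clu.k5 == 1, ]           # data for the samples in cluster 1 only

# Per-gene mean in every cluster (genes x clusters matrix)
centres <- sapply(split(df4, clu.k5), colMeans)

# Hypothetical helper: rank genes by squared difference of means between
# two groups, where each group is one or more cluster ids
rank_genes <- function(data, labels, group_a, group_b) {
  mean_a <- colMeans(data[labels %in% group_a, , drop = FALSE])
  mean_b <- colMeans(data[labels %in% group_b, , drop = FALSE])
  sort((mean_a - mean_b)^2, decreasing = TRUE)
}

micro <- rank_genes(df4, clu.k5, 1, 2)           # e.g. red vs purple
macro <- rank_genes(df4, clu.k5, c(1, 2), 3:5)   # e.g. red + purple vs the rest
head(micro, 3)
head(macro, 3)
```

Because each comparison only uses the samples in the two chosen groups, a pairwise contrast does not require removing samples from the clustering or rerunning it.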
Is there a specific reason why you do not want to do PCA? It sounds like a good job for PCA. You can also visualize a loadings plot using the PCAtools package. It is pretty easy to make.
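For reference, a minimal sketch of pulling loadings out of a PCA using base R's prcomp rather than PCAtools, so it runs without extra packages. The data are made up and samples are assumed to be in rows:

```r
# Toy data: 10 samples (rows) x 25 genes (columns); names are hypothetical
set.seed(3)
expr <- matrix(rnorm(10 * 25), nrow = 10,
               dimnames = list(paste0("s", 1:10), paste0("gene", 1:25)))
expr[1:5, "gene1"] <- expr[1:5, "gene1"] + 10   # one gene dominates the variance

pca <- prcomp(expr, center = TRUE, scale. = FALSE)

loadings <- pca$rotation                 # genes x principal components
# Genes with the largest absolute loading on PC1
top_pc1 <- names(sort(abs(loadings[, "PC1"]), decreasing = TRUE))[1:5]
top_pc1
```

The loadings describe variation across all samples in the PCA at once, so they point to the genes behind the dominant axes rather than to a specific pair of clusters.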
I did it, but I would like to be able to discriminate between differences across all samples (which I can do with the loadings in the PCA) and differences between specific groups, which I cannot do with PCA. For instance, with the PCA in my earlier example I can find what I called macro-differences (red/purple vs green/blue), but not micro-differences (red vs purple) without removing samples and therefore changing the result of the PCA. I don't know if that makes sense.