7.8 years ago
samuel.lipworth
Hi,
I want to find all clusters with a maximum SNP distance of, say, 12 SNPs among 500 samples. I have a matrix of pairwise SNP distances but need an algorithm to cluster them, something like hierarchical clustering that terminates at a maximum distance of 12, but I'm not sure how to do this in, e.g., R. Any ideas?
Thanks
Could you please give an example of the data and the output you want to get? Also, if you can explain the reason for the question, we might be able to find a solution faster.
Sure: a matrix of SNP distances between 4 samples, e.g.
So I can obviously reconstruct the phylogeny using, e.g., ML, which would show me that there is a reconstructed SNP distance of <12 between samples 3 and 2, and also between 4 and 3. Essentially, I want to define all clusters in which the distance between any cluster member and its nearest neighbour is at most 12. I could do this by simply looking at the ML tree, but this becomes tedious with massive data sets. The reason for doing this is to look for evidence of transmission.
If I get it right, you want to cluster in binary space, where a distance <12 is considered equally "close" (let's use 0 for that) and >=12 is "far" (we can assign 1 to such cases). Then you transform your matrix to
and you want to cluster it then? If so, you can use dist(x, method="binary") in R as the distance measure (which is Jaccard), and then pass the resulting distance matrix object to a clustering algorithm like hclust. Otherwise, you can start with binary clustering using coclusterBinary from the blockcluster package: https://cran.r-project.org/web/packages/blockcluster/vignettes/blockcluster_tutorial.pdf
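For what it's worth, the rule described in the question (every cluster member lies within the cutoff of at least one other member) is what single-linkage clustering cut at that cutoff produces, and it can also be computed directly as the connected components of a "close enough" graph. Below is a minimal sketch of that idea in plain Python, not anyone's definitive method. The function name cluster_by_snp_threshold, the 4-sample matrix, and the choice of <= rather than < for the cutoff are all assumptions made for illustration; the matrix is made up and is not the poster's data.

```python
def cluster_by_snp_threshold(dist, threshold):
    """Group sample indices so that each member is within `threshold`
    of at least one other member of its cluster (single linkage).

    `dist` is a symmetric square matrix (list of lists) of SNP distances.
    Assumption: a pair counts as linked when distance <= threshold.
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of samples whose distance is within the cutoff.
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Collect members by their root to obtain the clusters.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical distances for 4 samples (index 0 = sample 1, and so on).
example = [
    [0, 40, 35, 50],
    [40, 0, 8, 15],
    [35, 8, 0, 10],
    [50, 15, 10, 0],
]

print(cluster_by_snp_threshold(example, 12))  # [[0], [1, 2, 3]]
```

Note that samples 2 and 4 end up in the same cluster even though they are 15 SNPs apart, because both are within 12 of sample 3. If that chaining is not wanted (i.e. every pair in a cluster must be within the cutoff), complete linkage would be the thing to look at instead. The pairwise loop is quadratic in the number of samples, which should be fine for 500.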