Hi,
I've got a table of presence/absence data for around 100 genes across fewer than 10 samples, derived from genome sequence data. I've been experimenting with hierarchical clustering to illustrate the similarity between these strains, borrowing some ideas from microarray analysis.
However, I'm unsure which distance measure to use to construct the distance matrix for clustering. Currently I'm using Hamming distance, given the binary nature of the data, but I'm concerned that it is not normalised and does not account for the joint presence of genes.
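To make the concern concrete, here is a minimal sketch in plain Python. The strain names and gene vectors are made-up toy data, not my real table, and Jaccard distance is included only as one commonly used comparison that ignores joint absences; I'm not assuming it is the right choice.

```python
def hamming(a, b):
    """Number of genes whose presence/absence differs between two samples."""
    return sum(x != y for x, y in zip(a, b))


def normalised_hamming(a, b):
    """Hamming distance divided by the number of genes compared."""
    return hamming(a, b) / len(a)


def jaccard_distance(a, b):
    """1 - (genes present in both) / (genes present in either).

    Genes absent from both samples are ignored entirely.
    """
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1 - both / either


# Hypothetical toy data: one list per sample, one entry per gene (1 = present).
samples = {
    "strain_A": [1, 1, 1, 1, 0, 0, 0, 0],
    "strain_B": [1, 1, 1, 0, 0, 0, 0, 0],
    "strain_C": [1, 0, 0, 0, 0, 0, 0, 0],
    "strain_D": [0, 1, 0, 0, 0, 0, 0, 0],
}

for first, second in [("strain_A", "strain_B"), ("strain_C", "strain_D")]:
    a, b = samples[first], samples[second]
    print(first, second, hamming(a, b), normalised_hamming(a, b), jaccard_distance(a, b))
```

In this toy case strains C and D share no genes at all, yet their normalised Hamming distance is only 0.25 because the six jointly absent genes count as agreement. The Jaccard distance for the same pair is 1.0.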
Any suggestions of alternatives for this type of data, or recommendations of papers I could read to better understand the choice of distance metric, would be much appreciated.
Thanks,
Simon
That is one of the metrics I'm trying to decide between. Any particular reason you chose that method?
It factors in group size.