Hi all,
I have a set of DNA samples from Y plants in a given geographic area. I'm going to be doing RADseq on individuals in this population (and a number of other, separate populations); however, due to financial constraints I'm unable to sequence all Y individuals. I've decided that I can afford to sequence k (out of Y) in this particular area.
I'd like to select the k samples which are farthest apart/most geographically distributed within the Y samples I collected. Samples that are taken from plants in close proximity to each other are more likely to be closely related, and I'd like to sequence what are ultimately the most genetically diverse samples for my later analysis.
So, the question is: given a set of Y points/samples in 2D geographic space (lat/long coordinates), how do I select the k points that are most geographically distributed/distant from each other?
As I've explored this a bit, I think the problem I'm really having is how to define 'distance' or 'most distributed'. Some of the metrics I've thought of (e.g. maximum average distance between the k points) result in really unintuitive point selection in certain cases (for example, if two points are right next to each other but far away from another cluster of all the other points, the two points will be included even if they're almost on top of each other).
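To make that failure case concrete, here is a minimal brute-force sketch comparing two candidate objectives: the mean pairwise distance and the minimum pairwise distance (sometimes called a max-min criterion). The toy coordinates, the function names, and the use of planar Euclidean distance are all assumptions for illustration; real lat/long data would need a great-circle distance or a projection first, and an exhaustive search over all k-subsets only stays feasible for small Y and k.

```python
from itertools import combinations
from math import dist  # planar Euclidean distance (Python 3.8+); an assumption, not geodesic

def mean_pairwise(points):
    """Average distance over all pairs in the subset."""
    pairs = list(combinations(points, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def min_pairwise(points):
    """Distance between the two closest points in the subset."""
    return min(dist(a, b) for a, b in combinations(points, 2))

def best_subset(points, k, score):
    """Exhaustively score every k-subset and return the highest-scoring one.
    Ties are broken by iteration order."""
    return max(combinations(points, k), key=score)

# Hypothetical toy data: two near-duplicate points far from a cluster of four
points = [(0.0, 0.0), (0.1, 0.0),
          (10.0, 0.0), (10.0, 2.0), (12.0, 0.0), (12.0, 2.0)]

by_mean = best_subset(points, 4, mean_pairwise)  # keeps both near-duplicates
by_min = best_subset(points, 4, min_pairwise)    # keeps at most one of them

print("max mean distance:", by_mean)
print("max min distance: ", by_min)
```

On this toy set, maximizing the mean keeps both of the near-duplicate points, while maximizing the minimum pairwise distance never does, because a subset's score is capped by its closest pair.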
I have a feeling that this is not an uncommon problem, and there must be good answers out there. I'll add that I have access to a large cluster and I can brute force the problem to some extent!
Thanks in advance!
Hello! You could try clustering your samples into k clusters using e.g. K-means and selecting the sample closest to each cluster's centroid as the representative individual. A variety of clustering indices exist (e.g. the silhouette index) to check whether the resulting clusters are well separated.
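A rough sketch of that idea in plain Python, in case it helps. Everything here is an assumption for illustration: the function names, the toy coordinates, the deterministic farthest-point initialization (chosen only to make the example reproducible), and the planar Euclidean distance, which treats lat/long as flat coordinates. Since a K-means centroid is a mean and usually not one of the samples, the sketch returns the actual sample nearest each centroid.

```python
from math import dist  # planar Euclidean distance (Python 3.8+); an assumption for lat/long

def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from those chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (centroids, clusters)."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty)
        updated = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def representatives(points, k):
    """For each non-empty cluster, return the real sample closest to its centroid."""
    centroids, clusters = kmeans(points, k)
    return [min(cl, key=lambda p: dist(p, c)) for c, cl in zip(centroids, clusters) if cl]

# Hypothetical toy data: three well-separated groups of three samples
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 0.0), (10.0, 1.0), (11.0, 0.0),
           (5.0, 10.0), (5.0, 11.0), (6.0, 10.0)]

reps = representatives(samples, 3)
print(reps)
```

Note that K-means can settle in a local optimum, so on real data it is worth trying several initializations and comparing the results with an index such as the silhouette.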