Hello,
Suppose I have a VCF file with millions of variants (rows) and thousands of samples (columns). I am looking for a way to select a subset of N samples that captures as much as possible of, or best represents, the genetic variation in the entire population. There is probably more accurate terminology for this, but unfortunately I'm not familiar with it. The goal is to choose a subset of samples for further sequencing/analysis.
Are there any standard tools or methods for this type of subsetting? If not, could you help me work out the best way to go about it? I can think of several (probably naive) ways to achieve this:
- Use some kind of K-means clustering, where K = N, and choose one representative from each cluster
- Run PCA and select a representative from each of the N largest clusters in the resulting projection
- Generate a pairwise similarity matrix, or even a phylogeny, and somehow use it to choose representatives
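To make the third idea a bit more concrete, here is a rough sketch of what I have in mind; it is an illustration, not a tested method. It assumes the genotypes have already been decoded into alt-allele counts per sample (0/1/2, with None for missing calls), uses the mean absolute difference in allele counts as the distance, and picks samples greedily so that each new sample is as far as possible from the ones already chosen (farthest-first selection). The function names, the distance measure, the selection rule and the toy data are all my own assumptions, and pure-Python loops like these would not scale to a real VCF.

```python
from itertools import combinations


def genotype_distance(a, b):
    """Mean absolute difference in alt-allele counts, skipping missing calls (None)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def distance_matrix(genotypes):
    """genotypes: dict mapping sample name -> list of 0/1/2 (or None), one per variant."""
    names = sorted(genotypes)
    dist = {name: {name: 0.0} for name in names}
    for s, t in combinations(names, 2):
        d = genotype_distance(genotypes[s], genotypes[t])
        dist[s][t] = d
        dist[t][s] = d
    return dist


def pick_diverse_samples(dist, n):
    """Greedy farthest-first selection of up to n samples from a distance matrix."""
    names = sorted(dist)
    if n <= 0 or not names:
        return []
    # Seed with the sample that has the largest total distance to all others.
    chosen = [max(names, key=lambda s: sum(dist[s].values()))]
    while len(chosen) < min(n, len(names)):
        remaining = [s for s in names if s not in chosen]
        # Add the sample whose nearest already-chosen sample is farthest away.
        chosen.append(max(remaining, key=lambda s: min(dist[s][c] for c in chosen)))
    return chosen


# Toy data: 6 hypothetical samples x 6 variants (made up for illustration).
toy = {
    "S1": [0, 0, 0, 0, 0, 0],
    "S2": [0, 0, 0, 0, 0, 1],
    "S3": [2, 2, 2, 2, 2, 2],
    "S4": [2, 2, 2, 2, 1, 2],
    "S5": [0, 2, 0, 2, 0, 2],
    "S6": [0, 0, 0, 1, 0, 0],
}
selected = pick_diverse_samples(distance_matrix(toy), 3)
print(selected)  # ['S3', 'S1', 'S5']
```

On the toy data this picks one sample from each of the three obvious groups (S3 from the all-alt group, S1 from the all-ref group, and the intermediate S5), which is the behaviour I am after, but I don't know whether this is a sensible criterion at real scale.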
I would like to hear your thoughts on these ideas, or any other suggestions.
Thanks a lot!