Selecting samples from a VCF that best represent population variation
4.3 years ago
liorglic ★ 1.5k

Hello,
Suppose I have a VCF file with millions of variants (rows) and thousands of samples (columns). I am looking for a way to select a subset of N samples such that they will maximize, or best represent, the variation in the entire population. There is probably more accurate terminology for that, but unfortunately I'm not familiar with it. The idea is to choose a subset of samples for further sequencing/analysis.
Are there any standard tools or methods for doing this type of subsetting analysis? If not, can you help in determining the best way to go about it? I can think of several (probably naive) ways to achieve this:

  1. Use some kind of K-means clustering, where K = N, and choose one representative from each cluster
  2. Run a PCA and select a representative from each of the N largest clusters
  3. Generate a similarity matrix or even phylogeny and somehow use this for choosing representatives
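
For illustration, option 1 could be sketched roughly as follows. This is a minimal, standard-library-only sketch under stated assumptions, not an established tool: it assumes the genotypes have already been extracted from the VCF into one vector of alt-allele counts (0, 1 or 2) per sample, with no missing calls. The function names `sq_dist` and `kmeans_representatives` and the toy data are hypothetical. For millions of variants you would likely thin or LD-prune the sites first and use an optimized library instead of pure Python.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_representatives(genotypes, n, n_iter=50, seed=0):
    """Cluster samples into n groups; return one sample index per cluster.

    genotypes: list of per-sample vectors of alt-allele counts (assumed 0/1/2).
    """
    rng = random.Random(seed)
    # Farthest-first initialisation: start from a random sample, then keep
    # adding the sample farthest from the centroids chosen so far.
    centroids = [list(genotypes[rng.randrange(len(genotypes))])]
    while len(centroids) < n:
        farthest = max(genotypes,
                       key=lambda g: min(sq_dist(g, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [min(range(n), key=lambda k: sq_dist(g, centroids[k]))
                      for g in genotypes]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(n):
            members = [g for g, lab in zip(genotypes, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]

    # Representative = the real sample closest to each cluster centroid.
    reps = []
    for k in range(n):
        idx = [i for i, lab in enumerate(labels) if lab == k]
        if idx:
            reps.append(min(idx, key=lambda i: sq_dist(genotypes[i], centroids[k])))
    return sorted(reps)

# Toy data (made up): 9 samples x 6 variants, three obvious groups of three.
toy = [
    [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0],
    [2, 2, 2, 0, 0, 0], [2, 2, 2, 0, 0, 1], [2, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 2, 2], [0, 1, 0, 2, 2, 2], [0, 0, 0, 2, 2, 1],
]
reps = kmeans_representatives(toy, n=3)
print(reps)
```

Picking the sample nearest each centroid (rather than the centroid itself) guarantees that every representative is an actual sample that can be sequenced. Note that k-means favours dense clusters, so rare, divergent samples may be under-represented compared with a distance-based approach such as option 3.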

I would like to hear your thoughts on these approaches, or any other suggestions.
Thanks a lot!

vcf • 773 views