Selecting a Subset of Samples Based on Genotype Variety of Multiple Variants

5.2 years ago

maegsul ▴ 170

Hi, I have processed a VCF of ~1000 samples with ~30 variants of interest (SNPs from GWAS). I converted the genotypes to 0, 1 or 2 according to the number of risk alleles each sample carries. The result is a tab-delimited table like the one below:

sample  rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8 . . . . . . rs30
sample1 0   0   2   1   1   0   1   2
sample2 1   0   1   1   1   0   1   1
sample3 0   0   2   2   1   0   0   2
sample4 0   0   1   1   0   0   0   1
sample5 1   0   1   1   1   0   0   1
sample6 0   0   0   2   0   0   1   1
sample7 0   0   1   0   2   0   0   2
sample8 0   0   1   1   0   0   2   0
sample9 1   0   1   1   1   0   0   1
.
.
.
.
sample1000

I am looking for a way to randomly choose a group of samples with enough genotype variety for a follow-up experiment. For instance, I would like to print 11 lines (= samples) that come as close as possible to the following combination for each rsID: 3 samples with genotype 0, 5 with genotype 1, and 3 with genotype 2.

Is there an easy way to do this? Thanks in advance!
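
To make the criterion concrete, here is a minimal Python sketch of one possible approach: start from a random set of 11 samples and repeatedly try swapping one sample for another, keeping a swap whenever it does not move the per-SNP genotype counts further from the 3/5/3 target. This is only an illustration under assumptions, not an established method: the names `load_table`, `score` and `pick_samples` are hypothetical, the file is assumed to have a header row and no missing genotypes, and a random search like this is not guaranteed to find the best possible subset.

```python
import random
from collections import Counter

# Desired number of samples per genotype for every SNP (3 + 5 + 3 = 11).
TARGET = {0: 3, 1: 5, 2: 3}

def load_table(path):
    """Read the tab-delimited table into {sample_name: [genotype, ...]}.

    Assumes a header row and no missing genotype values.
    """
    table = {}
    with open(path) as fh:
        next(fh)  # skip the header line (sample, rs1, rs2, ...)
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1:
                table[fields[0]] = [int(g) for g in fields[1:]]
    return table

def score(chosen, table, target=TARGET):
    """Sum over SNPs of |observed count - target count| per genotype.

    A score of 0 means every SNP matches the target exactly.
    """
    n_snps = len(table[chosen[0]])
    total = 0
    for j in range(n_snps):
        counts = Counter(table[s][j] for s in chosen)
        total += sum(abs(counts.get(g, 0) - want) for g, want in target.items())
    return total

def pick_samples(table, k=11, n_steps=20000, seed=0):
    """Random start followed by single-sample swaps that never worsen the score."""
    rng = random.Random(seed)
    names = sorted(table)
    chosen = rng.sample(names, k)
    best = score(chosen, table)
    for _ in range(n_steps):
        if best == 0:
            break  # perfect match for every SNP
        position = rng.randrange(k)
        candidate = rng.choice(names)
        if candidate in chosen:
            continue
        trial = chosen[:position] + [candidate] + chosen[position + 1:]
        trial_score = score(trial, table)
        if trial_score <= best:
            chosen, best = trial, trial_score
    return chosen, best
```

With the table saved to a file (the name `genotypes.tsv` is a placeholder), `chosen, best = pick_samples(load_table("genotypes.tsv"))` would return 11 sample names and their total deviation from the target. Running it with several different `seed` values and keeping the lowest-scoring result gives different random subsets to choose from. With ~30 SNPs an exact 3/5/3 split for every rsID may not exist, so a small non-zero score is the realistic goal.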

SNP • 620 views
