4.8 years ago
maegsul
Hi, I have processed a VCF of ~1000 samples with ~30 variants of interest (SNPs from GWAS). I converted the genotype values to 0, 1 and 2 based on the number of risk alleles. I have a tab-delimited table like the one below:
sample rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8 . . . . . . rs30
sample1 0 0 2 1 1 0 1 2
sample2 1 0 1 1 1 0 1 1
sample3 0 0 2 2 1 0 0 2
sample4 0 0 1 1 0 0 0 1
sample5 1 0 1 1 1 0 0 1
sample6 0 0 0 2 0 0 1 1
sample7 0 0 1 0 2 0 0 2
sample8 0 0 1 1 0 0 2 0
sample9 1 0 1 1 1 0 0 1
.
.
.
.
sample1000
I am looking for a way to randomly choose a group of samples with enough genotype variety for a follow-up experiment. For instance, I would like to print 11 lines (= samples) that meet the following criteria for each rsID, as closely as possible: 3 samples with genotype 0, 5 samples with genotype 1, and 3 samples with genotype 2.
Is there an easy way to do this? Thanks in advance!
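To make the criteria concrete, here is a rough Python sketch of one possible approach, not a definitive solution. It scores a set of 11 samples by how far each SNP's genotype counts are from the 3/5/3 target, then runs a simple random-restart swap search to lower that score. The function names (`load_table`, `cost`, `pick_samples`), the search parameters, and the file name in the usage note are all hypothetical, and the sketch assumes no missing genotypes. An exact 3/5/3 split on every SNP may not exist (for example, if a SNP is 0 in nearly all samples), so the search only looks for a close match.

```python
import csv
import random

# Desired number of selected samples per genotype, for every SNP (from the question).
TARGET = {0: 3, 1: 5, 2: 3}

def load_table(path):
    """Read the tab-delimited table into {sample_name: [genotypes...]}.
    Assumes one header line and integer genotypes 0/1/2 with no missing values."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip the header row
        return {row[0]: [int(x) for x in row[1:]] for row in reader if row}

def cost(rows, target=TARGET):
    """Sum over SNPs of |observed count - target count| for each genotype."""
    total = 0
    for col in zip(*rows):  # iterate over SNP columns
        for genotype, want in target.items():
            total += abs(col.count(genotype) - want)
    return total

def pick_samples(genotypes, k=11, target=TARGET, restarts=20, steps=2000, seed=None):
    """Random-restart swap search (a heuristic, not guaranteed optimal).
    Returns (best_cost, sorted list of chosen sample names)."""
    rng = random.Random(seed)
    names = list(genotypes)
    best_cost, best_set = None, None
    for _ in range(restarts):
        chosen = rng.sample(names, k)  # random starting set
        current = cost([genotypes[s] for s in chosen], target)
        for _ in range(steps):
            candidate = rng.choice(names)
            if candidate in chosen:
                continue
            i = rng.randrange(k)  # position to swap out
            trial = chosen[:i] + [candidate] + chosen[i + 1:]
            trial_cost = cost([genotypes[s] for s in trial], target)
            if trial_cost <= current:  # keep swaps that do not make things worse
                chosen, current = trial, trial_cost
        if best_cost is None or current < best_cost:
            best_cost, best_set = current, sorted(chosen)
    return best_cost, best_set
```

With a hypothetical file name, this could be called as `score, chosen = pick_samples(load_table("genotypes.tsv"), seed=42)`; a score of 0 would mean every SNP hits 3/5/3 exactly, and changing `seed` gives a different random selection.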