8.7 years ago
Floydian_slip
Hi, I have a set of variants and a multi-sample merged VCF that gives the genotype of each sample. Is there a way to extract the names of the samples that have those variants? Ideally, I would like this for each variant: the variant, followed by the names of the samples that carry it.
Thanks a lot in advance! ~N
It's not clear to me where you're looking for the genotype (sample, A1, A2) and where for the variant (chrom/pos/ref/alts). What are your inputs?
I have 2 inputs: 1. a vcf file with a set of variants. 2. Another merged VCF file from multiple individuals that indicates for each variant what is the genotype (present, absent, etc).
Now, not every individual will have the variants from the first file. What I would like to know is which samples have each of the variants from the first file. E.g., variant1 from file1 is present in these samples from file2.
I hope that is clear.
So, I figured out a way: first, I can use bedtools intersect on the two files to keep only those lines of the multi-sample merged VCF that contain the variants I want information for. Next, from the resulting file, I can parse the genotype column of each sample with awk, cut, etc. and extract only those column headings (and hence the sample names) whose genotype shows the variant (0/1 or 1/2, meaning the sample has that variant in some form).
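For anyone who prefers a script to awk/cut, here is a minimal Python sketch of the same two steps. It is not the exact pipeline above: the function name `samples_with_variant`, the `wanted` argument and the demo data are made up for illustration, and it assumes an uncompressed, tab-separated VCF with a `#CHROM` header line and a `GT` entry in the FORMAT column. A sample counts as a carrier if its genotype has any non-reference allele, so at multi-allelic sites it does not tell you which ALT allele the sample has.

```python
def samples_with_variant(vcf_lines, wanted=None):
    """Map (chrom, pos, ref, alt) -> list of sample names carrying a non-ref allele.

    vcf_lines: iterable of lines from a multi-sample VCF.
    wanted:    optional set of (chrom, pos, ref, alt) keys to keep,
               standing in for the intersect step; None keeps every site.
    """
    samples = []
    carriers_by_variant = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at column 10
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        key = (chrom, pos, ref, alt)
        if wanted is not None and key not in wanted:
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # no genotype information at this site
        gt_idx = fmt.index("GT")
        carriers = []
        for name, column in zip(samples, fields[9:]):
            gt = column.split(":")[gt_idx]
            alleles = gt.replace("|", "/").split("/")  # phased or unphased
            # carrier = at least one allele that is neither reference nor missing
            if any(a not in ("0", ".") for a in alleles):
                carriers.append(name)
        carriers_by_variant[key] = carriers
    return carriers_by_variant


# Tiny made-up VCF, built with explicit tabs so the example is self-contained.
DEMO_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "S1", "S2", "S3"]),
    "\t".join(["chr1", "100", ".", "A", "G", ".", ".", ".", "GT:DP",
               "0/1:12", "0/0:9", "1|1:20"]),
    "\t".join(["chr1", "200", ".", "C", "T,G", ".", ".", ".", "GT",
               "0/0", "1/2", "./."]),
    "\t".join(["chr2", "300", ".", "G", "A", ".", ".", ".", "GT",
               "0/0", "0/0", "0/1"]),
])

if __name__ == "__main__":
    for variant, names in samples_with_variant(DEMO_VCF.splitlines()).items():
        print("\t".join(variant), ",".join(names), sep="\t")
```

On a real file you would pass an open file handle instead of `DEMO_VCF.splitlines()`, and build `wanted` from the CHROM, POS, REF and ALT columns of the first VCF.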
Thanks!