Question

extracting genotypes from a multi-sample VCF that have certain variants

8.7 years ago

Floydian_slip ▴ 170

Hi, I have a set of variants and a multi-sample merged VCF that indicates the genotype for each sample. Is there a way to extract the names of the samples that have those variants? Ideally, I would like to do this per variant: each variant followed by the names of the samples that have it.

Thanks a lot in advance! ~N

vcf genotypes • 2.9k views

It's not clear to me where you're looking for genotypes (sample, A1, A2) and where for variants (chrom/pos/ref/alts). What are your inputs?

8.7 years ago by Pierre Lindenbaum 164k

I have two inputs: 1. a VCF file with a set of variants; 2. a merged VCF file from multiple individuals that gives, for each variant, the genotype of each sample (present, absent, etc.).

Not every individual will have the variants from the first file. What I would like to know is which samples have each of the variants from the first file, e.g., variant1 from file1 is present in these samples from file2.
I hope that is clear.
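
To make the matching concrete, here is a rough Python sketch of the first half of the problem: keeping only those records of the merged VCF whose variant also appears in the first file. The function names `variant_keys` and `filter_merged` and the toy records are made up for illustration. It assumes uncompressed, tab-delimited VCF text and that a variant is identified by an exact match on CHROM, POS, REF and ALT, so a multi-allelic record such as ALT `T,G` would not match a file1 entry with ALT `T`.

```python
def variant_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from the record lines of a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta/header lines and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, pos, ref, alt))
    return keys


def filter_merged(merged_lines, keys):
    """Yield header lines plus the merged-VCF records whose key is in `keys`."""
    for line in merged_lines:
        if line.startswith("#"):
            yield line  # keep the header so sample names are preserved
            continue
        if not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if (chrom, pos, ref, alt) in keys:
            yield line


# Toy data (hypothetical): file1 holds one variant of interest,
# file2 is a two-sample merged VCF with two records.
file1 = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\t.\t.\n",
]
file2 = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\n",
    "chr1\t150\t.\tT\tC\t.\tPASS\t.\tGT\t0/0\t1/1\n",
]

kept = list(filter_merged(file2, variant_keys(file1)))
print("".join(kept), end="")
```

With real files the two lists would be replaced by open file handles; gzip-compressed VCFs would need `gzip.open(path, "rt")` instead of `open`.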

8.7 years ago by Floydian_slip ▴ 170

So, I figured out a way: first, I can use bedtools intersect on the two files to get only those lines of the multi-sample merged VCF file that contain the variants I want information for. Next, from the resulting file, I can parse the columns corresponding to the genotypes of each sample and extract only those column headings (and hence the sample names) that have that variant (a genotype such as 0/1, 1/1 or 1/2, meaning that the sample carries the variant in some form) using awk, cut, etc.
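
For anyone who prefers a script to awk/cut for the second step, the genotype parsing could look roughly like the Python sketch below. It is not what I actually ran; the function name `carriers_per_variant` and the toy records are made up for illustration. It assumes the input is the already-filtered, uncompressed merged VCF, that the `#CHROM` header line is still present, and that a sample "has" the variant if its GT field contains any allele other than `0` or `.` (so at multi-allelic sites it does not distinguish which ALT allele is carried).

```python
def carriers_per_variant(vcf_lines):
    """Return [(variant_label, [sample names with a non-reference GT]), ...]."""
    samples = []
    result = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at the 10th column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # no genotype information for this record
        gt_index = fmt.index("GT")
        carriers = []
        for name, call in zip(samples, fields[9:]):
            parts = call.split(":")
            gt = parts[gt_index] if gt_index < len(parts) else "."
            # Treat phased (|) and unphased (/) genotypes the same way.
            alleles = gt.replace("|", "/").split("/")
            if any(a not in ("0", ".", "") for a in alleles):
                carriers.append(name)
        result.append((f"{chrom}:{pos}:{ref}>{alt}", carriers))
    return result


# Toy three-sample VCF (hypothetical data).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:12\t0/0:9\t1/1:20\n",
    "chr1\t200\t.\tC\tT,G\t.\tPASS\t.\tGT\t./.\t1/2\t0|0\n",
]

for variant, names in carriers_per_variant(example):
    print(variant, ",".join(names), sep="\t")
```

Each output line is the variant followed by a comma-separated list of the samples carrying it, which is the format I asked for in the original question.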

Thanks!

8.7 years ago by Floydian_slip ▴ 170