10 months ago
analyst
Hi everyone,
I have a combined VCF file of multiple samples. I need to remove homozygous, heterozygous and missing genotypes and keep only polymorphic genotypes.
Kindly suggest which tool I should use to get the desired output.
Thanks a lot!
You can parse the genotype (GT) field in the VCF file, see docs for VCF 4.1 here.
It would be as easy as keeping only the lines in which more than 3 different numbers appear in the GT field. You can easily do this in R with the vcfR package. There is also this post from a few years ago that offers other options, including one from bcftools.
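If R is not an option, the same idea can be sketched in plain Python without any VCF library. This is only a rough illustration, assuming a standard tab-separated VCF data line with a FORMAT column that contains GT. The function names `gt_alleles` and `distinct_alleles` are made up for this example, and how many distinct allele numbers count as "polymorphic" is left for you to decide.

```python
import re

def gt_alleles(vcf_line):
    """Return one list of allele codes per sample, e.g. ['0', '1'] for 0/1."""
    fields = vcf_line.rstrip("\n").split("\t")
    # Column 9 (index 8) is FORMAT; find where GT sits among its keys.
    gt_idx = fields[8].split(":").index("GT")
    genotypes = []
    for sample in fields[9:]:
        parts = sample.split(":")
        gt = parts[gt_idx] if gt_idx < len(parts) else "."
        # Alleles are separated by '/' (unphased) or '|' (phased).
        genotypes.append(re.split(r"[/|]", gt))
    return genotypes

def distinct_alleles(vcf_line):
    """Set of allele numbers seen at this site, ignoring missing calls ('.')."""
    return {a for gt in gt_alleles(vcf_line) for a in gt if a != "."}

# Hypothetical example line with three samples (the last one is missing).
line = "1\t100\t.\tA\tG,T\t50\tPASS\t.\tGT:DP\t0/1:10\t1|2:8\t./.:0"
print(sorted(distinct_alleles(line)))
```

You would loop over the non-header lines (those not starting with `#`) and keep a line when `len(distinct_alleles(line))` passes whatever cutoff you settle on.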
Thanks dthorbor!
I think this command from the post only removes homozygous genotypes, not heterozygous ones. What should I add to the command below so that it also removes heterozygous and missing genotypes?
Thanks a lot!
I'm unsure if there are any tools that do exactly what you want, as your use case sounds fairly niche. As stated, it would be easy to parse a VCF if you have a basic grasp of the genotype field and can use R (or the PyVCF package if you use Python).
In the previous post, there is another command that keeps sites with at least one non-reference allele. Can I increase the stringency of the criterion to keep sites with at least 40% polymorphic alleles by replacing 1 with 40?
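One way to check a percentage cutoff like this independently of any particular tool is to compute the non-reference allele fraction per site yourself. The sketch below is only an illustration under some assumptions: "40% polymorphic alleles" is read as "at least 40% of the called alleles are non-reference", missing alleles are left out of the denominator, and the names `nonref_fraction` and `MIN_FRACTION` are hypothetical. It does not say whether swapping 1 for 40 in the command from the other post would do the same thing.

```python
import re

MIN_FRACTION = 0.40  # assumed reading of "at least 40%"

def nonref_fraction(vcf_line):
    """Fraction of called alleles at a site that are non-reference (not '0')."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_idx = fields[8].split(":").index("GT")
    called = 0
    nonref = 0
    for sample in fields[9:]:
        parts = sample.split(":")
        gt = parts[gt_idx] if gt_idx < len(parts) else "."
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing alleles do not count towards the total
            called += 1
            if allele != "0":
                nonref += 1
    # A site with no called alleles gets 0.0 so that it is filtered out.
    return nonref / called if called else 0.0

# Hypothetical site: 0/1, 1/1, missing, 0/0 -> 3 non-ref out of 6 called.
line = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\t./.\t0/0"
keep = nonref_fraction(line) >= MIN_FRACTION
print(nonref_fraction(line), keep)
```

Comparing the sites kept this way against the output of the command you end up using is a quick sanity check that the threshold means what you expect.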