I have a VCF file, let's call it file A, which was created by performing variant calling with GATK on whole genome sequencing data. For SNPs that do not appear in the VCF, can I assume that the SNP is monomorphic for the reference allele (i.e., all 0 encoding)? Note that sites that are monomorphic for the alternate allele do appear in the file. I ask because I need to merge this file (A) with another VCF (file B), and I'm not sure how to handle the variants that appear in B but not in A. Can I just fill in the genotypes with all 0 for the variants that do not appear in A, or do I have to impute first to account for the possibility of low coverage in the region? It would be great if answers could cite an external source as well, so that I have a point of reference when I consult my advisor, who thinks that sequencing data should never require imputation. Thanks!
Nice program, Pierre
It's worth noting that although assuming hom-ref when coverage exceeds some threshold is a reasonable heuristic in most cases, real situations can be more subtle. For example, the alignments at a site may have predominantly low MAPQ, or they may contain enough mismatches that a confident hom-ref call is inappropriate even at adequate depth.
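To make the heuristic concrete, here is a minimal sketch in plain Python. It is not taken from any tool in this thread: the `Read` class, the function name `genotype_for_absent_site`, and the thresholds (`min_depth`, `min_mapq`, `max_nm`) are all hypothetical and only illustrate the idea of discarding low-MAPQ or high-mismatch reads before trusting depth as evidence for hom-ref.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    """Hypothetical summary of one alignment overlapping the site."""
    mapq: int   # mapping quality of the alignment
    nm: int     # edit distance to the reference (NM tag)
    base: str   # base observed at the site


def genotype_for_absent_site(
    reads: List[Read],
    ref_base: str,
    min_depth: int = 10,   # assumed threshold, tune for your data
    min_mapq: int = 20,    # assumed threshold
    max_nm: int = 3,       # assumed threshold
) -> str:
    """Return '0/0' only when enough trustworthy reads support the reference.

    Anything else is reported as missing ('./.') rather than guessed.
    """
    # Keep only reads that are confidently placed and not too divergent.
    usable = [r for r in reads if r.mapq >= min_mapq and r.nm <= max_nm]

    # Too few usable reads: we cannot distinguish hom-ref from no data.
    if len(usable) < min_depth:
        return "./."

    # Every usable read agrees with the reference: call hom-ref.
    if all(r.base == ref_base for r in usable):
        return "0/0"

    # Non-reference evidence at a site the caller did not emit: leave missing.
    return "./."
```

With these assumed defaults, twelve well-mapped reference reads give `0/0`, while the same twelve reads at MAPQ 5 give `./.`, which is exactly the case where raw depth alone would be misleading.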
definitely
My tool can filter the reads on MAPQ or on the NM tag.
Great tool. I'm wondering if you've thought about its utility for libraries other than WGS, specifically 10X linked libraries. It's hard to find a tool that generates a multi-individual VCF for these data. Most just call variants against the reference, which makes it hard to combine calls across individuals: is an individual missing a call because it is homozygous reference, or because there are no data? Your tool could help, but I wanted to see what you think about its utility for these data.