Can you assume variants not in VCF are all monomorphic for the reference allele?
2
1
Entering edit mode
6.8 years ago
wangdavid758 ▴ 30

I have a VCF file, lets call it file A, which was created by performing variant calling using GATK on whole genome sequencing data. For SNPs that do no appear in the VCF, can I assume that the SNP is monomorphic for the reference allele (i.e all 0 encoding). Note that sites that are monomorphic for the alternate allele appear in the file. I ask because I need to merge this file (A) with another VCF (file B) and I'm not sure how to handle the variants that appear in B but not A. Can I just fill in the genotypes with all 0 for these variants that do no appear in A, or do I have to impute first to account for the possibility of low coverage in the region? Would be great if answers provide a external source as well so I have to point of reference when I consult my advisor because he thinks that sequencing data should never require imputation. Thanks!

VCF Variant Calling Imputation Missing Variants • 4.3k views
ADD COMMENT
5
Entering edit mode
6.8 years ago

I wrote a tool to fix this problem : http://lindenb.github.io/jvarkit/FixVcfMissingGenotypes.html

(but it is slow)

After a VCF-merge, read a VCF, look back at some BAMS to tells if the missing genotypes were homozygotes-ref or not-called. If the number of reads is greater than min.depth, then the missing genotype is said hom-ref.

see also: Back-filling missing genotypes in merged VCF How to get sequencing depths from VCF with Rsamtools Call missing variants in VCF as reference allele

ADD COMMENT
1
Entering edit mode

Nice program, Pierre

ADD REPLY
1
Entering edit mode

It's worth noting that assuming hom-ref if coverage exceeds some threshold is a reasonable heuristic in most cases, the real situations can be more subtle. For example, the alignments at a site may have predominantly low MAPQ, or alternatively the alignments may contain enough mismatches that a confident hom-ref call is also inappropriate.

ADD REPLY
1
Entering edit mode

the real situations can be more subtle.

definitely

may have predominantly low MAPQ,

my tool can filter the reads on MAPQ or NM tag

ADD REPLY
0
Entering edit mode

Great tool. I'm wondering if you've thought about its utility for other libraries beyond WGS. Specifically 10X linked libraries. It's hard to find a tool that generates a multi individual vcf for these data. Most just call variants against the reference making it hard to combine across individuals - is one missing a call because it's homozygous reference or there are no data? Your tool could help but I wanted to see what you thought about its utility for these data.

ADD REPLY
1
Entering edit mode
6.8 years ago
JC 13k

Definitely to be sure you need to check for coverage across all regions, if coverage is good enough and variant calling will have no problems, you can assign the position is reference homogenic.

ADD COMMENT

Login before adding your answer.

Traffic: 2243 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6