Question

Is it good practice to convert lowest-quality genotypes to missing?

0


8.6 years ago

William ★ 5.3k

If you look at the genotype quality distribution of a VCF, you often see a small peak of genotypes with zero quality.

This means that two or more genotypes (HOM_REF, HET, HOM_ALT) are equally likely and one of them has been chosen more or less at random.
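To make the "equally likely" point concrete, here is a minimal sketch. It assumes the GATK-style convention in which PL holds normalized Phred-scaled genotype likelihoods and GQ is the gap between the two smallest PL values, capped at 99. Other callers may define GQ differently, and `gq_from_pl` is a hypothetical helper name, not part of any library.

```python
def gq_from_pl(pl, cap=99):
    """Return a GQ value as the gap between the two smallest PL values.

    Assumption: GATK-style PLs (best genotype has PL 0) and a cap of 99.
    """
    ordered = sorted(pl)
    return min(ordered[1] - ordered[0], cap)


# PL order for a biallelic site: HOM_REF, HET, HOM_ALT
print(gq_from_pl([0, 0, 45]))    # HOM_REF and HET tie, so GQ is 0
print(gq_from_pl([0, 60, 900]))  # HOM_REF is clearly best, so GQ is 60
```

Under that convention, a GQ of 0 says nothing about which of the tied genotypes is correct; the reported GT is only a tie-break.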

Are downstream applications (tools / libraries) in general tolerant of low-quality genotypes? Do they understand the difference between a qual 100 and a qual 0 genotype, and do they make use of this genotype quality information?

Or does it make more sense to set the qual 0 genotypes to missing for some downstream purposes, since you don't have any conclusive data for those genotypes, and have the downstream tool / application work only with high-quality genotypes?
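For illustration, setting low-quality genotypes to missing could look roughly like the sketch below. It works on a single tab-separated VCF data line, assumes FORMAT is in column 9 and contains both GT and GQ, and uses a hypothetical function name (`mask_low_gq`) and threshold parameter (`min_gq`). It is not how any particular tool does it.

```python
def mask_low_gq(vcf_line, min_gq=1):
    """Set GT to missing for samples whose GQ is below min_gq.

    Assumption: a tab-separated VCF data line with FORMAT in column 9.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    if "GT" not in keys or "GQ" not in keys:
        return "\t".join(fields)  # nothing to filter on
    gt_i, gq_i = keys.index("GT"), keys.index("GQ")
    for col in range(9, len(fields)):
        values = fields[col].split(":")
        if gq_i >= len(values) or values[gq_i] == ".":
            continue  # GQ dropped or missing: leave the sample untouched
        if float(values[gq_i]) < min_gq:
            # Keep ploidy and the phasing separator, blank out the alleles.
            sep = "|" if "|" in values[gt_i] else "/"
            n_alleles = len(values[gt_i].replace("|", "/").split("/"))
            values[gt_i] = sep.join(["."] * n_alleles)
            fields[col] = ":".join(values)
    return "\t".join(fields)


line = "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t0/1:0\t1/1:99"
# The first sample's GT becomes ./. while the second sample is unchanged.
print(mask_low_gq(line))
```

With the default `min_gq=1` only the GQ 0 genotypes are masked; raising the threshold trades more missing data for more reliable calls.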

vcf qc genotypes • 1.8k views

updated 8.6 years ago by Zev.Kronenberg 12k • written 8.6 years ago by William ★ 5.3k

3


It would be so nice if there were a simple answer to this question! The answer depends very much on what you want to do with the data. For example, some analyses are very sensitive to missing heterozygotes while others do not mind at all. Sometimes you want the full genome represented; at other times you want a much smaller but very reliable set of variants. You usually have to try different combinations and compare the results.

Just to add: there are also tools like ANGSD that handle genotype uncertainty within a probabilistic framework, so they don't rely on actual calls/no-calls.

8.6 years ago by abascalfederico ★ 1.2k

score 2 · Accepted Answer · 2016-03-31

There are a growing number of tools that take the genotype likelihoods into account, as abascalfederico points out.

Another suggestion is to use phasing software like BEAGLE or SHAPEIT. These programs can correct genotyping errors by leveraging haplotype and population-level data. Usually only a small number of genotypes are changed by phasing.

Here are the genotypes I do not use:

1. Sites in Heng's LCR
2. Gaps, centromeric and telomeric regions
3. WGAC (segmental duplications)
4. Sites that fail VQSR

If you're interested, I have the BED file in hg19 coordinates for 1-3.
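As a rough sketch of how such a BED file could be applied, the code below loads the excluded intervals and checks whether a VCF position falls inside one of them. The function names (`load_bed`, `in_excluded_region`) are hypothetical, and the sketch assumes standard BED coordinates (0-based, half-open) against 1-based VCF positions. It only tests POS, not the full span of the REF allele.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into per-chromosome sorted, merged (start, end) lists."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, ivals in raw.items():
        ivals.sort()
        out = [ivals[0]]
        for start, end in ivals[1:]:
            if start <= out[-1][1]:  # overlapping or adjacent: extend
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_excluded_region(regions, chrom, pos):
    """True if 1-based VCF position pos lies in a 0-based, half-open interval."""
    ivals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based.
    i = bisect.bisect_right(ivals, (zero_based, float("inf"))) - 1
    return i >= 0 and ivals[i][0] <= zero_based < ivals[i][1]


# Toy intervals for illustration only, not real hg19 annotations.
regions = load_bed(["chr1\t100\t200", "chr1\t150\t300", "chr2\t0\t10"])
print(in_excluded_region(regions, "chr1", 101))  # True
print(in_excluded_region(regions, "chr1", 301))  # False
```

Merging the intervals up front means each lookup only needs one binary search per position, which matters when the exclusion lists are large.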