Question

How is the QUAL score calculated in a multi-genome VCF file?

9.3 years ago

MAPK ★ 2.1k

Suppose I have two VCF files with one sample in each and want to merge them using Python scripts. Should the QUAL scores in these files be summed or averaged to get the final QUAL score? Can someone also please explain how this score differs when merging two files with different numbers of samples?

vcf • 3.2k views

updated 9.3 years ago by Len Trigg ★ 1.7k • written 9.3 years ago by MAPK ★ 2.1k

score 2 · Accepted Answer · 2016-05-05

So, in simple terms, QUAL is defined as a (Phred-scaled) probability measure that the site is variant in any of the samples. Note that it is therefore not a good measure of whether a particular sample in a multisample VCF is variant; for that you are probably better off with GQ.

So, say you know the probability that the site is variant for sample A from one VCF, P(A), and similarly for the second sample B from the other VCF, P(B). What you need for the combined VCF is P(A ∪ B) = P(A) + P(B) - P(A ∩ B). See that extra term that isn't available? That's the probability that the site is variant in both samples, and it will vary according to the independence of A and B. If the samples are unrelated you can possibly ignore the term, whereas if the samples are highly related (e.g. same family), it should definitely not be ignored. Variant callers like the RTG pedigree-aware callers incorporate this information into their scoring, and it can be quite hard to bolt on after the fact. You'll have to make simplifying assumptions depending on your samples.
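To make the formula above concrete, here is a rough Python sketch, not what any particular caller does. It assumes QUAL is Phred-scaled, i.e. QUAL = -10 * log10(1 - P(variant)), and the function names (`qual_to_prob`, `prob_to_qual`, `combine_qual`) are made up for illustration. The joint term P(A ∩ B) is not in either VCF, so you have to supply it yourself. The two choices shown are simplifying assumptions: full independence, P(A ∩ B) = P(A) * P(B), and complete overlap, P(A ∩ B) = min(P(A), P(B)).

```python
import math


def qual_to_prob(qual):
    """Phred-scaled QUAL -> probability that the site is variant."""
    return 1.0 - 10.0 ** (-qual / 10.0)


def prob_to_qual(p_variant):
    """Probability that the site is variant -> Phred-scaled QUAL."""
    p_error = 1.0 - p_variant
    if p_error <= 0.0:
        return float("inf")  # certainty cannot be expressed on the Phred scale
    return -10.0 * math.log10(p_error)


def combine_qual(qual_a, qual_b, p_both):
    """Apply P(A or B) = P(A) + P(B) - P(A and B) and rescale to QUAL.

    p_both is the caller-supplied P(A and B); it is an assumption,
    because neither single-sample VCF contains it.
    """
    p_a = qual_to_prob(qual_a)
    p_b = qual_to_prob(qual_b)
    return prob_to_qual(p_a + p_b - p_both)


qual_a, qual_b = 20.0, 30.0
p_a, p_b = qual_to_prob(qual_a), qual_to_prob(qual_b)

# Assumption 1: samples are independent, so P(A and B) = P(A) * P(B).
independent = combine_qual(qual_a, qual_b, p_a * p_b)

# Assumption 2: complete overlap, so P(A and B) = min(P(A), P(B)).
overlapping = combine_qual(qual_a, qual_b, min(p_a, p_b))

print(independent, overlapping)
```

Under these assumptions the independent case works out to roughly the sum of the two QUAL values (the error probabilities multiply), and the complete-overlap case to roughly the larger of the two. Averaging does not fall out of either case, and related samples will land somewhere in between.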