Question

GATK MQ-scores differ depending on reference genome

3.0 years ago

axejen ▴ 10

Hi,

I am dealing with an issue that I can't wrap my head around, and would greatly appreciate any input on it. I am doing variant calling on a set of primate species from the genus Cercopithecus. I have mapped the reads to two separate reference genomes: the rhesus macaque (MMul), which is a fairly distant outgroup, and Chlorocebus sabaeus (ChlSab), which is much closer. When I follow the GATK best practices workflow, I notice that the "standard" filtration settings in VariantFiltration, specifically the MQ threshold of 40 (MQ40), remove a massive number of sites from the ChlSab variants (~45 %), but very few from the MMul set (~3 %). The distributions of the variants' MQ scores look very different depending on the reference genome (see plots). When I randomly inspect some of the filtered genotypes in IGV, they look fine to my eye. MQ is calculated as the root mean square of the mapping qualities of the reads covering a variant site, and I'm having a hard time coming up with an explanation for this large discrepancy between reference genomes.
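To make the root-mean-square definition concrete, here is a minimal Python sketch. It assumes MQ at a site is simply the RMS of the per-read mapping qualities; the function name `rms_mapping_quality` and the read values are hypothetical illustrations, not taken from GATK's code or from my data.

```python
import math


def rms_mapping_quality(mapqs):
    """Root mean square of per-read mapping qualities at one site."""
    if not mapqs:
        raise ValueError("need at least one mapping quality")
    return math.sqrt(sum(q * q for q in mapqs) / len(mapqs))


# Hypothetical site: 10 reads, all placed confidently (MAPQ 60 is an assumed value).
all_confident = [60] * 10
# Same depth, but half of the reads are ambiguously placed (MAPQ 0).
half_ambiguous = [60] * 5 + [0] * 5
# Same depth, six of ten reads ambiguously placed.
mostly_ambiguous = [60] * 4 + [0] * 6

for label, mapqs in [
    ("all confident", all_confident),
    ("half ambiguous", half_ambiguous),
    ("mostly ambiguous", mostly_ambiguous),
]:
    print(label, round(rms_mapping_quality(mapqs), 1))
```

In this toy example the site with half of its reads at MAPQ 0 still has an RMS of about 42.4, while the site with six of ten reads at MAPQ 0 drops to about 37.9 and would fail an MQ threshold of 40.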

Clearly, I cannot simply use the "standard" cutoff of 40, but given the MQ distribution I would more or less need to remove this filter altogether to avoid discarding too many seemingly good variants. I'm not very comfortable doing that without understanding the cause, though.

Has anybody come across something like this before, or does anybody have any ideas about why this may happen and how to deal with it?

Thanks,
Axel

[Plot: Variants called against the Chlorocebus sabaeus reference]

[Plot: Variants called against the Macaca mulatta reference]

gatk filtration variant vcf • 862 views

updated 3.0 years ago by lethalfang ▴ 160 • written 3.0 years ago by axejen ▴ 10

score 0 · Answer 1 · 2022-06-01

What's the MQ distribution for all the reads, not just the ones at variant sites? Mapping quality measures how confident the aligner is that a read comes from this part of the genome and not somewhere else. It is very much a function of the reference genome: if other regions of the genome are similar to this region, then the mapping quality of reads mapped to this region will generally be lower.
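One way to look at the mapping quality distribution for all reads is to tally the MAPQ column (the fifth field) of the SAM records. The sketch below is a minimal illustration under stated assumptions: the function names `mapq_histogram` and `fraction_below` are hypothetical, and the inline records are made-up examples standing in for real alignments.

```python
from collections import Counter


def mapq_histogram(sam_lines):
    """Count mapped reads per MAPQ value from SAM-formatted text lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        flag = int(fields[1])
        if flag & 0x4:  # FLAG bit 0x4 marks an unmapped read
            continue
        counts[int(fields[4])] += 1  # field 5 of a SAM record is MAPQ
    return counts


def fraction_below(counts, threshold):
    """Fraction of counted reads with MAPQ strictly below the threshold."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for q, n in counts.items() if q < threshold) / total


# Made-up records for illustration only: two confident reads, one ambiguous, one unmapped.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t120\t0\t50M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t16\tchr1\t130\t60\t50M\t*\t0\t0\t*\t*",
]

hist = mapq_histogram(example)
for mapq in sorted(hist):
    print(mapq, hist[mapq])
print("fraction below 40:", round(fraction_below(hist, 40), 3))
```

To run it on real data, the same function could be fed lines from `sys.stdin` with the output of `samtools view` piped in, once per reference. If the ChlSab alignments show a much larger share of low-MAPQ reads genome-wide, that would point to the reference rather than to the variant sites themselves.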