Hi,
I am dealing with an issue that I can't wrap my head around, and would greatly appreciate any input on this. I am working on variant calling on a set of primate species from the Cercopithecus genus. I have mapped to two separate reference genomes, one is the rhesus macaque (MMul) which is a fairly distant outgroup, and the other is the Chlorocebus sabaeus (ChlSab) which is much closer. When I follow the gatk best practices workflow, I notice that the "standard" filtration settings in VariantFiltration, specifically the MQ threshold of 40 (MQ40), removes a massive amount of sites in the ChlSab-variants (~45 %), but very few in the MMul-set (~3 %). The distributions of the variants' MQ-score look very different depending on the reference genome (see plots). When I randomly inspect some of these filtered genotypes in IGV, they look fine to my eye. The MQ is calculated from the root mean square mapping quality of the variants, and I'm having a hard time coming with an explanation to this large discrepancy between reference genomes.
Clearly, I cannot simply use the "standard" cutoff at 40, but given the MQ-distribution I would more or less need to remove this filter altogether not to discard too many seemingly good variants. This I'm not very comfortable doing without understanding the cause behind this, though.
Has anybody come across something like this before, or does anybody have any ideas about why this may happen and how to deal with it?
Thanks, Axel