A discussion recently arose about how one ought to filter MAPQ in a clinical setting, i.e., where an NGS sample is being processed in order to produce a result for a patient who has an unknown or hypothesised diagnosis. The result could obviously be critical.
A friend suggested that a MAPQ of 20 would be a sufficient cutoff, whereas I stated that it ought to be as high as 60. Another colleague implied that my high cutoff didn't make sense because each region of the genome is covered by reads at varying MAPQ, and that there would be many reads over each region (many with high MAPQ, I assume s/he meant).
Keep in mind that BWA is being used, which produces MAPQ in the range 0-60. Also, I generally drop to as low as MAPQ 40 in clinical pipelines and then rely on a range of other metrics to ensure that only true variants are called, which are then confirmed with Sanger sequencing.
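To put the cutoffs of 20, 40, and 60 on a common scale: the SAM specification defines MAPQ as -10 * log10 of the probability that the mapping position is wrong, rounded to the nearest integer. A minimal sketch of that conversion follows. The function name is my own, and the numbers are nominal only, since each aligner estimates MAPQ with its own heuristics and BWA caps it at 60.

```python
def mapq_to_error_prob(mapq: int) -> float:
    """Nominal probability that a read is mapped to the wrong position.

    Follows the Phred-scaled definition in the SAM specification:
    MAPQ = -10 * log10(P_wrong), hence P_wrong = 10 ** (-MAPQ / 10).
    Aligners only approximate this, so treat the result as a guide.
    """
    if mapq < 0:
        raise ValueError("MAPQ cannot be negative")
    return 10 ** (-mapq / 10)


if __name__ == "__main__":
    # Compare the cutoffs that came up in the discussion.
    for cutoff in (20, 40, 60):
        print(f"MAPQ {cutoff}: nominal P(wrong position) = {mapq_to_error_prob(cutoff):.0e}")
```

On this nominal scale, a cutoff of 20 tolerates roughly 1 mismapped read in 100, whereas 40 corresponds to roughly 1 in 10,000.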
For the record: >50% of the genome exhibits a high level of homology and there are certain regions that will simply never attain a MAPQ >30 due to their high level of homology. Look at the CYP genes, for example. Some of the exons of these just cannot be reliably sequenced using the standard NGS protocols. Some reads do map to these highly homologous regions. For example, at MAPQ 60, you may get coverage of around 10 or 20, whereas other less homologous regions may get >1000.
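To see how much a given cutoff costs in coverage, one can count the reads that survive each threshold. The sketch below reads MAPQ from the fifth tab-separated column of SAM text records, which is where the SAM specification puts it. The function name and the toy records are made up for illustration and are not real data. In practice the same filter is usually applied with something like `samtools view -q 40`, which skips alignments with MAPQ below the given value.

```python
from typing import Dict, Iterable, Sequence


def count_reads_by_mapq(sam_lines: Iterable[str], thresholds: Sequence[int]) -> Dict[int, int]:
    """Count SAM records whose MAPQ is at or above each threshold."""
    counts = {t: 0 for t in thresholds}
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # column 5 of a SAM record is MAPQ
        for t in thresholds:
            if mapq >= t:
                counts[t] += 1
    return counts


# Hypothetical records (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL).
TOY_SAM = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t101\t60\t4M\t*\t0\t0\tCGTA\tIIII",
    "r3\t0\tchr1\t102\t42\t4M\t*\t0\t0\tGTAC\tIIII",
    "r4\t0\tchr1\t103\t25\t4M\t*\t0\t0\tTACG\tIIII",
    "r5\t0\tchr1\t104\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r6\t0\tchr1\t105\t0\t4M\t*\t0\t0\tCGTA\tIIII",
]

if __name__ == "__main__":
    for cutoff, n in count_reads_by_mapq(TOY_SAM, [20, 40, 60]).items():
        print(f"MAPQ >= {cutoff}: {n} reads")
```

Running this kind of count per target region, rather than over the whole file, is one way to spot the homologous exons where a strict cutoff leaves too little depth to call anything.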
Remember that this is a clinical setting where a result can change a person's life. As the analyst, would you sign your name on a clinical report, a document type that has legal weight, knowing that you let these low MAPQ reads through?
A second issue, that of putting too much focus on MAPQ, also arose. Of course, there are countless other QC metrics to use, but MAPQ is one of the first applied and therefore one of the most important. If you get it wrong, a lot of your results may end up being false positives.
Cheers for any comments!
"As the analyst", I wouldn't trust the MAPQ alone (e.g., I would also check mappability, clipping, GC%, poly-X, IGV visualisation, depth, etc.) and I would always ask for good old Sanger sequencing to confirm any suspicious mutation.
Thanks very much for the reply, Pierre. I can only agree with you.
Hi, may I ask which MAPQ score you settled on, please? Thank you.
In my most recent work, I use MAPQ 60 via the BWA route. However, ultimately, I don't think that it matters too much if you also rely on other metrics for filtering.
Thank you for your response. I have read your recent work and it is really impressive. Could you please provide information on any other relevant metrics we should consider besides MAPQ? Really sorry for asking so much, and thank you.
Thanks for the comments. For single nucleotide variants, other key ones:
Variant calling shouldn't be complex, but many tools over the years have made it extremely complex, e.g., GATK, DeepVariant, etc. It remains that, with BWA and SAMtools, our clinical lab produced a clinical workflow with consistent 100% agreement with Sanger sequencing over our panel of genes of interest.
Thanks a lot for your comments! Your insights on the key metrics for single nucleotide variants are really helpful. Really appreciate your thorough response.