Given you have a SAM file and VCF file, is it possible to determine the mapQ score(s) for a particular variant? The overarching goal is to determine if a variant is a false positive because of non-unique mapping.
Thanks in advance.
Given you have a SAM file and VCF file, is it possible to determine the mapQ score(s) for a particular variant? The overarching goal is to determine if a variant is a false positive because of non-unique mapping.
Thanks in advance.
Variants don't have mapping qualities - the MAPQ refers to alignments.
What you could do however is examine the MAPQ of the alignments that do overlap with your variants and tabulate the MAPQ for these alignments. See if there is a pattern to them.
On the other hand, I would say this is a job that should/ought to be done already by the variant caller when producing the genotype likelihood of each allele.
Thanks for your reply, Istvan "What you could do however is examine the MAPQ of the alignments that do overlap with your variants and tabulate the MAPQ for these alignments. See if there is a pattern to them."
This is exactly what I want to accomplish. I am unsure how to go from the variant and determine which SAM reads it originates from.
Any advice would be appreciated.
I'd like to note, though, that some VCF files give an average MAPQ and base quality for a given variant, and some don't. VCF is extensible, so it's not really a format, but more of a format for formats. As such, you never know what might be inside a VCF, and it's never possible to access the information you want without studying the program that generated it. It's like a box of chocolates!
As such - it's difficult to suggest an optimal path without knowing what information you have at your disposal. If you only have VCF files of unknown origin, then no, you cannot determine anything, because the standard spec does not dictate that they contain any useful information.
It would be helpful if you said something like:
1) I received reads from platform A, in run mode B, with library prep C.
2) I preprocessed them using tools DEF with commands GHI.
3) I performed my analysis with tools JKL and commands MNO.
It's not all necessary - you may have stopped at step 1, which is fine. But if you proceeded further, such as to analysis, it's really useful if you explain exactly what you tried and what the results were.
The answer would depend on the scope of the problem and your familiarity with command line tools.
If you only had a few variants at low coverages say 50x, you could visualize the alignments in IGV hover with your mouse on each alignment that indicates a variant and read off the MAPQ. You'd be done in 5 minutes and you'd see if there is anything to it at all - is it a situation that is even worth pursuing.
If you had a lot many more variants or higher coverages you'd need to reach for command line tools and scripting as well:
Hello,
it depends on what variant caller you use. GATK for example outputs the average mapping quality of the reads at the variation site in the vcf. In addition, at heterozygot sites, it performs a mapping quality ranksum test, which is a test wether there is a stastic significant difference between the mapping qualities of reads showing the variant and those having the reference.
freebayes gives you seperate values for the average mapping quality of reads showing the variant and reads showing the reference.
fin swimmer
You could get the mapping qualities at the positions of the variants using mpileup on the SAM and take the mean/median.
Check the XA and XT tags if you generated them to see whether the position had alternate mappings and how close to your reads they were.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
most(all?) variant callers will ignore MAPQ=0
Thank you for the reply Pierre,
I am curious if during the alignment process there are two regions that are highly similar say a gene and it's known pseudogene. My theory is that if I examined the MAPQ scores for these reads, I would be able to determine if the read was originating from the gene or the pseudogene.
thanks!
If a read aligns equally well to two locations, both locations should get the same MAPQ (in this case, 3 or less). So you should not be able to distinguish between a read originating from a gene or pseudogene except in the regions where they differ.
Reads which map equally well on multiple locations (as in the situation you describe) will have a mapping quality of 0 and as Pierre said will be ignored by variant callers.