I'm familiar with the notion that MAPQ should be theoretically be related to the probability of an "incorrect" alignment (10^(-MAPQ/10)), although this is vaguely-specified enough that aligners in practice tend to actually just use something like the above.
Note that this is different than the BWAMAPQ scoring interpretation, because BWA-MEM gives MAPQ scores in the range [0,60], rather than [0,37] as has been established for BWA.
My experience is that MAPQ in bwa-aln has almost no predictive value, but it is pretty useful in bwa-mem and seems to correlate with the stated definition.
Note that statements such as "maps to 2 locations" have little meaning. Any read can map to (or originate from) any location in any genome. Even if you add "...equally well", that's still subjective. So, ultimately, MAPQ is just the aligner's confidence that the read originated from the stated location, and Tophat apparently does not attempt to implement the specification. Other aligners typically do. No per-aligner decoding table is needed; if the MAPQ is 3 or less, the aligner has roughly 50% or lower confidence that the mapping is correct, implying (but not requiring) that there is at least one other location with similarly high confidence.
I generated synthetic data and mapped it to create a ROC curve a while back, and my recollection was that it looked reasonable. You can do this yourself if you want for any aligner with BBMap's randomreads.sh and samtoroc.sh. It only works with single-ended reads though because of the sam specification's requirement for renaming reads.
From what I can see, the SAM specification document (http://samtools.github.io/hts-specs/SAMv1.pdf) states that the mapping quality can be anything from 0 to (2^8)-1, i.e., 0-255. See the table in section 1.4 of the document.
Regarding the values used by TopHat, it's regrettable that it drifts from the standard specification for MAPQ, but things like this happen in bioinformatics.
Yes, I realize that anything from 0 to 255 is acceptable according to the spec, but the meaning (in terms of number of mapping locations, likely) of the very small subset of that range that's actually used in practice by BWA-MEM is what I'm interested in.
As far as I know, BWA mem assignes 0-60 and 255, with 0 = equally well mapping to > 1 location, 255 = MAPQ not available and 1-60 the transformed probability as you stated. Every aligner uses different MAPQ scoring. Google around a bit, there a plenty of blogs about the MAPQ mess with the common alignment tools.
Note that BWA-MEM is not the same as BWA algorithmically and thus in terms of MAPQs produced, and lack of finding anything on Google® for BWA-MEM specifically led to this question.
Yes BWA mem doesn't go above 60, a value which, if interpreted literally and untransformed, means that the read has been aligned with a 1-in-1 million error rate. Over the genomic scale, I guess that this is still not sufficient.
Lately I have been choosing only uniquely mapping reads, and dealing with the lost coverage. However, validation work we did in clinical genetics shows that variants are best called at 30 read depth (not 100, not 1000, etc)., with a minimum 'comfortable' cutoff of 18.
MAPQ scores are not meaningful because BAM is not meaningful - or rather, the field has yet to define the difference between read-alignments (what BAM officially stores) and fragment-alignments (what most aligners produce).
But attempting to create a common standard is impossible. People want different things out of the MAPQ score, and so there can not be a single standard. Some people want it to represent how well the alignment aligns, others how well the fragment in it's entirety aligns, others a combination of both, sometimes with and sometimes without taking into account how many other possible alignments there might be for the read/fragment. It's a mess, and it all stems from the fact that pretty much every aligner produces non-standard BAMs except BWA, which produces BAMs no one would use if they needed to store more than one alignment per sequencing event.
My experience is that MAPQ in bwa-aln has almost no predictive value, but it is pretty useful in bwa-mem and seems to correlate with the stated definition.
Note that statements such as "maps to 2 locations" have little meaning. Any read can map to (or originate from) any location in any genome. Even if you add "...equally well", that's still subjective. So, ultimately, MAPQ is just the aligner's confidence that the read originated from the stated location, and Tophat apparently does not attempt to implement the specification. Other aligners typically do. No per-aligner decoding table is needed; if the MAPQ is 3 or less, the aligner has roughly 50% or lower confidence that the mapping is correct, implying (but not requiring) that there is at least one other location with similarly high confidence.
Very good point regarding the meaninglessness of "maps to n locations."
Interesting. Do you have any more evidence on this? If so, might be worth a blog post like the above.
I generated synthetic data and mapped it to create a ROC curve a while back, and my recollection was that it looked reasonable. You can do this yourself if you want for any aligner with BBMap's randomreads.sh and samtoroc.sh. It only works with single-ended reads though because of the sam specification's requirement for renaming reads.
From what I can see, the SAM specification document (http://samtools.github.io/hts-specs/SAMv1.pdf) states that the mapping quality can be anything from 0 to (2^8)-1, i.e., 0-255. See the table in section 1.4 of the document.
Regarding the values used by TopHat, it's regrettable that it drifts from the standard specification for MAPQ, but things like this happen in bioinformatics.
Kevin
Yes, I realize that anything from 0 to 255 is acceptable according to the spec, but the meaning (in terms of number of mapping locations, likely) of the very small subset of that range that's actually used in practice by
BWA-MEM
is what I'm interested in.As far as I know, BWA mem assignes 0-60 and 255, with 0 = equally well mapping to > 1 location, 255 = MAPQ not available and 1-60 the transformed probability as you stated. Every aligner uses different MAPQ scoring. Google around a bit, there a plenty of blogs about the MAPQ mess with the common alignment tools.
Note that
BWA-MEM
is not the same asBWA
algorithmically and thus in terms ofMAPQ
s produced, and lack of finding anything on Google® forBWA-MEM
specifically led to this question.Yes BWA mem doesn't go above 60, a value which, if interpreted literally and untransformed, means that the read has been aligned with a 1-in-1 million error rate. Over the genomic scale, I guess that this is still not sufficient.
Lately I have been choosing only uniquely mapping reads, and dealing with the lost coverage. However, validation work we did in clinical genetics shows that variants are best called at 30 read depth (not 100, not 1000, etc)., with a minimum 'comfortable' cutoff of 18.
Kevin
Just quibbling, but the blog post you linked found for Tophat 1.4.1:
And for Tophat2:
...herein lies the utter confusion and madness of how various programs interpret MAPQ. The SAM specification clearly states that
Well, for Bowtie, it isn't available :)
Maybe someone should make an attempt to find a common standard^^
I was think of changing it from a number to just the word "CORRECT" or "WRONG" which seems more straightforward.
but incompatible with every downstream software out there, right?
Ah, you got me there =)