With a small simulation, as far as I can tell the meanings are:
60 - uniquely mapped read, regardless of number of mismatches / indels
1 - multiply mapped, perfect match or few mismatches / indels
0 - unmapped, or multiply mapped and with lots of mismatches / indels
Sadly, I don't know how many are "few" and "lots", but nonetheless, multiply mapped reads have either MAPQ 0 or 1, whereas uniquely mapped reads have MAPQ of 60.
Looking quickly at the code is more confusing. Apparently, the relevant bits are in unique.h
and unique.cpp
. There are three versions of a MAPQ algorithm. The code seems to be inherited from Bowtie2, and there are several passages with values like 44, 42, 40, 24, 23, and so on. For example, on unique.cpp
:
// There is no valid second-best alignment and the best alignment has a
// perfect score.
const TMapq pair_nosec_perf = 44;
Or in unique.h
, on the V2 of the MAPQ algorithm:
// End-to-end alignment
if(!hasSecbest) {
if (bestOver >= diff * (double)0.8f) ret = 42;
else if(bestOver >= diff * (double)0.7f) ret = 40;
else if(bestOver >= diff * (double)0.6f) ret = 24;
else if(bestOver >= diff * (double)0.5f) ret = 23;
else if(bestOver >= diff * (double)0.4f) ret = 8;
else if(bestOver >= diff * (double)0.3f) ret = 3;
else ret = 0;
So, at first glance, like for Bowtie2 ( How does bowtie2 assign MAPQ scores? ), mismatches would affect MAPQ. However, this is not what happens, and my C++ skills have been nearly exhausted with this exercise.
edit: I am not the first one to have been fooled by the code, see MAPQ values are really useful but their implementation is a mess.
Thanks a lot guys..you both seem to be right about this. Reads which have a MAPQ of 60 will also have NH tag (no. of hits) set as 1. If someone is interested in only uniquely mapped reads and is using HISAT v2.0.4, MAPQ can be used to get such reads or else only use NH tag to get those reads (different aligners use different scoring scheme, hence the NH tag is useful).
Although summarising the shared articles, to be really frank, there should not be anything called as a uniquely mapped read because the genome will have repeats and introducing one or more mismatches will allow the reads to be mapped elsewhere throwing the concept of 'uniqueness' to the bin..only one thing can be true, higher the MAPQ, the aligner is sure that the read has come from the assigned position (because its a probability)..One should choose an appropriate threshold keeping the error rate in mind.
Although I followed this approach for several time, I realized that it is not the best one. I found at least one instance in which these fields provide wrong results Here the results I got with
Here instead what I got with the command
which also reports secondary alignments
While trying to understand the issue I came across this post, in which I think macspider gave the right suggestion, as also reported on hisat2 manual, i.e. that uniquely mapping reads are those NOT having the ZS:i field. So, I now changed my way of determining the number of uniquely mapping reads. I am very anxious to hear other opinions.
Please use
ADD COMMENT
orADD REPLY
to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.