Question

HISAT2 MAPQ scoring scheme

1

Entering edit mode

7.7 years ago

prasundutta87 ▴ 670

Hi,

On checking out MAPQ scores of my BAM file, I found that there are only three unique scores-0,1 and 60. What do the scores mean? I have checked their manual and its not mentioned there. (I am using HISAT2 version 2.0.4), where the highest score has been reduced from 255 to 60.

alignment • 9.4k views

ADD COMMENT • link 6.4 years ago by prasundutta87 ▴ 670

1

Entering edit mode

Thanks a lot guys..you both seem to be right about this. Reads which have a MAPQ of 60 will also have NH tag (no. of hits) set as 1. If someone is interested in only uniquely mapped reads and is using HISAT v2.0.4, MAPQ can be used to get such reads or else only use NH tag to get those reads (different aligners use different scoring scheme, hence the NH tag is useful).

Although summarising the shared articles, to be really frank, there should not be anything called as a uniquely mapped read because the genome will have repeats and introducing one or more mismatches will allow the reads to be mapped elsewhere throwing the concept of 'uniqueness' to the bin..only one thing can be true, higher the MAPQ, the aligner is sure that the read has come from the assigned position (because its a probability)..One should choose an appropriate threshold keeping the error rate in mind.

ADD REPLY • link 5.9 years ago by prasundutta87 ▴ 670

1

Entering edit mode

Although I followed this approach for several time, I realized that it is not the best one. I found at least one instance in which these fields provide wrong results Here the results I got with

hisat2 reference --no-unal -1 small_LM3_R1.fastq -2 small_LM3_R2.fastq -S small_LM3.sam

HWI-H201:209:C231LACXX:2:1101:18921:2359    89  TRINITY_DN25462_c2_g10_i2   106 60  1S18M161N82M    =   106 0   TTCTTCTTCTCTTCCAATGGAATTGTGACTGCCACAATGGAGGAAAACATAATAAGGCCTGAGCAGGTAACCAAGAAGATTGTACAACGACAGGTGCGGGG   39CCA?CCDDDCCCCDDEDCDEEEECFFFFFGGHHHGIIJIGHIJJHJIHFIIJJJIJJJIJJJIJIJJIJJJJJJIJJJIIIIJJJJHHHHHFFFFFCCC   AS:i:-1 ZS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100    YT:Z:UP XS:A:+  NH:i:1

Here instead what I got with the command

 hisat2 reference --no-unal -1 small_LM3_R1.fastq -2 small_LM3_R2.fastq --secondary -S small_sec_LM3.sam

which also reports secondary alignments

HWI-H201:209:C231LACXX:2:1101:18921:2359    89  TRINITY_DN25462_c2_g10_i2   106 60  1S18M161N82M    =   106 0   TTCTTCTTCTCTTCCAATGGAATTGTGACTGCCACAATGGAGGAAAACATAATAAGGCCTGAGCAGGTAACCAAGAAGATTGTACAACGACAGGTGCGGGG   39CCA?CCDDDCCCCDDEDCDEEEECFFFFFGGHHHGIIJIGHIJJHJIHFIIJJJIJJJIJJJIJIJJIJJJJJJIJJJIIIIJJJJHHHHHFFFFFCCC   AS:i:-1 ZS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100    YT:Z:UP XS:A:+  NH:i:4
HWI-H201:209:C231LACXX:2:1101:18921:2359    345 TRINITY_DN25462_c2_g10_i1   106 60  1S100M  =   106 0   TTCTTCTTCTCTTCCAATGGAATTGTGACTGCCACAATGGAGGAAAACATAATAAGGCCTGAGCAGGTAACCAAGAAGATTGTACAACGACAGGTGCGGGG   39CCA?CCDDDCCCCDDEDCDEEEECFFFFFGGHHHGIIJIGHIJJHJIHFIIJJJIJJJIJJJIJIJJIJJJJJJIJJJIIIIJJJJHHHHHFFFFFCCC   AS:i:-1 ZS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100    YT:Z:UP NH:i:4
HWI-H201:209:C231LACXX:2:1101:18921:2359    345 TRINITY_DN25462_c2_g10_i2   105 60  19M161N82M  =   105 0   TTCTTCTTCTCTTCCAATGGAATTGTGACTGCCACAATGGAGGAAAACATAATAAGGCCTGAGCAGGTAACCAAGAAGATTGTACAACGACAGGTGCGGGG   39CCA?CCDDDCCCCDDEDCDEEEECFFFFFGGHHHGIIJIGHIJJHJIHFIIJJJIJJJIJJJIJIJJIJJJJJJIJJJIIIIJJJJHHHHHFFFFFCCC   AS:i:-3 ZS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0A100  YT:Z:UP XS:A:+  NH:i:4
HWI-H201:209:C231LACXX:2:1101:18921:2359    345 TRINITY_DN25462_c2_g10_i1   105 60  101M    =   105 0   TTCTTCTTCTCTTCCAATGGAATTGTGACTGCCACAATGGAGGAAAACATAATAAGGCCTGAGCAGGTAACCAAGAAGATTGTACAACGACAGGTGCGGGG   39CCA?CCDDDCCCCDDEDCDEEEECFFFFFGGHHHGIIJIGHIJJHJIHFIIJJJIJJJIJJJIJIJJIJJJJJJIJJJIIIIJJJJHHHHHFFFFFCCC   AS:i:-3 ZS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0A100  YT:Z:UP NH:i:4

While trying to understand the issue I came across this post, in which I think macspider gave the right suggestion, as also reported on hisat2 manual, i.e. that uniquely mapping reads are those NOT having the ZS:i field. So, I now changed my way of determining the number of uniquely mapping reads. I am very anxious to hear other opinions.

ADD REPLY • link 5.1 years ago by Fabio Marroni ★ 3.0k

0

Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLY • link 6.4 years ago by WouterDeCoster 47k

score 6 · Answer 1 · 2018-07-19

With a small simulation, as far as I can tell the meanings are:

60 - uniquely mapped read, regardless of number of mismatches / indels
 1 - multiply mapped, perfect match or few mismatches / indels
 0 - unmapped, or multiply mapped and with lots of mismatches / indels

Sadly, I don't know how many are "few" and "lots", but nonetheless, multiply mapped reads have either MAPQ 0 or 1, whereas uniquely mapped reads have MAPQ of 60.

Looking quickly at the code is more confusing. Apparently, the relevant bits are in unique.h and unique.cpp. There are three versions of a MAPQ algorithm. The code seems to be inherited from Bowtie2, and there are several passages with values like 44, 42, 40, 24, 23, and so on. For example, on unique.cpp:

// There is no valid second-best alignment and the best alignment has a
// perfect score.
const TMapq pair_nosec_perf = 44;

Or in unique.h, on the V2 of the MAPQ algorithm:

// End-to-end alignment
if(!hasSecbest) {
    if     (bestOver >= diff * (double)0.8f) ret = 42;
    else if(bestOver >= diff * (double)0.7f) ret = 40;
    else if(bestOver >= diff * (double)0.6f) ret = 24;
    else if(bestOver >= diff * (double)0.5f) ret = 23;
    else if(bestOver >= diff * (double)0.4f) ret = 8;
    else if(bestOver >= diff * (double)0.3f) ret = 3;
    else ret = 0;

So, at first glance, like for Bowtie2 ( How does bowtie2 assign MAPQ scores? ), mismatches would affect MAPQ. However, this is not what happens, and my C++ skills have been nearly exhausted with this exercise.

edit: I am not the first one to have been fooled by the code, see MAPQ values are really useful but their implementation is a mess.

score 2 · Answer 2 · 2018-07-18

Looking in the hisat2 source code, hisat_bp.cpp, line 778:

 Reporting:" << endl
        << "  (default)          look for multiple alignments, report best, with MAPQ" << endl
        << "   OR" << endl
        << "  -k <int>           report up to <int> alns per read; MAPQ not meaningful" << endl
        << "   OR" << endl
        << "  -a/--all           report all alignments; very slow, MAPQ not meaningful" << endl

The default reporting mode is "-k 5", which according to the comments in the code the mapq value is not meaningful. The comment in the code indicates that reporting the best alignment is the default reporting mode, but that doesn't match what the manual says.