Hello everybody!
I'm doing a quality control to my HGAP assembly (using one SMRTcell library RSII system, P6C4) of a bacterial genome and to check the coverage distribution of my reads across the assembly I did a mapping using BWA MEM with the flag “-x pacbio”.
I checked the MAPQ distribution in my sam file cutting the fifth column. I found that 91.0% of reads are mapping with a MAPQ = 60 and just 1.0% are mapping with MAPQ = 0. Also reads mapping with MAPQ < 60 are mostly located at the ends, because doing a filtering (using samtools view -q 60) changes are just perceptible at both ends. This is normal because the circular structure of the chromosome generates not real repeated ends. So, I've found in some blogs that MAPQ = 0 indicates that the read maps to multiple locations, then removing MAPQ = 0 mapping reads is affecting mainly those reads mapping at both repeated ends.
I validated those results using a demo library of E. coli k12 RSI, and I had the same results...
My questions are: which is the meaning of MAPQ = 60? Why most of the reads (>90%) are mapping with this quality value? I though that a pacbio library it would generate lower map quality scores having into account that PacBio is a prone error technique...
This is the distribution of my MAPQ scores (first line) vs number of reads mapping with that quality (second line):
(first line) 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
(second line) 118940 130 142 141 145 137 143 125 131 132 162 149 159 144 174 133 164 158 180 152 181 189 167 149 175 160 165 185 171 169 209 183 182 185 165 179 160 210 201 161 182 172 163 175 159 156 144 145 119 109 111 89 79 61 61 59 62 94 175 200 6508