Question

Recalculation Of Base Qualities After Realignment



12.6 years ago

Russh ★ 1.2k

hi

apologies if this is a trivial question.

I've mapped SOLiD reads using BFAST and subsequently realigned them with SRMA. There are considerably fewer one-off mismatch errors in the SRMA-modified data (in fact, SRMA looks like a superior tool to GATK's local realigner, which I was previously using).

At those bases where SRMA has altered the initial mapping, the base qualities all seem to be set to 1. I'm keen to work out what base quality BFAST would have given these bases had it originally chosen the SRMA mapping instead of the alignment that it actually chose. I can't find such a tool in SRMA or BFAST, but I would like to know whether a tool that can do this already exists (doubtless it would be painfully slow if coded by my own hand).

For example, a read mapped in RB1 prior to realignment gives:

6625791961 0 chr13 49055589 255 50M * 0 0

TTTTAGGAAAATCACTTTGTCTAACTCAGACTTATTTTTAAAAAGAAATC

6OWGVWCPNCVK"4NFB:69'"77"%=<>D)'::F-"9ED8%

XA:i:2 MD:Z:30A19 XE:Z:---------------------0-------0---0-----2----1----- PG:Z:bfast IH:i:1 NH:i:1 HI:i:1 CM:i:5 NM:i:1 CQ:Z:%(>>A-1<>@.,;=<)1<>=%%8-0)(%+%%)%%)+(-.&,%,1%%+1*% AS:i:1500 CS:Z:T00003202000321120011203012212012003000020000120032

The same read after realignment gives:

6625791961 0 chr13 49055589 255 50M * 0 0

TTTTAGGAAAATCACTTTGTCTAACTCAGAATTATTTTTAAAAAGAAATC

6OWGVWCPNCVK"4NFB:69'""""%=<>D)'::F-"9ED8%

XC:i:683 XE:Z:---------------------0-------012-0-----2----1----- PG:Z:srma NM:i:0 CQ:Z:%(>>A-1<>@.,;=<)1<>=%%8-0)(%+%%)%%)+(-.&,%,1%%+1*% AS:i:-33 CS:Z:T00003202000321120011203012212012003000020000120032

The MAPQ is the same, but the base qualities around the C/A change differ.
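To make the difference concrete, here is a minimal Python sketch that compares the two records shown above. It assumes the quality strings are Sanger-encoded (Phred+33), and the helper names `phred33` and `changed_positions` are made up for illustration. The quality strings as pasted are shorter than the 50-base reads, so the quality offsets below refer to the pasted strings only, not to read positions.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in qual]


def changed_positions(before, after):
    """Return the 0-based offsets at which two equal-length strings differ."""
    if len(before) != len(after):
        raise ValueError("strings must have the same length")
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


# SEQ fields copied from the two records above.
seq_bfast = "TTTTAGGAAAATCACTTTGTCTAACTCAGACTTATTTTTAAAAAGAAATC"
seq_srma = "TTTTAGGAAAATCACTTTGTCTAACTCAGAATTATTTTTAAAAAGAAATC"

# QUAL fields exactly as pasted above (they appear truncated).
qual_bfast = "6OWGVWCPNCVK\"4NFB:69'\"77\"%=<>D)'::F-\"9ED8%"
qual_srma = "6OWGVWCPNCVK\"4NFB:69'\"\"\"\"%=<>D)'::F-\"9ED8%"

print("bases changed at offsets:", changed_positions(seq_bfast, seq_srma))
for i in changed_positions(qual_bfast, qual_srma):
    # Show the quality score before and after realignment at each changed offset.
    print(i, phred33(qual_bfast)[i], "->", phred33(qual_srma)[i])
```

On these strings the sequences differ at a single base (offset 30, i.e. read position 31, which agrees with MD:Z:30A19), and the two quality characters that change go from `7` (Phred 22) to `"` (Phred 1).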

All the best, Russ, Liverpool


ADD COMMENT • link updated 12.6 years ago by Arun 2.4k • written 12.6 years ago by Russh ★ 1.2k



Their paper (http://genomebiology.com/2010/11/10/R99) describes the scoring as follows:

If the read base matches the start node base, then no penalty is added to the previous re-alignment score. Otherwise, a negative score based on the original base quality of the read is added to the previous re-alignment score to return the current re-alignment score. Other alignment scoring schemes are possible, but mismatched bases are scored using base quality since it has been shown to improve alignment quality

So maybe they penalize bases that have a mismatch while computing the realignment score, and then keep that penalized quality because the base itself has also changed? Just speculating.
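The scoring rule quoted from the paper could be sketched roughly like this. This is only an illustration, not SRMA's code: the function name `realignment_score` is hypothetical, and the penalty is assumed to be exactly minus the Phred base quality, whereas the paper only says the negative score is "based on" the original base quality.

```python
def realignment_score(read, path, quals):
    """Score a read against one candidate path of bases.

    Matching bases add nothing to the score. Each mismatch adds a negative
    score derived from the read's original base quality (assumed here to be
    minus the Phred score).
    """
    if not (len(read) == len(path) == len(quals)):
        raise ValueError("read, path and quals must have the same length")
    score = 0
    for base, node_base, qual in zip(read, path, quals):
        if base != node_base:
            score -= qual  # assumed penalty: -Phred quality
    return score


# Toy example: one mismatch at the third base, which has quality 20.
print(realignment_score("ACGT", "ACTT", [30, 30, 20, 30]))
```

Under this assumption, a mismatch at a low-quality base costs little, so a path that disagrees with the read only at unreliable bases can still score well.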

ADD REPLY • link 12.6 years ago by Arun 2.4k

score 1 · Answer 1 · 2012-05-16

In short, this cannot be calculated unless the software provides a tool to compute it. The value depends on many factors, and it is important to know which factors the software you use considers and how it implements them. At the very least, you would need to know how the software assigns MAPQ before you could write a script to emulate it.

Just to give some perspective, here is an excerpt from the MAPQ page (admittedly about Illumina sequencing, but I suppose similar principles apply). The calculation of mapping qualities is simple, but this simple calculation considers all the factors below:

1. The repeat structure of the reference. Reads falling in repetitive regions usually get very low mapping quality.
2. The base quality of the read. Low quality means the observed read sequence is possibly wrong, and wrong sequence may lead to a wrong alignment.
3. The sensitivity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensitivity, which also causes mapping errors.
4. Paired end or not. Reads mapped in pairs are more likely to be correct.

When a read alignment gets a mapping quality of 30, it usually implies:

1. The overall base quality of the read is good.
2. The best alignment has few mismatches.
3. The read has few or just one 'good' hit on the reference, which means the current alignment is still the best even if one or two bases are actually mutations or sequencing errors.
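For reference, MAPQ is a Phred-scaled value: in the SAM format it is defined as -10 log10 of the probability that the mapping position is wrong, and 255 means the mapping quality is not available (which is the value both records in the question carry). A small Python sketch of the conversion follows. The function names are made up for illustration, and this says nothing about how BFAST or SRMA arrive at their values.

```python
import math


def mapq_to_error_prob(mapq):
    """Convert a Phred-scaled MAPQ into the probability that the mapping is wrong.

    Returns None for 255, which SAM reserves for "not available".
    """
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p_wrong):
    """Convert an error probability into a rounded Phred-scaled MAPQ."""
    return round(-10 * math.log10(p_wrong))


print(mapq_to_error_prob(30))
print(error_prob_to_mapq(0.001))
print(mapq_to_error_prob(255))
```

So a mapping quality of 30 corresponds to roughly a 1 in 1000 chance that the read is placed incorrectly.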