Hi Hao,
Before we talk about the average mismatch quality sum, it would be good to define mismatch quality sum (let's call this mmQS for brevity). This property is associated with each read, not each position, and it is calculated almost exactly how you would expect. I would have guessed that it simply adds up the qualities of all mismatching bases in a given read, and this guess is very close to the truth. The only thing that is special is that adjacent mismatches are combined into a single event, and we define the quality for this combined event to be the minimum quality over the run of mismatching bases.
In your example, read1 has 3 mismatches each with quality two, so the possible values for mmQS are:
- 2 (if all mismatches are adjacent)
- 4 (if 2 mismatches are adjacent)
- 6 (if no mismatches are adjacent)
For read2, we can get mmQS=10 or 20 by similar logic. Read3 of course has mmQS=0 since it has no mismatches.
The "average mismatch quality sum" values in the output are specific not only to a particular position, but also to a particular nucleotide. For a given position P and nucleotide B, the average mmQS is the simple arithmetic mean of the mmQS values of all reads that reported base B at position P.
As you can see, the value in question can vary a bit depending on where the mismatches are and which reads report which bases at a particular position. Since your example is small, I've enumerated the possibilities that can happen for a fixed position and nucleotide combination (P,B) while making no assumption about the adjacency of mismatches below. The possibilities are of the form:
{set of reads reporting base B at position P} => {set of possible values for average mismatch quality sum}
.
- the empty set (not very interesting)
- {read1} => {2/1, 4/1, 6/1}
- {read2} => {10/1, 20/1}
- {read3} => {0} # since read 3 has no mismatches
- {read1, read2} => {12/2, 14/2, 16/2, 22/2, 24/2, 26/2} # { (x+y)/2 | x \in {2,4,6}, y \in {10, 20} }
- {read1, read3} => {2/2, 4/2, 6/2}
- {read2, read3} => {10/2, 20/2}
- {read1, read2, read3} => {12/3, 14/3, 16/3, 22/3, 24/3, 26/3}
In case you would like to check your understanding, a concrete example using imaginary 5bp reads similar to the three you suggested follows. The mismatching bases are assumed to have the same qualities as you specified (2 for read1, 10 for read2), and are given in lowercase to make them easy to spot. Assume that these reads all start at position 1 on some sequence.
ref: ACGTA
read1: AgcTt <- gc are combined into a single event; mmQS = min(2, 2) + 2 = 4
read2: tCGTt <- mismatches are not adjacent, mmQS = 10 + 10 = 20
read3: ACGTA <- mmQS = 0
Resulting average mmQS values:
POS BASE Average.mmQS
1 A 2
1 T 20
2 C 10
2 G 4
3 C 4
3 G 10
4 T 8
5 A 0
5 T 12
Hope this helps.
HI , is there some paper to explain that ?