As best I can tell, this is a new feature arising from bwa mem's ability to generate chimeric alignments. A chimeric alignment is one where a single read aligns in parts to multiple positions in the reference genome, for example the first half of the read to somewhere on chr1 and the second half to somewhere on chr2. Note that this is different from a multi-mapping read, where the entire read may be mapped to multiple places.
To handle the split-read case, bwa mem will generate a separate SAM record (a line in the SAM file) for each aligning segment of a read. So if, for example, the first read of a pair gets split into two mapping segments, you could have three lines in the SAM file from that read pair (say, two from the first read and one from the second). I believe it is possible for this to happen and still have all records marked as properly paired (flag 0x2 set in all SAM records), provided the orientation and insert size constraints are fulfilled. If this happens, you could get the odd numbers you observe.
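To make the arithmetic concrete, here is a minimal Python sketch of the situation described above. The three FLAG values are hypothetical, chosen only to represent one read pair whose first read was split into two segments; they are not taken from any real bwa mem output.

```python
# Flag bit from the SAM specification.
PROPER_PAIR = 0x2

# Hypothetical FLAG values for one read pair (illustration only):
#   99   = paired + proper pair + mate reverse + first in pair
#   2147 = 99 + 0x800 (second segment of the same first read)
#   147  = paired + proper pair + read reverse + second in pair
records = [
    ("read1, segment A", 99),
    ("read1, segment B", 2147),
    ("read2", 147),
]

def count_flag(records, bit):
    """Count records that have the given flag bit set."""
    return sum(1 for _, flag in records if flag & bit)

properly_paired = count_flag(records, PROPER_PAIR)
print(properly_paired)  # 3: an odd count from a single read pair
```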
The SAM spec has evolved to include a new flag, 0x800, that denotes the supplementary records (all but the first, which I think is chosen arbitrarily) in a multi-part (chimeric) alignment. I predict that if you first remove records with the 0x800 flag set and then run flagstat, you will get an even number for the properly-paired count.
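As a rough sketch of that filtering step, the Python below drops every record with 0x800 set from a list of SAM lines and then recounts the properly-paired records. The function name strip_supplementary and the sample records are made up for illustration; the lines are a minimal hypothetical example, not real aligner output.

```python
SUPPLEMENTARY = 0x800
PROPER_PAIR = 0x2

def strip_supplementary(sam_lines):
    """Yield header lines and all records without the 0x800 bit set."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not flag & SUPPLEMENTARY:
            yield line

# Hypothetical records: one pair, first read split across chr1 and chr2.
sam = [
    "@HD\tVN:1.4",
    "r1\t99\tchr1\t100\t60\t50M50S\t=\t300\t300\t*\t*",
    "r1\t2147\tchr2\t500\t60\t50H50M\tchr1\t300\t0\t*\t*",
    "r1\t147\tchr1\t300\t60\t100M\t=\t100\t-300\t*\t*",
]

kept = list(strip_supplementary(sam))
proper = sum(
    1 for line in kept
    if not line.startswith("@") and int(line.split("\t")[1]) & PROPER_PAIR
)
print(proper)  # 2: even again once the supplementary record is gone
```

On real data the same filter is usually applied with samtools itself, for example samtools view -F 2048, which excludes records that have any of the given flag bits set.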
A note for completeness: flagstat just does very simple counts of how many SAM records have various flag bits set. The flag values depend completely on what the originating aligner decided to do. Records counted as "properly paired" are just those with flag 0x2 set, and records counted as "with itself and mate mapped" are just those where neither flag 0x4 nor 0x8 is set. Furthermore, I believe samtools currently ignores the new 0x800 flag.
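The counting described above can be sketched in a few lines of Python. This follows the description in this post, not the samtools source: it assumes every record is from a paired read, and it makes no special case for 0x800. The function name and the input FLAG values are hypothetical.

```python
from collections import Counter

PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def simple_flag_counts(flags):
    """Tally records per category by testing flag bits, nothing more."""
    counts = Counter()
    for flag in flags:
        counts["total"] += 1
        if flag & PROPER_PAIR:
            counts["properly paired"] += 1
        # Counted only when neither 0x4 nor 0x8 is set.
        if not flag & (UNMAPPED | MATE_UNMAPPED):
            counts["with itself and mate mapped"] += 1
    return counts

# Hypothetical FLAG values: the split pair (99, 2147, 147), plus a pair
# in which one read is unmapped (73 = mate unmapped, 133 = read unmapped).
counts = simple_flag_counts([99, 2147, 147, 73, 133])
for name, n in counts.items():
    print(name, n)
```

Because the supplementary record (2147) is tallied like any other record, it contributes to both the properly-paired count and the mapped-pair count, which is how an odd total can arise.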
By the way, how can I check whether samtools ignores the new 0x800 flag? I am using samtools 0.1.19-44428cd. Thanks!
That is because of chimeric alignment, as Heng replied. Thank you for your detailed explanation.