I have a BAM file with with multiple alignments per-read (NH:i:2) of cDNA sequence. It seems like there is a bug in how "samtools sort" and "samtools fixmate" work together in this case …or perhaps I am using these incorrectly?
I see the same thing with the equivalent Picard tools.
Here are my four rows from "samtools view", representing one read pair with two hits per read.
- HWI-ST495_132993521:3:1101:1143:51868 385 16 72090091 3 52M286N48M = 72090091 0 AGCT qqqq RG:Z:2890686892 PG:Z:2890686892 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU XS:A:+ NH:i:2 CC:Z:= CP:i:72090091 HI:i:0
- HWI-ST495_132993521:3:1101:1143:51868 129 16 72090091 3 52M2010N48M = 72090091 0 AGCT qqqq RG:Z:2890686892 PG:Z:2890686892 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU XS:A:+ NH:i:2 HI:i:1
- HWI-ST495_132993521:3:1101:1143:51868 369 16 72094060 3 100M = 72094060 0 RG:Z:2890686892 AGCT qqqq PG:Z:2890686892 AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:96G3 YT:Z:UU XS:A:+ NH:i:2 CC:Z:= CP:i:72094060 HI:i:0
- HWI-ST495_132993521:3:1101:1143:51868 113 16 72094060 3 100M = 72094060 0 RG:Z:2890686892 AGCT qqqq PG:Z:2890686892 AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:96G3 YT:Z:UU XS:A:+ NH:i:2 HI:i:1
(The sequence and quality in columns swapped for "AGCT" and "qqqq" because the data is protected, and for readability):
Effectively:
- read 2, hit-index 0
- read 2, hit-index 1
- read 1, hit-index 0
- read 1, hit-index 1
Fixmate has set columns 7 and 8 to the position of the mate for each alignment. It has rows 1 & 2 as a pair, and 3 & 4 as a pair, instead of 1&3 and 2&4.
Looking at the sort code, when name sorting, it sorts by name, then position. So if the top two alignments are different cigar strings at the same position, pairs will not be adjacent.
Looking at the fixmate code, it seems to presume pairs will be adjacent in a "name-sorted" BAM.
It seems that for sort and fixmate to work together properly, either the name-sort option must do further work to ensure pairs are adjacent when NH > 1, such as sort by name then HI (hit index), then position instead of name-then-position. Or that fixmate must be a little smarter and have a window of more than two alignments.
Have others encountered this? Am I missing something standard done when using fixmate with a BAM with multiple alignments?
Thanks in advance! Scott
I undeleted this post. Did you delete it because you answered your own question? Unless you found it was a duplicate it seems like a good one to keep.
I realized that, when emitting multiple hits per read, the secondary alignments were not necessarily paired at all. Some of the premise in that question is flawed.
It would have been better to annotate it as such though. And given Alec's response below it was not as flawed as I originally thought. Thanks for undeleting it...