I just discovered that tophat2 sometimes reports the same alignment for the same read multiple times.
I have some paired-end RNA-Seq data that I aligned using tophat version 2b using options -I 2500 -i 30 -r 150.
I'm working on building data files for testing and needed some examples of read pairs where either Read1 or Read2 maps onto the genome multiple times.
I found some pairs where Read2 had two different alignments.
In each case, the corresponding Read1 was reported twice, in two different lines in the BAM file. But both alignments were in exactly the same location.
This image from IGB shows an example: http://transvar.org/~aloraine/MultiMapperPE.png. In the image, reads from the same pair have the same names and reads are color-coded by strand. Each read is labeled with its name. The selected read (outlined in red) has two identical alignments in the data. This second image shows the Selection Info with the various tags and other attributes of the selected reads: http://transvar.org/~aloraine/MultiMapperPE-SelectionInfo.png
I would have thought that if one member of the pair aligned to just one location, then its alignment would be reported just once in the BAM file, not twice.
Why is tophat reporting the same alignment for the same read multiple times? Is there an option that will force tophat to not report redundant alignments?
For what it's worth, tophat isn't violating the SAM spec there, provided at least that it marks one of the pairs as a secondary alignment (I've never had it output them). Since pairs represent fragments, it's more intuitive for alternate alignments to include both mates (yes, each alignment already records at least the 5'-most coordinate of its mate, so this can be somewhat redundant).
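To check whether the repeated lines really follow this pattern, here is a rough Python sketch that groups SAM records by read name, mate number, and alignment location, and reports any alignment that appears more than once together with its flags. The function name `repeated_alignments` and the four toy records are made up for illustration (they are not taken from the data in the question); only the SAM column order and the flag bits 0x40 and 0x100 come from the SAM spec.

```python
from collections import defaultdict

FIRST_IN_PAIR = 0x40   # SAM flag bit: this read is Read1
SECONDARY = 0x100      # SAM flag bit: "not primary alignment" (256)

def repeated_alignments(sam_lines):
    """Map (read name, mate, ref, pos, CIGAR) -> flags, for alignments seen more than once."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        mate = 1 if flag & FIRST_IN_PAIR else 2
        groups[(qname, mate, rname, pos, cigar)].append(flag)
    return {key: flags for key, flags in groups.items() if len(flags) > 1}

# Hypothetical records: Read1 maps once, Read2 maps to two places,
# so Read1 is written twice (once per pairing).
toy_records = [
    "pairA\t99\tchr1\t100\t3\t50M\t=\t300\t250\t*\t*",
    "pairA\t355\tchr1\t100\t3\t50M\t=\t900\t850\t*\t*",
    "pairA\t147\tchr1\t300\t3\t50M\t=\t100\t-250\t*\t*",
    "pairA\t403\tchr1\t900\t3\t50M\t=\t100\t-850\t*\t*",
]

for key, flags in repeated_alignments(toy_records).items():
    marked = [bool(f & SECONDARY) for f in flags]
    print(key, flags, "secondary:", marked)
```

If every repeated alignment has exactly one copy without the 256 bit and the rest with it, the duplicates are just the other half of each alternate pairing rather than redundant output.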
As Devon Ryan points out, the flag column is essential here. Is the "not primary alignment" (256) bit set?
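If the viewer doesn't show the flag, it can be decoded by hand from the second column of the SAM record. A minimal sketch in plain Python, assuming you already have the integer FLAG value; the bit values are the ones defined in the SAM spec, while the short labels and the helper names `decode_flag` and `is_secondary` are just illustrative:

```python
# SAM FLAG bits as defined in the SAM specification; labels are informal.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "not_primary_alignment"),
    (0x200, "fails_quality_checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary_alignment"),
]

def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]

def is_secondary(flag):
    """True if the 'not primary alignment' (256) bit is set."""
    return bool(flag & 0x100)

# Example values chosen for illustration: 83 is a primary Read1 alignment
# on the reverse strand, 339 is the same alignment with the 256 bit added.
for flag in (83, 339):
    print(flag, is_secondary(flag), decode_flag(flag))
```

With samtools, `samtools view -F 256` skips records that have this bit set, which is one way to drop the secondary copies after alignment.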
Thanks - yes. Looks like IGB isn't displaying the bit flag in the Selection Info. So probably we need to add that to the display and also decode it for users.