I recently mapped reads with TopHat against the genome, supplying a GTF file of junctions. I did not restrict the mapping to uniquely mapping reads, so multi-mapping reads are allowed.
The BAM file produced by TopHat contains the same read ID multiple times, which is expected. However, the read does not carry the same sequence at every location. The following read maps to 12 places, and here are four of those alignments:
HWI-ST333:3:1215:13855:84627#ATCACG 272 chr1 21047875 0 28M * 0 0 AGAGATTTATACGATCTGAAGAGACACC e^bhggfgfff^fdcfggfgccaSJ^Z^ AS:i:-12 XM:i:2 XO:i:0 XG:i:0 MD:Z:1A22A3 NM:i:2 NH:i:12 CC:Z:= CP:i:44000382 HI:i:0
HWI-ST333:3:1215:13855:84627#ATCACG 272 chr1 44000382 0 28M * 0 0 AGAGATTTATACGATCTGAAGAGACACC e^bhggfgfff^fdcfggfgccaSJ^Z^ AS:i:-12 XM:i:2 XO:i:0 XG:i:0 MD:Z:1A22A3 NM:i:2 NH:i:12 CC:Z:= CP:i:173433260 HI:i:1
HWI-ST333:3:1215:13855:84627#ATCACG 272 chr1 173433260 0 28M * 0 0 AGAGATTTATACGATCTGAAGAGACACC e^bhggfgfff^fdcfggfgccaSJ^Z^ AS:i:-12 XM:i:2 XO:i:0 XG:i:0 MD:Z:1A22A3 NM:i:2 NH:i:12 CC:Z:chr10 CP:i:83790182 HI:i:2
HWI-ST333:3:1215:13855:84627#ATCACG 256 chr10 83790182 0 28M * 0 0 GGTGTCTCTTCAGATCGTATAAATCTCT ^Z^JSaccgfggfcdf^fffgfgghb^e AS:i:-12 XM:i:2 XO:i:0 XG:i:0 MD:Z:3T22T1 NM:i:2 NH:i:12 CC:Z:chr13 CP:i:23895518 HI:i:3
The first three occurrences of the read have the same sequence, but the fourth has a different SEQ field. According to the SAM format, the SEQ field of a multi-mapping read can be set to '*' after the first alignment is listed, to save space, but I cannot see how the very same read can appear with different sequences, as happens here. Is this a violation of the BAM/SAM format? Is it a TopHat error? Thanks.
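One thing that may be worth checking here is the FLAG column. In the SAM specification, bit 0x10 means the read aligned to the reverse strand and SEQ is stored reverse-complemented, and bit 0x100 marks a secondary alignment. The first three records have FLAG 272 (256 + 16) and the fourth has FLAG 256. The following is only a minimal sketch using the two distinct SEQ values listed above; the helper names `reverse_complement` and `original_read` are made up for illustration and do not come from TopHat or any library.

```python
# Flag bits as defined in the SAM specification.
FLAG_REVERSE = 0x10     # SEQ is stored reverse-complemented
FLAG_SECONDARY = 0x100  # secondary alignment

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def original_read(flag, seq):
    """Undo the reverse-complementing implied by FLAG bit 0x10 (hypothetical helper)."""
    return reverse_complement(seq) if flag & FLAG_REVERSE else seq


# (FLAG, SEQ) pairs copied from the alignments listed above.
records = [
    (272, "AGAGATTTATACGATCTGAAGAGACACC"),  # chr1 alignments
    (256, "GGTGTCTCTTCAGATCGTATAAATCTCT"),  # chr10 alignment
]

for flag, seq in records:
    strand = "-" if flag & FLAG_REVERSE else "+"
    print(flag, strand, original_read(flag, seq))
```

If both lines print the same sequence, the differing SEQ fields would be explained by strand orientation alone rather than by a format violation.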
If your BAMs are sorted, what is the best way to get the alignments of a multi-mapping read ordered by their quality (as judged by TopHat)?
You should ask the above as a new question. Adding it as a comment to an answer will not help with getting it answered.