Tophat Produces Non-Standard Bam With Same Read Appearing With Distinct Sequences?
12.2 years ago
user ▴ 950

I recently mapped reads with Tophat against the genome, supplying a GTF file of junctions. I did not restrict the output to uniquely mapping reads, so multi-mapping reads are allowed.

The BAM file produced by Tophat contains the same read ID multiple times, which is expected. However, the read does not carry the same sequence in all of those alignments. The following read maps to 12 places, and here are four of those alignments:

HWI-ST333:3:1215:13855:84627#ATCACG    272    chr1    21047875    0    28M    *    0    0    AGAGATTTATACGATCTGAAGAGACACC    e^bhggfgfff^fdcfggfgccaSJ^Z^    AS:i:-12    XM:i:2    XO:i:0    XG:i:0    MD:Z:1A22A3    NM:i:2    NH:i:12    CC:Z:=    CP:i:44000382    HI:i:0
HWI-ST333:3:1215:13855:84627#ATCACG    272    chr1    44000382    0    28M    *    0    0    AGAGATTTATACGATCTGAAGAGACACC    e^bhggfgfff^fdcfggfgccaSJ^Z^    AS:i:-12    XM:i:2    XO:i:0    XG:i:0    MD:Z:1A22A3    NM:i:2    NH:i:12    CC:Z:=    CP:i:173433260    HI:i:1
HWI-ST333:3:1215:13855:84627#ATCACG    272    chr1    173433260    0    28M    *    0    0    AGAGATTTATACGATCTGAAGAGACACC    e^bhggfgfff^fdcfggfgccaSJ^Z^    AS:i:-12    XM:i:2    XO:i:0    XG:i:0    MD:Z:1A22A3    NM:i:2    NH:i:12    CC:Z:chr10    CP:i:83790182    HI:i:2
HWI-ST333:3:1215:13855:84627#ATCACG    256    chr10    83790182    0    28M    *    0    0    GGTGTCTCTTCAGATCGTATAAATCTCT    ^Z^JSaccgfggfcdf^fffgfgghb^e    AS:i:-12    XM:i:2    XO:i:0    XG:i:0    MD:Z:3T22T1    NM:i:2    NH:i:12    CC:Z:chr13    CP:i:23895518    HI:i:3

The first three occurrences of the read have the same sequence, but the fourth has a different SEQ field. According to the SAM format, SEQ may be set to '*' for alignments after the first one to save space, but I cannot see how the very same read can appear with different sequences, as happens here. Is this a violation of the BAM/SAM format? Is it a Tophat error? Thanks.

sam tophat bowtie rna-seq mapping • 3.0k views
12.2 years ago

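The first three hits are on the reverse strand (FLAG 272 = 256 + 16, where 16 marks a reverse-strand alignment), whereas the last is on the forward strand (FLAG 256). SAM always reports SEQ relative to the forward strand of the reference, so for a reverse-strand alignment the read is stored reverse complemented and its QUAL string is reversed. The fourth SEQ is exactly the reverse complement of the first three, so all four records describe the same read. This is valid SAM/BAM, not a Tophat error.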
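To check this on the records from the question, here is a minimal Python sketch that undoes the strand orientation for each alignment and compares the results. It uses only the FLAG, SEQ and QUAL fields copied from the post. The helper names (reverse_complement, original_read) are made up for illustration and are not part of Tophat or samtools.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

FLAG_REVERSE = 0x10     # 16: alignment is on the reverse strand
FLAG_SECONDARY = 0x100  # 256: secondary alignment


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def original_read(flag, seq, qual):
    """Return (SEQ, QUAL) in a strand-independent orientation.

    Reverse-strand records are flipped back, so records of the
    same read become directly comparable.
    """
    if flag & FLAG_REVERSE:
        return reverse_complement(seq), qual[::-1]
    return seq, qual


# (FLAG, SEQ, QUAL) for the first and fourth alignments in the question.
records = [
    (272, "AGAGATTTATACGATCTGAAGAGACACC", "e^bhggfgfff^fdcfggfgccaSJ^Z^"),
    (256, "GGTGTCTCTTCAGATCGTATAAATCTCT", "^Z^JSaccgfggfcdf^fffgfgghb^e"),
]

# Collect the distinct strand-normalised reads.
originals = {original_read(flag, seq, qual) for flag, seq, qual in records}
print(len(originals))
```

A single entry in originals means both records carry the same underlying read. Only the orientation in which SEQ and QUAL are reported differs.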


If your BAMs are sorted, what is the best way to order the alignments of a multi-mapping read by their quality (as judged by Tophat)?


You should ask this as a new question. Posting it as a comment on an answer will not help it get answered.
