CIGAR And MD String Do Not Match In Bowtie 0.12.8
12.2 years ago
jeremy ▴ 80

I used RSEM, which uses bowtie by default, to process RNA-seq data. I found many alignment records whose CIGAR and MD strings are inconsistent. For example:

FCC121WACXX:4:1301:11541:99400#TCTTATAT   83   NM_004055   4307   100   87M3I   =   4202   -195   TAGGCTTCCCTCTTCTCAGGATCCACCACAGGGTTAGGGGACAGGAAGCCTGTTCTATTCTCAATAAATCTTACAAAATTCCAAAAAGAC   BBBBBBB_]``^HZVHZcb\VG``Vc_V_UWWQ_c^^^OX_ddcccc_c`c^ed^Id_daSbecec`_d`bdd[QQJQJR^caca^c\^^   XA:i:2   MD:Z:87A1A0   NM:i:2   ZW:f:1

The MD string says 87A1A0, which covers 90 reference bases (87 matches, a mismatch, 1 match, a mismatch, 0 matches) and so should correspond to a CIGAR string of "90M". But bowtie gives 87M3I, which says the read has a 3 bp insertion relative to the reference, and that is wrong. Has anyone encountered this problem? How can I generate a correct CIGAR string? Thanks.
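To make the mismatch concrete, here is a minimal Python sketch that counts how many reference bases each string accounts for. The helper names `cigar_ref_len` and `md_ref_len` are made up for illustration, and the parsing assumes the usual SAM conventions (M, D, N, = and X consume reference bases; in MD, numbers are matches, single letters are mismatches, and `^` introduces deleted reference bases).

```python
import re

def cigar_ref_len(cigar):
    """Reference bases consumed by a CIGAR string (operations M, D, N, =, X)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")

def md_ref_len(md):
    """Reference bases described by an MD tag value."""
    total = 0
    # Tokens: a run of matches, a deletion (^ plus bases), or a single mismatch.
    for num, deletion, mismatch in re.findall(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])", md):
        if num:
            total += int(num)
        elif deletion:
            total += len(deletion)
        else:
            total += 1
    return total

# Values taken from the record above.
print(cigar_ref_len("87M3I"), md_ref_len("87A1A0"))
```

For the record above the two counts differ (87 versus 90), which is the inconsistency in question; "90M" would give 90 for both.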

12.2 years ago

To be honest, each of your CIGAR strings seems a bit strange.

The version of bowtie that you are using does not have the capability to align with insertions/deletions; only mismatches are supported. So it seems somewhat surprising that it lists insertions in the CIGAR string, especially at the end of the read, where listing mismatches would be more appropriate.

I believe that the main CIGAR string contains the initial fast alignment performed to choose this location as a good hit, whereas the CIGAR listed in the MD string is generated via an optimal Smith–Waterman alignment. So that is the reason for the discrepancy.

The SAM spec says that the two CIGAR strings ought to match, but I don't know what that really means. It seems a softer requirement than "must match".

Thanks. Is there any tool to generate a correct CIGAR string?

You already have two correct CIGAR strings ;-) so why do you need a third? The first tells you why bowtie picked this position rather than other possible positions. The second tells you what was actually found there once it looked more closely. The only question is which one you want to use.

Also remember that the process is a heuristic: the vast majority of the time it works very well, with occasional misses. Neither of your CIGAR strings is guaranteed to be correct in the sense of being the best possible alignment of the read. The only issue to decide is whether this problem is common or rare in your data. If it is common, you will need to use a different aligner.
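One way to judge how common the problem is: a rough Python sketch that walks over SAM lines and counts records whose CIGAR and MD tag account for different numbers of reference bases. The function names are made up for this example, and it assumes plain tab-separated SAM text with the MD tag among the optional fields; records without an MD tag or with a `*` CIGAR are skipped.

```python
import re

def ref_len_from_cigar(cigar):
    """Reference bases consumed by a CIGAR string (M, D, N, =, X)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")

def ref_len_from_md(md):
    """Reference bases described by an MD value (matches, mismatches, ^deletions)."""
    total = 0
    for num, deletion, mismatch in re.findall(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])", md):
        total += int(num) if num else (len(deletion) if deletion else 1)
    return total

def count_inconsistent(sam_lines):
    """Return (records_checked, records_where_cigar_and_md_disagree)."""
    checked = bad = 0
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[5] == "*":
            continue                      # malformed or unaligned record
        md = next((f[5:] for f in fields[11:] if f.startswith("MD:Z:")), None)
        if md is None:
            continue
        checked += 1
        if ref_len_from_cigar(fields[5]) != ref_len_from_md(md):
            bad += 1
    return checked, bad
```

Something like `with open("aligned.sam") as fh: print(count_inconsistent(fh))` would then report both counts; the file name here is only a placeholder.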

6.9 years ago
Rubus Pi • 0

"Note that insertions, since they don't represent a loss of information about the reference, are not stored in MD flag. This has some interesting consequences."

https://github.com/vsbuffalo/devnotes/wiki/The-MD-Tag-in-BAM-Files
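As a small illustration of that point, the Python sketch below tallies read bases and reference bases separately for a CIGAR string. The name `cigar_lengths` is made up here, and the two operation sets follow my reading of the SAM specification. Because insertions only consume read bases, an MD tag that agrees with a CIGAR should add up to the reference count, not the read length.

```python
import re

# Which CIGAR operations consume read bases and which consume reference bases.
CONSUMES_READ = set("MIS=X")
CONSUMES_REF = set("MDN=X")

def cigar_lengths(cigar):
    """Return (read_bases, reference_bases) accounted for by a CIGAR string."""
    read_len = ref_len = 0
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in CONSUMES_READ:
            read_len += int(n)
        if op in CONSUMES_REF:
            ref_len += int(n)
    return read_len, ref_len

print(cigar_lengths("87M3I"))  # CIGAR from the question
print(cigar_lengths("90M"))    # CIGAR the asker expected
```

So an MD value consistent with 87M3I would have to describe 87 reference bases, whereas 87A1A0 describes 90.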
