I used RSEM to process RNAseq data, which uses bowtie by default. I found many alignment records with inconsistent CIGAR and MD string. For example:
FCC121WACXX:4:1301:11541:99400#TCTTATAT 83 NM_004055 4307 100 87M3I = 4202 -195 TAGGCTTCCCTCTTCTCAGGATCCACCACAGGGTTAGGGGACAGGAAGCCTGTTCTATTCTCAATAAATCTTACAAAATTCCAAAAAGAC BBBBBBB_]``^HZVHZcb\VG``Vc_V_UWWQ_c^^^OX_ddcccc_c`c^ed^Id_daSbecec`_d`bdd[QQJQJR^caca^c\^^ XA:i:2 MD:Z:87A1A0 NM:i:2 ZW:f:1
The MD string says: 87A1A0
, which should correspond to a CIGAR string with "90M
". But bowtie gives: 87M3I
. It says there is a 3 bps insert in the reference, which is wrong. Anyone encounter this problem? How can you generate a correct CIGAR string? Thanks.
Thanks. Is there any tool to generate a correct CIGAR string?
You already have two correct CIGAR strings ;-) , why do you need a third? The first tells you why bowtie picked this position rather than other possible positions. The second tells you what actually was found there once it looked more closely. The only question is which one do you want to use.
Also remember that the process is a heuristic, the vast majority of times it works very well, with occasional misses. But that neither of your CIGAR strings is guaranteed to be the correct in the terms of being the best possible alignment of the read. The only issue to decide whether this problem is common in your data or rare. If it is common you will need to use a different aligner.