I have a SAM file with alignments and for each entry alignment, I want to reconstruct the alignment between the reference and the read based on the CIGAR and MD strings. It seems like this should be possible, but this one example bothers me:
SRR037452.3355 0 ENSG00000266658|ENST00000607521 3523 255 16M1I18M * 0 0 CGGGCCGGTCCCCCCCCGCCGGGTCCGCCCCCGGC IIIIIIIIIIIIIII:III/IIII=+IGC,I"/I. NH:i:1 HI:i:1 NM:i:2 MD:Z:33G0
Here, the CIGAR string has an insertion to the reference which messes up the MD string indexing. According to MD string, the read should be only 34 bases long (r=35). My guess is that the alignment is actually this: 16 matches, 1 base that is present in the read, not present in the reference, 17 matches, one mismatch where the read has a "G" and the reference has something else. Is that correct? Are there CIGAR/MD string combinations where reconstructing the alignment would be impossible (i.e. either of the strings is ambiguous)?
For this alignment edit distance is 2. There is a "I" character in CIGAR string and "G" in the end of the MD string. This means there is an insertion in the middle as well a mismatch in the end of the alignment. The insertion showed in the CIGAR string should also be shown in the MD string . So the correct MD string should be something like: 16^C17G0. Which aligner did you use ? It may be possible that some aligner really care about the MD string as they are not widely used by the downstream tools (Just guessing).
According to the sam format specification, the MD tag specifies deletions with '^' but has no operator to specify insertions. Both the CIGAR and the MD tag are correct.
"The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string '10A5^AC6' means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string." -The sam format specification.
I never fully recognized this until today:
So neither the CIGAR nor the MD string is an unambiguous representation of the alignment, moreover one can only fully reconstruct the alignment by writing code of nontrivial complexity that needs to parse and combine information from the CIGAR string, the MD string and the sequence ... well that's just ridiculous.
"^" in MD specifies deletion in reference which is same as insertion in the aligned read. So here a 1 bp insertion in the read (non-reference) corresponds to "^C" in the MD string. PS: I haven't had my after lunch coffee yet and I may be mixing up things :-)
The wording is somewhat ambiguous, but I think you're backwards here. Here are some example reads aligned with BWA MEM.
With insertions:
With deletions:
You are correct. My answer was purely based on the documentation in the SAM specification format. This line is kind of confusing "the next 5 reference bases are matches followed by a 2bp deletion from the reference". Anyways, thanks for correcting me and taking out time to post those alignments as an example.
So could the CIGAR and MD string be used to get percent identity of a reference region from the sam/bam file?
Ashutosh, I used the STAR aligner (it's super fast).