My last question has led me to the assumption that the CIGAR string in SAM/BAM files is possibly not very well-defined. In short: you cannot calculate a string difference (e.g. Levenshtein distance) from a CIGAR string alone, and therefore the sequence similarity within the aligned region cannot be computed from it.
The reason for this is quite trivial: CIGAR doesn't differentiate between matches and mismatches.
According to the SAM format specification, the M character in a CIGAR string is defined as
M alignment match (can be a sequence match or mismatch)
so it refers to an aligned position, not to a match (identical base). For example, 10M could mean 10 matches, 9 matches + 1 mismatch, 8 matches + 2 mismatches, etc.
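To make the ambiguity concrete, here is a minimal sketch in Python. The reference and read sequences are made up for illustration, and `parse_cigar` and `hamming` are hypothetical helper names, not part of any SAM library. All three reads align to the reference without gaps, so under the spec's definition each of them would get the same CIGAR, 10M, although their mismatch counts differ.

```python
import re

# One CIGAR operation: a length followed by an operation character.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Made-up example data: three ungapped reads against one reference.
REF = "ACGTACGTAC"
READS = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTTCGTAC"]

for read in READS:
    # Every read here is a gapless alignment, hence CIGAR "10M".
    print(read, parse_cigar("10M"), "mismatches:", hamming(REF, read))
```

The CIGAR column is identical for all three reads, so the mismatch count has to come from somewhere else.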
In my humble opinion, this renders the CIGAR pretty much useless for representing an alignment. To address this, it seems that the MD tags have been introduced, but they just make the whole thing more complex and cumbersome.
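As a rough illustration of the extra step this requires, the sketch below counts mismatches from the value of an MD tag. It assumes the usual MD layout (runs of matching bases as numbers, a single reference base per mismatch, and `^` followed by reference bases for a deletion); `md_mismatches` is a hypothetical helper name, and the example values are made up.

```python
import re

def md_mismatches(md):
    """Count mismatched positions encoded in an MD tag value, e.g. '4A5'."""
    # Tokens are: a run of matches (digits), a deletion ('^' plus bases),
    # or a single reference base marking one mismatch.
    tokens = re.findall(r"\d+|\^[A-Z]+|[A-Z]", md)
    # Only the bare single letters are mismatches; '^AC' is not alphabetic.
    return sum(1 for token in tokens if token.isalpha())

for md in ["10", "4A5", "2G1A5", "3^AC4T2"]:
    print(md, md_mismatches(md))
```

Note that this only works when the aligner actually wrote the MD tag, which is optional.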
I don't know how this could have been overlooked in the design, or whether it was done on purpose to keep the string compact. Either way, I see this as a design flaw that should be corrected. Doing that in the definition is easy: let M denote matched positions only, while X (which is already in the definition) must be used to denote mismatches, such that 10M = 10 matches, 9M1X = 9 matches followed by 1 mismatch, 5M1X4M = 5 matches, 1 mismatch, 4 matches, and so on.
Are you with me on this?
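The proposed notation could be produced from an existing alignment roughly as follows. This is only a sketch under stated assumptions: `expand_m` is a hypothetical function, `ref` is assumed to be the reference sequence starting at the alignment position, and `read` is the full read sequence including soft-clipped bases. The `match_op` parameter defaults to M as proposed above; passing `"="` instead yields the spec's existing `=`/X operators.

```python
import re
from itertools import groupby

def expand_m(cigar, ref, read, match_op="M"):
    """Rewrite every M run as alternating match / mismatch (X) runs."""
    out = []
    r = q = 0  # current offsets into ref and read
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op == "M":
            # Compare base by base, then collapse equal neighbours into runs.
            same_flags = [ref[r + i] == read[q + i] for i in range(n)]
            for same, run in groupby(same_flags):
                out.append(f"{len(list(run))}{match_op if same else 'X'}")
            r += n
            q += n
        else:
            out.append(f"{n}{op}")
            if op in "=X":      # consume both reference and read
                r += n
                q += n
            elif op in "IS":    # consume read only
                q += n
            elif op in "DN":    # consume reference only
                r += n
            # H and P consume neither sequence
    return "".join(out)

REF = "ACGTACGTAC"
print(expand_m("10M", REF, "ACGTACGTAC"))
print(expand_m("10M", REF, "ACGTACGTAG"))
print(expand_m("10M", REF, "ACGTATGTAC"))
print(expand_m("10M", REF, "ACGTATGTAC", match_op="="))
```

With the made-up sequences above, the three reads reproduce the 10M, 9M1X and 5M1X4M cases from the proposal.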
If M stood for gender it would mean "non-hermaphrodite". You would have to look up male/female in the MD field.
This has certainly been a pet peeve of mine!
Hear hear. The fact that you need to look in two optional fields in order to recreate an alignment is really evil.
Yes, it is evil, maasha (though maybe not the root of all evil). I wanted to provoke a discussion :) Let's say it makes things more complex and error-prone, and these fields don't always exist. As you saw in my related question, it didn't work for me.
do it for science!
OK then, now the real work starts ;) This would of course have large implications for all software packages using this format.
Jeremy, can you explain this a bit?
@Michael I think that's just a joke on the definition of M meaning match or mismatch