Hello,
I am struggling to understand the MD optional tag of the SAM format. Let's say I have the following alignment:
ref ATGC-TTTCCGGC--CC
seq ACG-ATTT--GGATGCC
I understand the CIGAR for the seq entry should be (I believe): 3M1D1I3M2D3M2I2M
But what about the corresponding MD tag? Deletions in the query sequence are indicated with the caret ^, but what about insertions? And would consecutive insertions be treated independently, with a series of carets, so that the MD field would be: 1T1^C{unknown}3^C^C3{unknown}{unknown}2 ?
Thank you
Interesting script! Anyhow, I made a typo in the seq sequence: it should have ended with a double C (corrected), sorry. If I understand correctly, MD ignores the deletions since they do not affect the reference sequence, whereas insertions get the caret; from what you wrote, consecutive insertions are indicated by a single caret, right? Thanks
Mmm, not quite. The purpose of the MD tag is to let you reconstruct the reference sequence given the read sequence, the CIGAR string and, of course, the MD tag itself. Insertions relative to the reference are not part of the reference (by definition!), so they don't need to be encoded in the MD. The caret indicates a deletion from the reference, i.e. there are reference bases missing from the read sequence, so we put these missing bases in the MD. I think you've got your deletion/insertion terminology the other way round: "insertion" means bases that are present in the read but absent from the reference, and "deletion" means reference bases that are absent from the read. Hope it makes sense! Have a look also at this nice explanation: https://github.com/vsbuffalo/devnotes/wiki/The-MD-Tag-in-BAM-Files
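To make the rules above concrete, here is a minimal Python sketch that derives both the CIGAR and the MD string from two gapped alignment rows like the ones in the question. The function name `cigar_and_md` is made up for this illustration, and the zero-padding between adjacent events (e.g. `T0G`, `^C0T`) is my understanding of the usual convention, so treat it as an assumption rather than a reference implementation.

```python
def cigar_and_md(ref_aln, seq_aln):
    """Derive (CIGAR, MD) from two gapped alignment rows of equal length.

    Hypothetical helper for illustration; '-' marks a gap in either row.
    """
    if len(ref_aln) != len(seq_aln):
        raise ValueError("alignment rows must have the same length")

    cigar_ops = []       # run-length encoded operations, e.g. [["M", 3], ["D", 1]]
    md = []              # pieces of the MD string
    matches = 0          # matching bases seen since the last MD event
    in_deletion = False  # True while extending a '^...' run

    for r, s in zip(ref_aln, seq_aln):
        if r == "-" and s == "-":
            continue  # empty column, nothing to record
        op = "I" if r == "-" else "D" if s == "-" else "M"

        # Extend the current CIGAR run or start a new one.
        if cigar_ops and cigar_ops[-1][0] == op:
            cigar_ops[-1][1] += 1
        else:
            cigar_ops.append([op, 1])

        if op == "I":
            # Inserted bases are not in the reference, so MD ignores them.
            in_deletion = False
        elif op == "D":
            # Deleted reference bases are written after a single caret.
            if not in_deletion:
                md.append(str(matches))
                md.append("^")
                matches = 0
                in_deletion = True
            md.append(r)
        elif r.upper() == s.upper():
            matches += 1
            in_deletion = False
        else:
            # Mismatch: emit the match count, then the REFERENCE base.
            md.append(str(matches))
            md.append(r)
            matches = 0
            in_deletion = False

    md.append(str(matches))
    cigar = "".join(f"{n}{op}" for op, n in cigar_ops)
    return cigar, "".join(md)


cigar, md = cigar_and_md("ATGC-TTTCCGGC--CC", "ACG-ATTT--GGATGCC")
print(cigar)  # 3M1D1I3M2D3M2I2M
print(md)     # 1T1^C3^CC2C2
```

Note how the inserted A and TG leave no trace in the MD, while the two consecutive deleted Cs share one caret.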
It is very confusing. I have also read the link you sent me, but since it does not show the original sequences it is difficult for me to follow. Since ^ indicates a deletion in the reference, shouldn't the MD of this example have a ^T^G, since TG are present in seq but missing in the reference? (The first substitution I indicated can simply be reported as a mutation.) So the MD should be
1T1C3
for the first 7 bases. But afterwards? Should I simply ignore the following CC, which represents a deletion in seq? Then there is a 2C, then a ^T^G (or simply ^TG), and a final 2?
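One way to check a candidate MD is to run it "backwards": given the read, the CIGAR and the MD, you should get the reference back. Below is a minimal Python sketch of that check. The function name `reference_from_md` is hypothetical, and the MD value used in the example (1T1^C3^CC2C2) is a hand derivation for the alignment in the question, following the rule that inserted bases are skipped and deleted reference bases follow a caret, so please treat both as assumptions.

```python
import re


def reference_from_md(read, cigar, md):
    """Rebuild the aligned reference stretch from read + CIGAR + MD.

    Hypothetical helper for illustration. Regions skipped with N are not
    described by MD and are therefore not restored here.
    """
    # Step 1: keep only read bases that are aligned to the reference.
    # Inserted (I) and soft-clipped (S) bases have no reference counterpart.
    aligned = []
    pos = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=X":
            aligned.append(read[pos:pos + length])
            pos += length
        elif op in "IS":
            pos += length
        # D, N, H and P do not consume read bases
    aligned = "".join(aligned)

    # Step 2: walk the MD string over those aligned bases.
    ref = []
    i = 0
    for count, deleted, mismatch in re.findall(
        r"(\d+)|\^([A-Za-z]+)|([A-Za-z])", md
    ):
        if count:
            n = int(count)
            ref.append(aligned[i:i + n])  # matching bases: copy from the read
            i += n
        elif deleted:
            ref.append(deleted)           # deleted bases: taken from the MD
        else:
            ref.append(mismatch)          # mismatch: MD holds the reference base
            i += 1
    return "".join(ref)


read = "ACGATTTGGATGCC"  # seq from the question with the gaps removed
print(reference_from_md(read, "3M1D1I3M2D3M2I2M", "1T1^C3^CC2C2"))
# ATGCTTTCCGGCCC, i.e. ref from the question with the gaps removed
```

The inserted TG is dropped in step 1 via the CIGAR, which is why it does not need a ^TG entry in the MD.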