Question

SAM MD tag for consecutive insertions/deletions

1

Entering edit mode

6.8 years ago

marongiu.luigi ▴ 730

Hello,

I am struggling to understand the MD optional tag of the SAM format. Let's say I have the following alignment:

ref ATGC-TTTCCGGC--CC

seq ACG-ATTT--GGATGCC

I understrand the CIGAR for the seq entry should be (I believe): 3M1D1I3M2D3M2I2M but what about the relative MD tag? Deletions in the query sequence are indicated with the caret ^ but the deletions? and would consecutive insertions be treated independently with a series of carets? so that the MD field would be: 1T1^C{unknown}3^C^C3{unknown}{unknown}2?

Thank you

SNP genome alignment sequence • 5.0k views

ADD COMMENT • link 6.8 years ago by marongiu.luigi ▴ 730

score 0 · Answer 1 · 2018-03-07

0

Entering edit mode

6.8 years ago

dariober 15k

I think your cigar string should end with 1M, not 2M and the resulting MD tag should be 1T1^C3^CC2C1. I haven't done it by hand as I haven't figured out the details of the MD syntax myself. Instead, I used Java/htsjdk with the following code, provided that I haven't made any mistake:

@Test
public void test(){
    SAMRecord rec= new SAMRecord(null);
    rec.setCigarString("3M1D1I3M2D3M2I1M");
    rec.setReadBases("ACGATTTGGATGC".getBytes());
    rec.setAlignmentStart(1);
    byte[] ref= "ATGCTTTCCGGCCC".getBytes();
    SequenceUtil.calculateMdAndNmTags(rec, ref, true, true);
    System.err.println(rec.getSAMString());
}

ADD COMMENT • link 6.8 years ago by dariober 15k

0

Entering edit mode

interesting script! anyhow, I made a typo in the seq sequence, it should have ended with a double C (corrected), sorry. If I understand correctly, MD ignores the deletions since they do not affect the reference sequence, whereas insertions get the caret; from what you worte, consecutive insertions are indicated by a single caret, right? Tx

ADD REPLY • link 6.8 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

Mmm, not quite. The purpose of the MD tag is to reconstruct the reference sequence given the read sequence, the cigar string and, of course, the MD tag. Insertions in the reference are not part of the reference (by definition!) so they don't need to be encoded in the MD. The caret indicates that there is a deletion in the reference, i.e. there are reference bases missing in the read sequence, so we put these missing bases in the MD. I think you've got your deletion/insertion terminology the other way round. "Insertion" means insertion in the reference and "deletion" means deletions in the reference (not in the read). Hope it makes sense...! Have a look also at this nice explanation https://github.com/vsbuffalo/devnotes/wiki/The-MD-Tag-in-BAM-Files

ADD REPLY • link 6.8 years ago by dariober 15k

0

Entering edit mode

it is very confusing, i have also read the link you sent me but since there are not the original sequences it is difficult for me to follow. since ^ indicates a deletion in the reference, shouldn't the MD of this example have a ^T^G since TG are present in seq but missing in the reference? (the first substitution I indicated can simply be reported as a mutation). So the MD should be 1T1C3 for the first 7 bases. But afterwards? should i simply ignore the following CC that represent a deletion in seq? then there is a 2C then a ^T^G (or simply ^TG) and a final 2?

ADD REPLY • link 6.8 years ago by marongiu.luigi ▴ 730