How I do check whether a read contains substitutions or not. I'm not sure if I should check the flag, the cigar or the sequence.
How I do check whether a read contains substitutions or not. I'm not sure if I should check the flag, the cigar or the sequence.
Typically this is indicated in the NM
auxiliary tag. Since you have C++ in the tags, I'll note that you can retrieve this in htslib via uint8_t *v = bam_aux_get(b, "NM")
and then int64_t NM = bam_aux2i(v);
.
If you are looking exclusively for base substitution for alignments that use M CIGAR operators (as opposed to X and =), then you either have to a) check each base against the reference and calculate mismatches yourself, or b) use the NM tag and adjust for any indels in the alignment.
To add to this answer, A while back a wrote a program, ExpandCigar, that expands the M operator with 'X' or '='. After that, it's easy to extract reads where the CIGAR string contains X.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
That's great, what exactly do I have to check in the NM tag?
I gave you two C commands that use htslib in my original reply.
NM is defined as the edit distance to the reference. Insertions and deletions are included . Most caller do not included soft or hard clipped bases in their calculation (the specs do not define whether they should or not).
True, one would also need to parse the CIGAR string and subtract appropriately.