For an entry like this one from a Bowtie mapping result, how can I find out how many mismatches there are in the alignment, and where they are?
2358_2039_1969_F3 16 chrX 464352 255 50M * 0 0 ATATCTATATATATGAAAAGATTGCGAACAAAAAAGATGATGGAAAAGGA )?32OUY\`\ZPLU[_]^`]VX^^_[YOQccBBZYb`_bccccBBcccbA XA:i:2 MD:Z:4A45 NM:i:1 CM:i:5
Obviously I could compare the read sequence with the reference genome at chrX:464352, but that is a pain and costs a lot of computation time for millions of reads. Is there a better way?
Thanks.
A CIGAR M means 'match or mismatch', so you can't tell how many matches there are from '50M' alone.
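Additionally, it's probably good to look at the optional tags. NM:i:1 indicates there is one mismatch in base space, as does MD:Z:4A45, which indicates that the mismatch occurs at the 5th position of the alignment: 4 matching bases, then a position where the reference has an A, then 45 more matching bases. (I'm not sure whether that position is in color space or base space, as your reads must be in colorspace.)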
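To make the MD reading above concrete, here is a minimal Python sketch that walks an MD string and reports where the mismatches fall on the reference. The function name `md_mismatches` is made up for illustration, and it assumes the MD value follows the layout in the SAM specification (match-run lengths, single reference bases for mismatches, `^` followed by bases for deletions). It does not address the color-space question.

```python
import re

def md_mismatches(md, pos):
    """List (reference_position, reference_base) for each mismatch in an MD value.

    md  -- the MD tag value without the 'MD:Z:' prefix, e.g. '4A45'
    pos -- the 1-based leftmost mapping position (SAM POS field)
    """
    mismatches = []
    ref_pos = pos
    # Tokens: a run of matches, a deletion from the reference (^ACG), or one mismatched base.
    for run, deletion, base in re.findall(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])", md):
        if run:
            ref_pos += int(run)             # skip over matching bases
        elif deletion:
            ref_pos += len(deletion) - 1    # deleted reference bases, minus the '^'
        else:
            mismatches.append((ref_pos, base))
            ref_pos += 1
    return mismatches

# The read from the question: POS 464352, MD:Z:4A45
print(md_mismatches("4A45", 464352))  # [(464356, 'A')]
```

The returned coordinates are on the reference, so the read itself never has to be compared against the genome. The number of mismatches is simply the length of the returned list.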
It's in the SAM format specification.
Thanks, brentp. That's helpful. Where can I find the documentation for those optional fields following the QUAL string? I couldn't find it in either the SAMtools documentation or the Bowtie manual.
Yes, it covers most of them, but what about the 'XA' tag? And for 'NM', what does "Edit distance to the reference, including ambiguous bases but excluding clipping" really mean?