Question

Getting the number of mismatches from a BAM record

10.3 years ago

noah ▴ 10

I'm trying to estimate the sequencing error rates (mismatches and indels) from a BAM file.

I can get the number and length of insertions and deletions from the CIGAR string by counting the "I" and "D" operations and summing their lengths.
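For reference, the CIGAR counting described above could be sketched roughly as follows. This is a minimal illustration on a plain CIGAR string, not tested against a real BAM file; the function name `indel_summary` is made up for the example. With pysam, the same information should be available from `read.cigartuples`, where operation codes 1 and 2 stand for insertion and deletion.

```python
import re

# One CIGAR operation: a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indel_summary(cigar):
    """Return {op: (event_count, total_bases)} for insertions (I) and deletions (D).

    Hypothetical helper for illustration; expects a CIGAR string such as "10M2I5M1D20M".
    """
    summary = {"I": [0, 0], "D": [0, 0]}
    for length, op in CIGAR_OP.findall(cigar):
        if op in summary:
            summary[op][0] += 1            # one more indel event
            summary[op][1] += int(length)  # bases involved in that event
    return {op: tuple(counts) for op, counts in summary.items()}

print(indel_summary("10M2I5M1D20M3I4M"))
```

Keeping the event count and the base count separate allows either a per-event or a per-base indel rate to be computed later.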

How do I calculate the number of mismatches without going to the reference fasta file? We can assume the MD tag is present, but I haven't figured out how to actually parse it properly.

(I'm using Python with pysam, if someone has example code somewhere.)
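One way the MD tag could be parsed is sketched below, assuming the tag follows the SAM specification's layout: runs of digits for matching bases, single letters for mismatched reference bases, and `^` followed by letters for deleted reference bases. The function name `count_md_mismatches` is hypothetical. The MD string itself would come from something like `read.get_tag("MD")` in pysam. Insertions do not appear in MD at all, so they still have to come from the CIGAR string.

```python
import re

# An MD tag is a sequence of: match-run lengths, deletions (^ACGT...), single mismatched bases.
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")

def count_md_mismatches(md):
    """Return (mismatched_bases, deleted_bases) described by an MD tag string.

    Hypothetical helper for illustration; assumes a well-formed MD value.
    """
    mismatches = 0
    deleted = 0
    for _match_run, deletion, mismatch in MD_TOKEN.findall(md):
        if deletion:
            deleted += len(deletion) - 1  # drop the leading '^'
        elif mismatch:
            mismatches += 1               # one reference base that differs from the read
    return mismatches, deleted

print(count_md_mismatches("10A5^AC6T0G3"))
```

The deletion alternative is tried before the single-letter alternative, so bases after a `^` are never miscounted as mismatches.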

python bam • 5.0k views

updated 3.1 years ago by Ram 45k • written 10.3 years ago by noah ▴ 10

Ram · Answer 1 · 2015-02-09

10.3 years ago

Zev.Kronenberg 12k

Simply use the NM tag.

updated 3.1 years ago by Ram 45k • written 10.3 years ago by Zev.Kronenberg 12k

This is, I believe, the edit distance, and therefore dependent on the scoring scheme used by the aligner. Is there a general way to recover the number of mismatches from it? Also, bwa appears to count only mismatches, whereas bowtie includes insertions and deletions in its NM.
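If a given aligner's NM follows the SAM specification's definition (mismatches plus inserted and deleted bases), the mismatch count can be recovered by subtracting the indel bases found in the CIGAR string. The sketch below rests on that assumption and would be wrong for an aligner that writes NM differently, so it is worth checking against a few reads first. The name `mismatches_from_nm` is hypothetical.

```python
import re

def mismatches_from_nm(nm, cigar):
    """Estimate mismatches as NM minus inserted and deleted bases.

    Assumes NM = mismatches + inserted bases + deleted bases (SAM spec definition).
    """
    indel_bases = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in "ID"
    )
    return nm - indel_bases

# A read with NM=6 and three indel bases in its CIGAR (2I + 1D).
print(mismatches_from_nm(6, "10M2I5M1D20M"))
```

A negative result would be a sign that the aligner's NM does not include indels, in which case NM itself may already be the mismatch count.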

updated 3.1 years ago by Ram 45k • written 10.3 years ago by noah ▴ 10