Getting the number of mismatches from a BAM record
9.8 years ago
noah ▴ 10

I'm trying to estimate the sequencing error rates (mismatches and indels) from a BAM file.

I can get the number and length of insertions and deletions from the CIGAR string by counting the "I" and "D" operations and summing their lengths.

How do I calculate the number of mismatches without going to the reference fasta file? We can assume the MD tag is present, but I haven't figured out how to actually parse it properly.

(I'm using Python with pysam, if someone has example code somewhere.)
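
For reference, here is a minimal sketch of one way to parse the MD tag. It assumes the tag follows the format in the SAM optional-fields specification: runs of digits for matching bases, a single reference base for each mismatch, and `^` followed by reference bases for each deletion. The name `parse_md` is made up for illustration; with pysam the string would typically come from `read.get_tag("MD")`.

```python
import re

# One MD token is either a run of matches (digits), a deletion
# ('^' followed by the deleted reference bases), or a single
# mismatched reference base.
_MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def parse_md(md):
    """Return (mismatches, deleted_bases) for an MD tag string.

    Hypothetical helper, not part of pysam. Assumes a well-formed
    MD string as described in the SAM optional-fields specification.
    """
    mismatches = 0
    deleted_bases = 0
    for _match_run, deletion, mismatch in _MD_TOKEN.findall(md):
        if deletion:
            # Skip the leading '^' when counting deleted bases.
            deleted_bases += len(deletion) - 1
        elif mismatch:
            mismatches += 1
    return mismatches, deleted_bases


if __name__ == "__main__":
    # 10 matches, mismatch (ref A), 5 matches, 2-base deletion, 6 matches
    print(parse_md("10A5^AC6"))
```

Insertions never appear in MD, so those still have to come from the CIGAR string.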

python bam • 4.8k views
9.8 years ago

Simply use the NM tag.


This is, I believe, the edit distance, so exactly what it counts may depend on the aligner. Is there a general way to back out the number of mismatches from it? For example, bwa appears to count only mismatches, whereas bowtie includes insertions and deletions in its NM.
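
If NM follows the definition in the SAM optional-fields specification (mismatches plus inserted and deleted bases), the mismatch count can be backed out by subtracting the indel bases found in the CIGAR. The sketch below assumes exactly that, and assumes the CIGAR is given as `(operation, length)` pairs with the numeric codes that pysam's `read.cigartuples` uses (1 for insertion, 2 for deletion). The name `mismatches_from_nm` is hypothetical.

```python
# Numeric CIGAR operation codes as used in BAM (and by pysam's cigartuples).
CIGAR_INS = 1
CIGAR_DEL = 2


def mismatches_from_nm(nm, cigartuples):
    """Estimate mismatches as NM minus inserted and deleted bases.

    Hypothetical helper. Only valid under the assumption that the
    aligner's NM counts mismatches + inserted bases + deleted bases.
    """
    indel_bases = sum(
        length for op, length in cigartuples if op in (CIGAR_INS, CIGAR_DEL)
    )
    return nm - indel_bases


if __name__ == "__main__":
    # 10M 2I 5M 1D 20M with NM = 4: 3 indel bases, so 1 mismatch
    cigar = [(0, 10), (1, 2), (0, 5), (2, 1), (0, 20)]
    print(mismatches_from_nm(4, cigar))
```

A negative result would suggest that the aligner's NM does not include indels, in which case NM itself may already be the mismatch count.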
