Entering edit mode
9.8 years ago
noah
▴
10
I'm trying to estimate the sequencing error rates (mismatches and indels) from a bam file.
I can get the number and length of insertions and deletions from the cigar string by counting the number and length of the "I" and "D" values.
How do I calculate the number of mismatches without going to the reference fasta file? We can assume the MD tag is present, but I haven't figured out how to actually parse it properly.
(I'm using python with pysam, if someone has example code somewhere.)
This is, I believe, the edit distance, and therefore dependent on the scoring scheme used by the aligner. Is there some way to generally back out the number of mismatches from this? Also, bwa appears to include only mismatches, but bowtie includes insertions and deletions in its NM.