I realized that "M" CIGAR doesn't mean no mismatch. For example, 150M doesn't mean perfect match. It could contain mismatch which is specified in MD tag.
In the SAM SPEC, 'M' means alignment match (can be a sequence match or mismatch)
My confusion is:
1) If 'M' means either alignment with mismatch or without match, I feel it is kind of redundant. MD tag alone can provide more detailed information regarding where the mismatch is.
2) When variant caller accept bam as input, which information does caller take to make the call? My understanding is MD tag (tell where the variant is), base call quality and maybe AS tag (use base quality and mapping quality to estimate the variant call confidence). Am I understand right? Does caller take anything else from bam when calling the variant?
Thanks a lot for any comments:)
BBMap tools that process sam/bam files and do anything related to mismatches (like calling variants or quality analysis) require either:
With the old 'M'-style cigar strings there's no way to tell if a base is a match or mismatch by examining the line alone, which kind of defeats the purpose of creating a file format for aligned sequences. I'd be happy to see MD tags disappear, but only if 'M'-style cigar strings disappear at the same time.
Additionally, variant-callers tend to use flag bits, particularly concerning whether the read is properly paired, and possibly the pnext or tlen fields to consider the observed insert size.
I can't believe I forgot flags and pnext/tlen!
Thanks for explaining. But different aligners set MAPQ value differently. For example bwa set MAPQ 255 as quality not available while STAR set 255 as uniquely mapped. How variant caller determine MAPQ value then?
You variant call DNA sequencing aligned with tools such as BWA, however RNA is a quantification experiment, not intended for genotyping. If you want to use RNA sequencing aligned with STAR to perform genotyping then you'd need to use GATK's Split'N'Trim tool to fix the MQ255 issue (amongst other caveats) - Here's a link to their attempt at performing genotyping on RNA sequencing data.
As andrew.j.skelton73 mentioned, RNAseq aligners have their own set of issues. Variant callers mostly need to know that "higher is better", with the possible exception of 255. So they can then use that in setting a priori expectations.