Question

Two confusion in bam format

0

Entering edit mode

7.2 years ago

CY ▴ 750

I realized that "M" CIGAR doesn't mean no mismatch. For example, 150M doesn't mean perfect match. It could contain mismatch which is specified in MD tag.

In the SAM SPEC, 'M' means alignment match (can be a sequence match or mismatch)

My confusion is:

1) If 'M' means either alignment with mismatch or without match, I feel it is kind of redundant. MD tag alone can provide more detailed information regarding where the mismatch is.

2) When variant caller accept bam as input, which information does caller take to make the call? My understanding is MD tag (tell where the variant is), base call quality and maybe AS tag (use base quality and mapping quality to estimate the variant call confidence). Am I understand right? Does caller take anything else from bam when calling the variant?

Thanks a lot for any comments:)

bam alignment sam • 2.1k views

ADD COMMENT • link updated 7.2 years ago by Devon Ryan 104k • written 7.2 years ago by CY ▴ 750

score 1 · Answer 1 · 2017-09-28

1

Entering edit mode

7.2 years ago

Devon Ryan 104k

All auxiliary tags are optional, there needn't be an MD tag at all. I don't know of any tools actually using the MD tag these days, so I wouldn't be surprised if it goes the way of the dodo.
Alignment sequence, base qualities, MAPQ, and maybe the CIGAR string.

ADD COMMENT • link 7.2 years ago by Devon Ryan 104k

2

Entering edit mode

BBMap tools that process sam/bam files and do anything related to mismatches (like calling variants or quality analysis) require either:

1) cigar strings in the modern format (with = and X for match and mismatch)
or
2) MD tags.

With the old 'M'-style cigar strings there's no way to tell if a base is a match or mismatch by examining the line alone, which kind of defeats the purpose of creating a file format for aligned sequences. I'd be happy to see MD tags disappear, but only if 'M'-style cigar strings disappear at the same time.

Additionally, variant-callers tend to use flag bits, particularly concerning whether the read is properly paired, and possibly the pnext or tlen fields to consider the observed insert size.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

I can't believe I forgot flags and pnext/tlen!

ADD REPLY • link 7.2 years ago by Devon Ryan 104k

1

Entering edit mode

Thanks for explaining. But different aligners set MAPQ value differently. For example bwa set MAPQ 255 as quality not available while STAR set 255 as uniquely mapped. How variant caller determine MAPQ value then?

ADD REPLY • link 7.2 years ago by CY ▴ 750

2

Entering edit mode

You variant call DNA sequencing aligned with tools such as BWA, however RNA is a quantification experiment, not intended for genotyping. If you want to use RNA sequencing aligned with STAR to perform genotyping then you'd need to use GATK's Split'N'Trim tool to fix the MQ255 issue (amongst other caveats) - Here's a link to their attempt at performing genotyping on RNA sequencing data.

ADD REPLY • link 7.2 years ago by andrew.j.skelton73 6.6k

1

Entering edit mode

As andrew.j.skelton73 mentioned, RNAseq aligners have their own set of issues. Variant callers mostly need to know that "higher is better", with the possible exception of 255. So they can then use that in setting a priori expectations.

ADD REPLY • link 7.2 years ago by Devon Ryan 104k