Question

Determine fragment alignment overlaps by parsing CIGAR string in the MC tag

1

Entering edit mode

3.4 years ago

abcoxyzide ▴ 20

I have some paired end short read sequencing data. I'd like to know the fragment alignment overlap between the paired reads, and prune that part out for one of the reads to avoid processing the same region twice. For my specific downstream purposes, it has to be done in C.

My current idea is to use the MC tag, which contains the CIGAR string of the mate read, and pass it to bam_cigar2rlen(). Together with core.mpos, I should be able to determine the end position of the mate read.

However, bam_cigar2rlen() takes int n_cigar and const uint32_t *cigar for the arguments, and it is unclear to me how to get these parameters from the MC tag. I am also not sure if parsing the CIGAR string like this is the most efficient way.

I'm a novice to C and I'd highly appreciate any help. Thanks!

C htslib • 1.4k views

ADD COMMENT • link 3.4 years ago by abcoxyzide ▴ 20

score 3 · Accepted Answer · 2022-01-01

3

Entering edit mode

3.4 years ago

John Marshall 3.1k

Unless your data files have another (locally-defined) tagged field giving the mate's end position directly, computing it by parsing the MC tagged field and adding the length returned by bam_cigar2rlen() to MPOS is indeed the best way to compute this.

As you've seen, bam_cigar2rlen() uses HTSlib's already-parsed internal CIGAR representation. Since v1.12, HTSlib has sam_parse_cigar() for parsing a C string (as returned by bam_aux2Z() on your MC field) into the representation you need.

ADD COMMENT • link 3.4 years ago by John Marshall 3.1k

1

Entering edit mode

Thank you so much, it worked wonderfully! I wasn't aware of the bam_aux2Z() function and was messing around with bam_aux_get_str(), which caused a few unexpected behaviors.

ADD REPLY • link 3.4 years ago by abcoxyzide ▴ 20