Determine fragment alignment overlaps by parsing CIGAR string in the MC tag
1
1
Entering edit mode
2.9 years ago
abcoxyzide ▴ 20

I have some paired end short read sequencing data. I'd like to know the fragment alignment overlap between the paired reads, and prune that part out for one of the reads to avoid processing the same region twice. For my specific downstream purposes, it has to be done in C.

My current idea is to use the MC tag, which contains the CIGAR string of the mate read, and pass it to bam_cigar2rlen(). Together with core.mpos, I should be able to determine the end position of the mate read.

However, bam_cigar2rlen() takes int n_cigar and const uint32_t *cigar for the arguments, and it is unclear to me how to get these parameters from the MC tag. I am also not sure if parsing the CIGAR string like this is the most efficient way.

I'm a novice to C and I'd highly appreciate any help. Thanks!

C htslib • 1.1k views
ADD COMMENT
3
Entering edit mode
2.9 years ago

Unless your data files have another (locally-defined) tagged field giving the mate's end position directly, computing it by parsing the MC tagged field and adding the length returned by bam_cigar2rlen() to MPOS is indeed the best way to compute this.

As you've seen, bam_cigar2rlen() uses HTSlib's already-parsed internal CIGAR representation. Since v1.12, HTSlib has sam_parse_cigar() for parsing a C string (as returned by bam_aux2Z() on your MC field) into the representation you need.

ADD COMMENT
1
Entering edit mode

Thank you so much, it worked wonderfully! I wasn't aware of the bam_aux2Z() function and was messing around with bam_aux_get_str(), which caused a few unexpected behaviors.

ADD REPLY

Login before adding your answer.

Traffic: 1615 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6