Perhaps my Google-fu is failing me, because this doesn't seem like an uncommon need, but I can't find any answer here.
Say I have a SAM/BAM file, and a variant at a known location. I'd like to extract all the alignments from the BAM file which contain that variant. That is, I'd like the ALT, not the REF allele. I do not want all the reads at that location. Is there an existing tool or workflow out there that can accomplish this?
I would honestly be surprised if there weren't, since I imagine it's common to be interested in a certain variant and to investigate the reads supporting it. Specifically, I'd like to make sure there's no bias in where along the read the variant falls, or in the strand of the read. I should note that my coverage is very high, so it's not feasible to extract all the reads covering the site and sort them out manually.
I've already started a script that does this by parsing CIGAR strings, but I wanted to check that it doesn't already exist before going further down this road.
Maybe something like this:
@PhilS: your solution won't help with parsing the CIGAR string.
Sorry, I thought you needed all the information a BAM file gives for each read. If you only need the CIGAR string, you can do it like this:
This will give you the CIGAR strings of all reads containing 'pattern'.
No, you have to walk along the CIGAR string to see whether the read contains the variant. You cannot use a regex for this.
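To illustrate what walking the CIGAR string means here, below is a minimal Python sketch, not an existing tool. It assumes the variant is a single-base substitution, that the input is plain SAM text lines (1-based POS in column 4, CIGAR in column 6, SEQ in column 10), and that the function names `read_offset_at` and `supports_alt` are made up for this example. The regex is only used to split the CIGAR into operations; the coordinate bookkeeping is done by the loop.

```python
import re

# Splits a CIGAR string such as "2S3M1I4M" into (length, op) pairs.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_offset_at(pos, cigar, ref_pos):
    """Return the 0-based offset in SEQ aligned to the 1-based reference
    coordinate ref_pos, or None if the read has no base there
    (deletion, skipped region, or position outside the alignment)."""
    ref = pos      # reference coordinate of the next reference-consuming base
    query = 0      # offset in SEQ of the next query-consuming base
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":            # consumes both reference and query
            if ref <= ref_pos < ref + length:
                return query + (ref_pos - ref)
            ref += length
            query += length
        elif op in "IS":           # consumes query only
            query += length
        elif op in "DN":           # consumes reference only
            if ref <= ref_pos < ref + length:
                return None
            ref += length
        # H and P consume neither reference nor query
    return None


def supports_alt(sam_line, ref_pos, alt_base):
    """Return (offset_in_SEQ, strand) if the read carries alt_base at
    ref_pos, otherwise None. Hypothetical helper for SNVs only."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, pos, cigar, seq = int(fields[1]), int(fields[3]), fields[5], fields[9]
    if flag & 0x4 or cigar == "*":     # unmapped read or no alignment
        return None
    offset = read_offset_at(pos, cigar, ref_pos)
    if offset is None or seq[offset].upper() != alt_base.upper():
        return None
    strand = "-" if flag & 0x10 else "+"
    return offset, strand
```

With high coverage you would probably feed this only the reads overlapping the site, for example the output of `samtools view` restricted to that region on an indexed BAM, and keep the lines for which `supports_alt` returns a value. Note that SEQ is stored in reference orientation, so for reverse-strand reads the position along the original read is `len(seq) - 1 - offset`. Indel variants would need a different check (looking for the I or D operation at the site) that this sketch does not attempt.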