Entering edit mode
5.5 years ago
sppearce
▴
10
I have a vcf file describing a set of SNPs, insertions and deletions. I would like to extract those reads from the bam file which correspond to these variants for some quality control.
For SNPs, I can loop through using samjs and produce one new bam file per SNP, but I don't know how to generalise this for indels (and I want those specific ones that match the vcf file, not just any indel at that base).
Is there any existing tool that will let me do this? Or a way to get samjs to do this.
it's the same thing, you 'just' have to test for the cigar elements 'D/N' or 'I'.
That presumably would get me any indel that exists covering the start position, rather than the specific indel in the vcf (I realise that is probably pretty niche).
CIGAR string does not only contain the operations (D, I, etc) but also the length of the operation. By comparing the length with the indel size in VCF, maybe you would be able to count the reads that support the specific indel in question. I usually use pysam for this kind of CIGAR queries.
https://pysam.readthedocs.io/en/latest/