I am trying to use pysam to extract the CIGAR string from a bam file of aligned long-read RNA sequencing data. I want to extract the CIGAR string from only a certain region in the read, NOT the CIGAR string of the entire read that overlaps this region. The reason for this is that I am working with long read sequencing data, so the CIGAR string is quite long, and I want to look at the string surrounding each intron in each read.
As input, I have the aligned bam file and a bed file of the introns (+/- a window) that I am interested in.
I have currently been trying to use pysam to extract the CIGAR string from an intron as follows:
import pysam
bamfile = pysam.AlignmentFile("wt_reads.bam", "rb")
for read in bamfile.fetch('chr9', 108339540, 108339796):
print(read.cigarstring)
This outputs a list of CIGAR strings for reads overlapping the region in .fetch(), for example:
354M218N187M
283M218N559M
283M218N141M
However, this is the CIGAR string for entire reads (with lengths of 759, 1060, and 642 in the above example). I would like only the CIGAR string of the 256 nt region chr9:108339540-108339796. Eventually I would like to return the CIGAR strings for all regions (introns) in the bed file.
Is there a way to do this with pysam or other tools?
Hello,
no solution but a more general procedure to get you result: You must take the mapping starting position into account and calculate at which position in the read's sequence your region of interest starts and ends. According to this you have to manipulate/trim your CIGAR.
fin swimmer