Hello all,
I want to get all genomic locations (start and end) where the alignment occurred. For this, I am trying to write a Python script. I plan to use the CIGAR string from the SAM file to find the number of matches and the starting position of the alignment.
I have multiple lists of tuples (i, j):
[(0, 117), (3, 29773), (0, 253), (2, 1325), (0, 145)]
[(0, 116), (2, 1), (3, 3419), (0, 327), (3, 21529), (0, 286), (2, 1)]
[(0, 117), (3, 25275), (0, 180), (1, 1), (0, 1), (3, 5895), (0, 145)]
And I have another list which consists of some numbers.
[66905968, 66906104, 66905996]
Desired output:
For each number in my list, I want to add up the values (j) from the tuples where i = 0 or 2, with one condition: every time i is 3, the adding should stop, and the running total plus that j should be used as the next starting point.
For example for:
[(0, 117), (3, 29773), (0, 253), (2, 1325), (0, 145)]
and
66905968
I want:
66905968 , 66905968+117
66905968+117+29773, 66905968+117+29773+253+1325+145
I have the following code so far:
import pysam
import sys
pos = []
new = []
reffile = pysam.Fastafile("ref.fasta")
pure_bam = pysam.AlignmentFile('sample.bam', "rb")
for read in pure_bam:
    for read in pure_bam:
        pos.append(read.pos)
        for pp_sam in pos:
            for i, j in read.cigar:
                while i == 0 or i == 2:
                    new.append(j + pp_sam)
This is definitely not giving the desired output. Could someone help me? Thank you very much in advance.
Unless this is a learning exercise, others have done this already:
Going From Cigar String In Sam To Genomic Coordinates?
Python Cigar String - Finding Indels Break Points Positions
and possibly others.
Could anyone see what I am doing wrong here?
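A few things stand out in the code as posted. The `while i == 0 or i == 2:` loop never changes `i`, so it never ends once it is entered. The start positions are also collected across all reads and then combined with the CIGAR of whichever read is current, so positions and CIGARs get mixed up. Below is a minimal sketch of the rule described in the question, working on plain (operation, length) tuples so it runs without a BAM file. The function name `aligned_blocks` and the constant `REF_CONSUMING` are made up for illustration. Treating operations 7 and 8 (`=` and `X`) like 0 goes slightly beyond the question, on the assumption that every reference-consuming match should count.

```python
# CIGAR operation codes as used in the tuples above:
# 0 = M (match), 1 = I (insertion), 2 = D (deletion), 3 = N (skipped region).
# 7 (=) and 8 (X) also consume the reference; including them is an assumption.
REF_CONSUMING = {0, 2, 7, 8}
SKIP = 3


def aligned_blocks(start, cigar):
    """Return a list of (start, end) reference intervals for one read.

    A block grows with every reference-consuming operation and is
    closed whenever a skipped region (operation 3) is encountered.
    """
    blocks = []
    block_start = start
    pos = start
    for op, length in cigar:
        if op in REF_CONSUMING:
            pos += length                      # extend the current block
        elif op == SKIP:
            blocks.append((block_start, pos))  # close the current block
            pos += length                      # jump over the skipped region
            block_start = pos                  # the next block starts here
        # anything else (e.g. 1 = insertion) does not move along the reference
    blocks.append((block_start, pos))          # close the final block
    return blocks


cigar = [(0, 117), (3, 29773), (0, 253), (2, 1325), (0, 145)]
for block in aligned_blocks(66905968, cigar):
    print(block)
```

For the example in the question this gives (66905968, 66906085) and (66935858, 66937581), which matches the sums written out there. With pysam, something like `aligned_blocks(read.reference_start, read.cigartuples)` inside a single `for read in pure_bam:` loop should keep each start position paired with its own CIGAR; note that pysam positions are 0-based.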