Python Cigar String - Finding Indels Break Points Positions
2
0
Entering edit mode
11.4 years ago
stolarek.ir ▴ 700

Hi. I am trying to retrieve exact break point positions from the CIGAR string using python. I know how to use the regex to retreieve integers that are upstream the letter denoting insertion or deletion. How can I sum these integers, or do anything else, so that my script will report the start and end position of my indel?

import re
a = '12M1I'

if 'I' or 'M' in a:
    matchI = re.findall(r'(\d+)I', a)
    intlistI = [int(x) for x in matchI]   
    print matchI
    matchM = re.findall(r'(\d+)M', a)
    intlistM = [int(x) for x in matchM]
    print matchM

or just simply:

match = re.findall(r'(\d+)(\w)', a)
print match
cigar • 14k views
ADD COMMENT
3
Entering edit mode
11.4 years ago
Gabriel R. ★ 2.9k

If I were you, I would simply use pysam (as much as I dislike that "library", it's still good to read bam files albeit slowly), the cigarline is parsed automatically. In the following example you have the type and the length :

#!/usr/bin/python

import sys; 

import pysam

bamFile    = sys.argv[1];

bamFP = pysam.Samfile(bamFile, "rb");

for read in bamFP:
      if( not( read.is_unmapped ) ):   #if it's mapped
                    cigarLine=read.cigar;

          for (cigarType,cigarLength) in cigarLine:
            try:
                if(  cigarType == 0): #match                  
                elif(cigarType == 1): #insertions
                elif(cigarType == 2): #deletion
                elif(cigarType == 3): #skip
                elif(cigarType == 4): #soft clipping
                elif(cigarType == 5): #hard clipping
                elif(cigarType == 6): #padding
                else:
                    print "Wrong CIGAR number";
                    sys.exit(1);
            except:
                print "Problem";

And double check the cygartypes, it's been a while since I used that script :-)

ADD COMMENT
0
Entering edit mode

Hmm it's not really helpfull. I'm already using pysam later in my script. I do have the reports of whether I have an insertion or deletion. I only want the script to report the exact location of my indel. So going with the starting position of read alignment what is the position of the indel. example: I have read 10M1I10M1D starting position let's say 10 and I would like an output: 1 insertion 20 1 deletion 31

ADD REPLY
0
Entering edit mode

store the initial position read.pos in a variable, for each match/deletion, increase the variable by cigarLength, report the insertions/deletion.

ADD REPLY
0
Entering edit mode

Yes, the problem is simple: a = '10M1I10M1D' pos = 10 match = re.findall(r'(\d+)(\w)', a) print match for i in match: #print i[0],i[1] posindel = pos + int(i[0]) print posindel,i[0],i[1]

Does not report the correct position of every next event

ADD REPLY
0
Entering edit mode

I am confused, is it a CIGAR problem or a problem with the program ?

ADD REPLY
2
Entering edit mode
11.4 years ago
stolarek.ir ▴ 700

Ok, the solution was found, if anyone will be ver interested:

>>> import re
>>>
>>> a = '1D10M1I10M1D'
>>>
>>> start = 10
>>> for num1, i_or_d, num2, m in re.findall('(\d+)([ID])(\d+)?([A-Za-z])?', a):
...     print num1, i_or_d, start
...     if num1:
...         start += int(num1)
...     if num2:
...         start += int(num2)
...
1 D 10
1 I 21
1 D 32

Thread : http://stackoverflow.com/questions/17526851/regular-expression-report-position-by-integer-count-from-string

ADD COMMENT

Login before adding your answer.

Traffic: 1804 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6