Question

Deciphering Cigar

9

Entering edit mode

13.1 years ago

Leszek 4.2k

Hi, I need to get 5'-end position of each read in SAM file. For reads aligned in forward direction it's provided in SAM file (POS). But for reads aligned onto reverse strand the one has to figure it out from the cigar string. You cannot simple add length of the read to left-most position, as reads may be soft-clipped or some indel may occure.

I have written simple python function in order to find right most position of aligned read. Could you have a look and comment on that as I'm not sure if it's correct?

import re

#should nicely separate CIGAR entries
cigar_pat = re.compile(r"\d+[MIDNSHP=X]{1}")

def cigar2end( left,cigar ):
  """Return right-most position of aligned read."""
  #store info about each CIGAR category
  counts={ "M":0, #M 0 alignment match (can be a sequence match or mismatch)
           "I":0, #I 1 insertion to the reference
           "D":0, #D 2 deletion from the reference
           "N":0, #N 3 skipped region from the reference
           "S":0, #S 4 soft clipping (clipped sequences present in SEQ)
           "H":0, #H 5 hard clipping (clipped sequences NOT present in SEQ)
           "P":0, #P 6 padding (silent deletion from padded reference)
           "=":0, #= 7 sequence match
           "X":0, #X 8 sequence mismatch
        }
  #split cigar entries
  for centry in cigar_pat.findall(cigar):
    ccount  = int(centry[:-1])
    csymbol = centry[-1]
    counts[csymbol] = ccount
  #get number of aligned 'reference' bases
  aligned = counts["M"] + counts["D"] + counts["N"] #+ counts["="] + counts["X"]
  right   = left + aligned
  return right

cigar sam samtools next-gen alignment • 13k views

ADD COMMENT • link updated 3.4 years ago by cmdcolin ★ 4.3k • written 13.1 years ago by Leszek 4.2k

5

Entering edit mode

You might consider using pysam for this type of thing.

ADD REPLY • link 13.1 years ago by Sean Davis 27k

0

Entering edit mode

thanks, got it:)

ADD REPLY • link 13.1 years ago by Leszek 4.2k

0

Entering edit mode

I was writing my own codes to process the CIGAR line in order to extract different information that I needed. But now I has been using Pysam and I save a lot of time because it have very useful functions and transform the CIGAR line into a tuple very easy to process.

ADD REPLY • link 13.1 years ago by Geparada ★ 1.5k

Ram · Answer 1 · 2012-07-04

5

Entering edit mode

13.1 years ago

Leszek 4.2k

As Sean suggested I use pysam:

import pysam
samfile = pysam.Samfile( "somefile.bam","rb" )
#process all reads in region
for sam in samfile.fetch( ref,start,end ):
    print sam.pos, sam.aend

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 13.1 years ago by Leszek 4.2k

0

Entering edit mode

does your code return the same start/end ?

ADD REPLY • link 13.1 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

what do you mean by the same start/end? it returns first and last aligned base of the read on the ref

ADD REPLY • link 13.1 years ago by Leszek 4.2k

0

Entering edit mode

Thanks for sharing your code. I used them to find start and end position of reads but it asks me to define ref/start/end. Any suggestions?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.7 years ago by Bionic • 0

0

Entering edit mode

You need to provide the values for "ref", "start", and "end". The code, as written above, assumes that has been done elsewhere.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.7 years ago by Sean Davis 27k

score 0 · Answer 2 · 2022-03-13

0

Entering edit mode

3.4 years ago

marcomonti.mm94 • 0

Hi,

I found myself in a situation where I cannot use pysam. Therefore, I have created a python function to calculate the number of bases of each type in the CIGAR string. I started from your function. Maybe it could be useful also for you.

def cigar_to_dict(CIGAR: str):
# store info about each CIGAR category
cigar_dict = {"M": 0,  # M 0 alignment match (can be a sequence match or mismatch)
          "I": 0,  # I 1 insertion to the reference
          "D": 0,  # D 2 deletion from the reference
          "N": 0,  # N 3 skipped region from the reference
          "S": 0,  # S 4 soft clipping (clipped sequences present in SEQ)
          "H": 0,  # H 5 hard clipping (clipped sequences NOT present in SEQ)
          "P": 0,  # P 6 padding (silent deletion from padded reference)
          "=": 0,  # = 7 sequence match
          "X": 0}  # X 8 sequence mismatch

# Note the count variable is a string
count = ""
for i in CIGAR:
    if i.isnumeric():
        count = count + i
    else:
        cigar_dict[i] = cigar_dict[i] + int(count)
        count = ""      # reset count
return cigar_dict

ADD COMMENT • link 3.4 years ago by marcomonti.mm94 • 0

1

Entering edit mode

not sure this is correct, I don't know python well but if you are processing the string one character at a time then you will not be handling numbers >=10...

ADD REPLY • link 3.4 years ago by cmdcolin ★ 4.3k

0

Entering edit mode

To overcome the problem you mentioned I summed the numbers as strings: count = count + i . For example, the number 17 will be "1" + "7" = "17" and then I converted it in integer int("17"). P.S. I only sum the string if is a number (if i.isnumeric():). It is a bit ugly but it works.

ADD REPLY • link 3.4 years ago by marcomonti.mm94 • 0

1

Entering edit mode

Nice, good fix. that method is actually a bit faster than a regex based approach in a little microbenchmark i whipped up (js, not python) https://jsbench.me/gpl0pyj08n/1

ADD REPLY • link 3.4 years ago by cmdcolin ★ 4.3k