Question

index of the residue matched using pairwise2 in biopython

6.5 years ago

akdbharadwajiitkgp • 0

I would like to know the indices of the residues that match when aligning two strings with pairwise2 in Biopython.

For example I have two strings

A:' EEEEE      HHH     HHH             EEEEE'

and

B: 'EEE       EEEE       HHH'

and I want to find the best match between them using the local alignment functions of pairwise2 in Biopython. One of the alignments that I get is:

EEE-------EE---      HHH     HHH             EEEEE
|||       ||   |||||||||
EEE       EEEE       HHH--------------------------
  Score=29.6

I want to get the indices of the matches, i.e. the positions of all the Es, Hs and ' ' characters in seq A that matched seq B.

How do I do that?

python sequence pairwise biopython • 3.4k views

updated 6.5 years ago by Russ ▴ 520 • written 6.5 years ago by akdbharadwajiitkgp • 0

score 1 · Answer 1 · 2018-06-07

Hello,

You could write a function that iterates over the two aligned strings and records the positions at which both contain the same symbol.

from Bio import pairwise2

def match_index(alignment):
    matches = []

    for i, (a, b) in enumerate(zip(alignment[0], alignment[1])):
        if a == b:
            matches.append(i)

    return matches

seq1 = " EEEEE      HHH     HHH             EEEEE"
seq2 = "EEE       EEEE       HHH"

alignments = pairwise2.align.localxx(seq1, seq2)
m = match_index(alignments[0])
print(m)

fin swimmer
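The indices returned above are column positions in the gapped alignment, whereas the question asks for positions in seq A. A minimal sketch of one way to translate between the two follows. The function name `matched_positions` and its parameters are hypothetical, it works on plain gapped strings rather than calling Biopython, and it assumes "-" is the gap character and never occurs in the sequences themselves.

```python
def matched_positions(aligned_a, aligned_b, begin=0, end=None, gap="-"):
    """Return (index_in_a, index_in_b) pairs for identical aligned characters.

    Indices are 0-based positions in the original, ungapped sequences.
    Only columns in [begin, end) are reported, mirroring the begin/end
    values that accompany a local alignment.
    """
    if end is None:
        end = len(aligned_a)
    pairs = []
    pos_a = pos_b = 0  # next ungapped position in each sequence
    for col, (a, b) in enumerate(zip(aligned_a, aligned_b)):
        if begin <= col < end and a == b and a != gap:
            pairs.append((pos_a, pos_b))
        # Advance a sequence counter only when the column is not a gap in it.
        if a != gap:
            pos_a += 1
        if b != gap:
            pos_b += 1
    return pairs


# Toy gapped strings (made up for illustration, not pairwise2 output).
print(matched_positions("EE-HH", "EEXH-"))  # [(0, 0), (1, 1), (2, 3)]
```

With a pairwise2 result `aln`, this could be called as `matched_positions(aln[0], aln[1], aln[3], aln[4])`, since each result holds the two gapped strings, the score, and the begin and end columns in that order.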

score 1 · Answer 2 · 2018-06-07

I took a look at the pairwise2 module but also couldn't find an easy way to obtain the indices of the matches ("|"). I was able to modify the format_alignment method slightly so that it returns a list of 0-based indices of the pipe characters for this specific example, but I'm not sure how it will perform under different conditions. The thing to keep an eye on is the "begin" variable: when it is not 0, as can happen with local alignments, the indices have to be offset by it.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment


## This is a modification of the format_alignment method
def match_index(align1, align2, score, begin, end):
    """Return the alignment-column indices of identical matches.

    Builds the same match line as format_alignment: since Biopython 1.71
    identical matches are shown with a pipe character, mismatches as a
    dot, and gaps as a space.

    score is unused; it is accepted so that a pairwise2 result can be
    unpacked straight into this function.
    """
    s = []
    for a, b in zip(align1[begin:end], align2[begin:end]):
        if a == b:
            s.append("|")  # match
        elif a == "-" or b == "-":
            s.append(" ")  # gap
        else:
            s.append(".")  # mismatch

    ## Obtain indices of matching characters (indicated by the "|" character).
    ## Adding begin keeps the indices correct when begin != 0.
    c = []
    for pos, char in enumerate(s):
        if char == "|":
            c.append(pos + begin)

    return c


alignments = pairwise2.align.globalxx(' EEEEE      HHH     HHH             EEEEE', 'EEE       EEEE       HHH')
print(format_alignment(*alignments[0]))
print(match_index(*alignments[0]))

Results:

 EEEEE      HHH     HHH      ----       EEEEE---
   |||             |   ||||||    |||||||        
---EEE------------- ---      EEEE       -----HHH
  Score=17

[3, 4, 5, 19, 23, 24, 25, 26, 27, 28, 33, 34, 35, 36, 37, 38, 39]

The other caveat is that I could not exactly replicate your alignment without the code that you used, so the alignment shown here differs from the one in your example. Hope this helps.