index of the residue matched using pairwise2 in biopython
2
0
Entering edit mode
6.5 years ago

I am interested in knowing the index of residues that match to a string using pairwise 2 in python.

For example I have two strings

A:' EEEEE      HHH     HHH             EEEEE'

and

B: 'EEE       EEEE       HHH'

and I am interested in finding a the best match between both of them using local pairwise2 function in biopython. One of the alignments that I get is :

EEE-------EE---      HHH     HHH             EEEEE
|||       ||   |||||||||
EEE       EEEE       HHH--------------------------
  Score=29.6

I want to get the indices of the match i.e the positions of all the Es, Hs and ' ' from seq A that matched with seq B.

How do I do that?

python sequence pairwise biopython • 3.4k views
ADD COMMENT
1
Entering edit mode
6.5 years ago

Hello,

you could write a function which iterates over the alignment strings and look at which position are the same symbols.

from Bio import pairwise2

def match_index(alignment):
    matches = []

    for i, (a, b) in enumerate(zip(alignment[0], alignment[1])):
        if a == b:
            matches.append(i)

    return matches

seq1 = " EEEEE      HHH     HHH             EEEEE"
seq2 = "EEE       EEEE       HHH"

alignments = pairwise2.align.localxx(seq1, seq2)
m = match_index(alignments[0])
print(m)

fin swimmer

ADD COMMENT
1
Entering edit mode
6.5 years ago
Russ ▴ 520

I took a look at the pairwise2 module but also couldn't find any easy way to obtain the indices of the matches ("|"). I was able to modify the format_alignment method slightly to print a list of indices (that starts at 0) of the pipe operator for this specific example - but I'm not sure how it will perform under different testing conditions. Where this may fail is if the "begin" variable != 0 - you'll have to keep an eye out for that.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment


## This is a modification of the format_alignment method 
def match_index(align1, align2, score, begin, end): 
      """Format the alignment prettily into a string. 

      Since Biopython 1.71 identical matches are shown with a pipe 
      character, mismatches as a dot, and gaps as a space. 

      Note that spaces are also used at the start/end of a local 
      alignment. 

      Prior releases just used the pipe character to indicate  
      aligned region (matches, mismatches and gaps). 
      """ 
      s = [] 
      s.append("%s\n" % align1) 
      s.append(" " * begin) 
      for a, b in zip(align1[begin:end], align2[begin:end]): 
          if a == b: 
              s.append("|")  # match 
          elif a == "-" or b == "-": 
              s.append(" ")  # gap 
          else: 
              s.append(".")  # mismatch 
      s.append("\n") 
      s.append("%s\n" % align2) 
      s.append("  Score=%g\n" % score)

      ## Obtain indices of matching characters (indicated by the "|" character)
      c = []
      for pos, char in enumerate(s):
        pipe = "|"
        if char == pipe:
            c.append(pos-2)


      return(c)
      return ''.join(s) 


alignments = pairwise2.align.globalxx(' EEEEE      HHH     HHH             EEEEE', 'EEE       EEEE       HHH')
print(format_alignment(*alignments[0]))
print(match_index(*alignments[0]))

Results:

 EEEEE      HHH     HHH      ----       EEEEE---
   |||             |   ||||||    |||||||        
---EEE------------- ---      EEEE       -----HHH
  Score=17

[3, 4, 5, 19, 23, 24, 25, 26, 27, 28, 33, 34, 35, 36, 37, 38, 39]

The other caveat is that I was unable to exactly replicate your alignment without the code that you used, so the alignment depicted here is different than the one in your example. Hope this helps.

ADD COMMENT

Login before adding your answer.

Traffic: 2842 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6