----------------------------------------------------------

Question

How to display mismatched sequences from alignment when using Biopython

0

Entering edit mode

7.3 years ago

Charles Yin ▴ 180

When using Biopython alignment to align two protein sequences, the output does not show the mismatched amino acids. Would you please have some suggestions to display the mismatched amino acids by Biopython?

The following is the code for alignment by Biopython. I tried different options, the problem could not be resolved.

----------------------------------------------------------

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

proteinSeqQuery='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
proteinSeqseqFound='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

alignments = pairwise2.align.localxx(proteinSeqQuery,proteinSeqseqFound)
print('Score',alignments[0][2])
print(format_alignment(*alignments[0]))

----------------------------------------------------------------

The program output and the output from EMBI-EBI EMBOSS Water (Protein Alignment) are shown as follows:

Biopython:
HRTLGLLLHTQVSI-QQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HRTLGLLLHTQVSIKQ-LLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK


EMBL-EBI:
EMBOSS_001         1 HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYW     50
                     ||||||||||||||:||||||||||||||||||.||||||||||||||||
EMBOSS_001         1 HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYW     50

EMBOSS_001        51 KLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK     82
                     |||||||||||||||||||||||||||||:||
EMBOSS_001        51 KLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK     82

and in this picture: output

Thank you in advance!

alignment biopython • 9.0k views

ADD COMMENT • link updated 7.3 years ago by st.ph.n ★ 2.7k • written 7.3 years ago by Charles Yin ▴ 180

1

Entering edit mode

I added (code) markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks! I will use this approach for formatting code.

ADD REPLY • link 7.3 years ago by Charles Yin ▴ 180

0

Entering edit mode

Please say which version of Biopython you are using, since the pairwise code was rewritten in Biopython 1.68

ADD REPLY • link 7.3 years ago by Peter 6.0k

score 2 · Answer 1 · 2017-07-30

2

Entering edit mode

7.3 years ago

Joe 21k

Here's a bit of a bodge that might work. Note though that it doesn't take in to consideration amino acid similarity. It would be easy enough to code in with a few more elif statements though provided you can put together a table of what matches with what.

In this instance it just leaves a blank where there's no match, by simply iterating both strings from pairwise2 and checking equivalency. You could create a slightly fancier print output too if desired but I think this works.

In essence all you need to do is create a custom string printed in the middle of the 2 alignments. It's generated by appending to a list, and then joining the list to make another string. Standard use of monospaced fonts in terminals etc means that the alignments of all the character columns is taken care of for you.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

proteinSeqQuery='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
proteinSeqseqFound='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

alignments = pairwise2.align.localxx(proteinSeqQuery,proteinSeqseqFound)

match = []

for a, b in zip(alignments[0][0],alignments[0][1]):
        if a == b:
                match.append('|')
        else:
                match.append(' ')


print(alignments[0][0])
print("".join(match))
print(alignments[0][1])

Yields:

HRTLGLLLHTQVSIQ-QLLKLPAECFHPKPKVNSV-LIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKH-AK
||||||||||||||  ||||||||||||||||||  |||||||||||||||||||||||||||||||||||||||||||||  ||
HRTLGLLLHTQVSI-KQLLKLPAECFHPKPKVNS-ALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMK-YAK

ADD COMMENT • link 7.3 years ago by Joe 21k

0

Entering edit mode

It works! Thank you very much!

ADD REPLY • link 7.3 years ago by Charles Yin ▴ 180

0

Entering edit mode

Great, if this answers your question satisfactorily, be sure to accept the answer (tick on the left hand side of the post).

ADD REPLY • link 7.3 years ago by Joe 21k

0

Entering edit mode

Also I'd maybe play around with the gap penalty and mismatch penalty a little bit since that alignment seems to be scoring gaps quite low.

For instance, it looks to me like the 2 gaps here should instead just be a mismatch, but I could be wrong...

   QVSIQ-QLLKL
   ||||  |||||
   QVSI-KQLLKL

Would make more sense to me as:

   QVSIQQLLKL
   |||| |||||
   QVSIKQLLKL

ADD REPLY • link 7.3 years ago by Joe 21k

1

Entering edit mode

Thank you! yes, that is a problem. What gap penalty do you use in localxx() for the correct result (second one)? It looks localms() function should be used. The following is the improved version.

from Bio import pairwise2

seqQuery='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
seqFound='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

# Four parameters (2,-1,-1,-1) in alignment:Identical characters are given 2 points, 1 point is deducted for each non-identical character.
# 1 point is deducted when opening a gap, and 1 point is deducted when extending it.

alignments = pairwise2.align.localms(seqQuery,seqFound, 2, -1, -1, -1)

match = []

for a, b in zip(alignments[0][0],alignments[0][1]):
        if a == b:
                match.append('|')
        else:
                match.append('*')

m="".join(match)
s=[]
s.append(alignments[0][0]+'\n')
s.append(m+'\n')
s.append(alignments[0][1])

alignedSeqs="".join(s)
print('\n')
print(alignedSeqs)

The output is as follows:

HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK
||||||||||||||*||||||||||||||||||*|||||||||||||||||||||||||||||||||||||||||||||*||
HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK

ADD REPLY • link 7.3 years ago by Charles Yin ▴ 180

1

Entering edit mode

Yep that looks to be more reasonable. Opening a gap is usually the most heavily penalised action AFAIK, so that makes for a much more parsimonious/high scoring alignment.

ADD REPLY • link 7.3 years ago by Joe 21k

0

Entering edit mode

I didn't, that was just a mockup I did by hand from the sequence. I think the default gap penalty for BLAST is -10 or -6. Maybe have a look around at what other common parameters use and try that?

ADD REPLY • link 7.3 years ago by Joe 21k

0

Entering edit mode

Thanks! There are a lot of confusions in the alignment functions in Biopython. I currently use Biopython v1.67, and will upgrade to v1.68.

ADD REPLY • link 7.3 years ago by Charles Yin ▴ 180

1

Entering edit mode

The current release is Biopython 1.70, but please do try upgrading since the pairwise code was rewritten in Biopython 1.68

ADD REPLY • link 7.3 years ago by Peter 6.0k

1

Entering edit mode

Have a look in Biopython's API. pairwise2 is complex and allows many options. To get the same results as in EMBOSS' WATER you should use the same parameters, e.g. using align.localds(seq1, seq2, matrix, gap-open-penalty, gap-extend-penalty). In the API is an example how to implement the matrix.

ADD REPLY • link 7.3 years ago by Markus ▴ 320

0

Entering edit mode

Is there a way to make this faster? It seems really slow right now.

ADD REPLY • link 6.3 years ago by biogirl04 ▴ 10

0

Entering edit mode

If you’re looking for speed you’re probably not looking for Biopython.

ADD REPLY • link 6.3 years ago by Joe 21k

score 0 · Answer 2 · 2017-08-01

0

Entering edit mode

7.3 years ago

st.ph.n ★ 2.7k

#!/usr/bin/env python

query='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
target='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

aln = ''

for n in range(len(query)):
    if query[n] == target[n]:
        aln += '.'
    else:
        aln += query[n]

print '{}\t{}'.format('Query', query)
print '{}\t{}'.format('Aln', aln)
print '{}\t{}'.format('Target', target)