Question

Biopython pairwise sequence alignment behavior change between 1.59 and 1.76

0

Entering edit mode

4.8 years ago

sanner ▴ 20

Hello

I am in the process of porting code from Python 2.7 to 3.7 and I am running into some differences in BioPython

The following 3 lines of code:

from Bio import pairwise2
aln = pairwise2.align.globalxx('EPQYEEIPIYL', 'EPQ?EEIPIYL', one_alignment_only=True)[0]
print(aln)

on Python 2.7 with Bio v 1.59 yield:

('EPQYEEIPIYL', 'EPQ?EEIPIYL', 10.0, 0, 11)

while on 3.7 (using miniconda, Bio v 1.76) I get

('EPQY-EEIPIYL', 'EPQ-?EEIPIYL', 10.0, 0, 12)

creating a spurious gap in the alignment

Thanks for any help with that

-Michel

alignment software error sequence • 3.3k views

ADD COMMENT • link 4.8 years ago by sanner ▴ 20

0

Entering edit mode

not sure what is your question, but seems it's an issue to report to BioPython developers

ADD REPLY • link 4.8 years ago by JC 13k

0

Entering edit mode

1.59 is pretty ancient, and the entire pairwise2 module was overhauled in, I think, 1.7. I'm not sure how it handles ? Characters now, have you tried without the ?. I vaguely recall possibly needing to specify the ambiguous character.

ADD REPLY • link 4.8 years ago by Joe 22k

0

Entering edit mode

Thanks Joe I replaced ? by X which should stand for 'ANY' in the FASTA format but I get the same result.

JC, My issue is that I expect pairwise2.align.globalxx('EPQYEEIPIYL', 'EPQXEEIPIYL') to return a 1 to one mapping with Y mapped to X as it use to do in 1.7 instead of creating a insertion in both chains to avoid this mapping from occuring.

-Michel

ADD REPLY • link 4.8 years ago by sanner ▴ 20

0

Entering edit mode

You could try updating again to 1.8. I haven't looked at changelogs, but perhaps something got broken in 1.78. Otherwise this is probably an issue to report on their GitHub

ADD REPLY • link 4.8 years ago by Joe 22k

1

Entering edit mode

I tried with 1.9 and get the same behavior so I did submit an issue to github. Thanks

ADD REPLY • link 4.8 years ago by sanner ▴ 20

GenoMax · Answer 1 · 2020-10-09

Markus Piotrowski gave a nice explanation and solution

The aim of a pairwise alignment algorithm is to achieve an alignment with the best score (or the least costs). From the view of the algorithm (here Needleman-Wunsch-Gotoh) the situation "one gap in seqA followed by one gap in seqB" is not forbidden. The fact that the older implementation of this algorithm in Biopython's pairwise2 did not allow this, was a shortcoming and was criticized e.g. in this paper. That was the reason for re-writing pairwise2 for Biopython 1.68 (in 2016) (with the side effect, however, unrelated to the original issue, of increased performance regarding speed and memory usage). As you see, the alignment with the "spurious insertion" in your example has the same score as the other one, and in fact, the other alignment is also produced (would have index [1]).

This behavior is documented in the docstring of the module which can be read in our API documentation, in detail:

Depending on the penalties, a gap in one sequence may be followed by a gap in the other sequence. If you don’t like this behaviour, increase the gap-open penalty:

>>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1):
    print(format_alignment(*a))

A-

-T
  Score=-2

>>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1):
    print(format_alignment(*a))
A
.
T
  Score=-4

This also explains how you can prevent this: by increasing the gap penalty (two gaps should be more expensive than one mismatch). In fact, that is what most "standard settings" in alignment programs do.

Your example shows an amino acid alignment and I doubt that globalxx comes close to meaningful alignments in such cases (with 'real' proteins). You might consider using a substitution matrix:

>>> from Bio import pairwise2 as pw
>>> from Bio.Align import substitution_matrices

Warning (from warnings module): ...

>>> blosum62 = substitution_matrices.load("BLOSUM62")
>>> aln = pw.align.globalds("EPQYEEIPIYL", "EPQ*EEIPIYL", blosum62, -10, -0.5)[0]  # replacing your ? with a * which is in the matrix
>>> print(pw.format_alignment(*aln))
EPQYEEIPIYL
|||.|||||||
EPQ*EEIPIYL
  Score=49

Alternatively, if you want to keep the simple matching scheme, use globalms to set different gap parameters:

>>> aln = pw.align.globalms("EPQYEEIPIYL", "EPQ?EEIPIYL", 1, -1, -2, -.5)[0]
# identical to: aln = pw.align.globalms("EPQYEEIPIYL", "EPQ?EEIPIYL", match=1, mismatch=-1, open=-2, extend=-.5)[0]
>>> print(pw.format_alignment(*aln))
EPQYEEIPIYL
|||.|||||||
EPQ?EEIPIYL
  Score=9

Finally, if you are doing many pairwise alignments, you should consider using our new PairwiseAligner which has a much better performance and other advantages (the latter because it is implemented as a class).