Different alignment results between Emboss Needle and Biopython pairwise2
1
5
Entering edit mode
6.7 years ago
markd ▴ 60

I'm trying to figure out why Emboss Needle and Biopython pairwise2.align.global give different results.

I start with two sequences that are difficult to align (generated randomly)

seq 1: CATTTTTTGAACAATCAGATCAGAAGCACCTTCGGGATAATCCTCACGGTTCTTGGGCATACGCCTCATGGTAAC
seq 2: ATGGAGCAGTGGGTTGTTACAGATCTCATAATCAGGGAACAGGTCTGGATAATGCCCCATCTCCACGATATCCCG

Emboss needle gives the following alignment, with parameters gapopen=10, gapextend=0.5, datafile=EDNAFULL

------------------AATAGTGTACCTGGCC----GGA-TCTTTATTCCTTCCAGGACAACT-----------CA---CATCTGCCCCACTACCCTAAACAAAATTTAA
CATTTTTGGAGTCACGTCAACAG---ACC--GCCCGGGGGATTCGTGACTAATCGCAGGA----TGGAGCGTGGAGCAGGCCAT----------------------------

However, I cannot get pairwise2.align.global to recreate this alignment. I have tried using globalds with the same parameters, and a dictionary version of EDNAFULL. I've also played with the penalize_extend_when_opening and penalize_end_gaps with no success. Here is a sample alignment from the pairwise2 output and both of the flags previously mentioned set to false.

------CA-----TTTTTTGAACA-ATCAGATCAGAAGCACCTTCGGGATAATCCTCACGGTTCTTGGGCATAC-GCCTCAT------GGTAAC---
ATGGAGCAGTGGGTTGTT---ACAGATC---TCATAATCA-----GGGA--------ACAGGTCT---GGATAATGCCCCATCTCCACGATATCCCG

Both of these programs are just meant to implement the Needleman-Wunsch algorithm, so why are they different and which one should I trust?

alignment Needleman–Wunsch biopython • 6.1k views
ADD COMMENT
3
Entering edit mode

I'm also getting different results.

Biopython:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

seq1 = 'GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACGTAGAGCTGCCTTGTAACGTGGGAGTGAAA'
seq2 = 'ACGATCTATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTTAGC'

alignments = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -0.5)
print(format_alignment(*alignments[0]))

Alignment:

-----------GGCATGTACCTGTCGTGTGCGTT-------------------GTATCC-----------CCCACGCTTGATAGACGTAGAGCTGCCTTGTAACGTGGGAGTGAAA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ACGATCTATGGGGC---TAACT----TCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTT-----------AGC-----------------------
  Score=-0.5

Emboss needle:

needle -asequence seq1.fa -bsequence seq2.fa -gapopen 10 -gapextend 0.5 -outfile aln.needle

Alignment:

########################################
# Program: needle
# Rundate: Tue  3 Apr 2018 14:12:49
# Commandline: needle
#    -auto
#    -stdout
#    -asequence emboss_needle-I20180403-141248-0665-8944238-pg.asequence
#    -bsequence emboss_needle-I20180403-141248-0665-8944238-pg.bsequence
#    -datafile EDNAFULL
#    -gapopen 10.0
#    -gapextend 0.5
#    -endopen 10.0
#    -endextend 0.5
#    -aformat3 pair
#    -snucleotide1
#    -snucleotide2
# Align_format: pair
# Report_file: stdout
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: EMBOSS_001
# 2: EMBOSS_001
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 118
# Identity:      22/118 (18.6%)
# Similarity:    22/118 (18.6%)
# Gaps:          86/118 (72.9%)
# Score: 47.5
# 
#
#=======================================

EMBOSS_001         1 GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACG----     46
                                                                |||    
EMBOSS_001         1 -------------------------------------------ACGATCT      7

EMBOSS_001        47 -TAGAGCTGCCTTGTAACGTGGGAG--TGAAA------------------     75
                      |.|.|||..|||.|.|..|.||||  |.|||                  
EMBOSS_001         8 ATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACA     57

EMBOSS_001        76 ------------------     75

EMBOSS_001        58 ACGACACCCACCCTTAGC     75


#---------------------------------------
#---------------------------------------
ADD REPLY
0
Entering edit mode

Thanks for looking into this!

Unfortunately, while those parameters do happen to work for those sequences, its trivial to find sequences that don't match given those parameters.

For example

seq_1 = 'GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACGTAGAGCTGCCTTGTAACGTGGGAGTGAAA'
seq_2 = 'ACGATCTATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTTAGC'

Produces different alignments again.

Btw, I'm just generating the strings with

seq_1 = ''.join([random.choice('ATCG') for _ in range(75)])
seq_2 = ''.join([random.choice('ATCG') for _ in range(75)])
ADD REPLY
0
Entering edit mode

Wooo, you're right. I'm moving my answer as comment. I think the difference in alignments and scores may come from the fact that needle has two additional parameters (not present in biopython): endopen and endextend.

ADD REPLY
0
Entering edit mode

I explored that a little too, but by default the -endweight parameter is set to False, which means endopen/endextend shouldn't make any difference. Also as I said in the post, I tried playing with the penalize_extend_when_opening and penalize_end_gaps parameters in biopython but still couldn't get the alignments the same.

ADD REPLY
0
Entering edit mode

Hello, as I am a green hand in programming, I wonder how to get the result like :

Length: 118

Identity: 22/118 (18.6%)

Similarity: 22/118 (18.6%)

Gaps: 86/118 (72.9%)

Score: 47.5

Is it get from the python on the window system or it is the result from linux system(with shell or something like that?) Thank you !

ADD REPLY
0
Entering edit mode

Please just ask a new question instead of adding a tangent to an existing one. Please also note, if you do open a new thread we require much more information, and you should show some effort you’ve made to solve the problem.

ADD REPLY
2
Entering edit mode

Tagging Dr. Peter Cock: Peter

ADD REPLY
0
Entering edit mode

What version of Biopython are you on?

The pairwise module was re-written in (I think) 1.68, but it may be later. You'll get different results with the rewritten version to the old version. Peter would know for sure.

ADD REPLY
0
Entering edit mode

I am on 1.71 master.

ADD REPLY
0
Entering edit mode

Have you also tried the variations on global? e.g. globalxx, globalms, globalmx etc etc.

I've never really delved in to/understood the differences, but I know there are multiple options for local and global

ADD REPLY
0
Entering edit mode

Some of those alternatives (I think the ones ending in -x) allow you to define custom scoring functions.

ADD REPLY
0
Entering edit mode

I'm using globalds with the EDNAFULL matrix converted into a dictionary for the match_dict parameter. This should be the most accurate way to reproduce Emboss Needle I think.

Below is my full code to translate the EDNAFULL matrix

EDNAFullLetters = ['A', 'T', 'G', 'C', 'S', 'W', 'R', 'Y', 'K', 'M', 'B', 'V', 'H', 'D', 'N', 'U']
EDNAFullScores = [
    [5,-4,-4,-4,-4, 1, 1,-4,-4, 1,-4,-1,-1,-1,-2,-4],
    [-4, 5,-4,-4,-4, 1,-4, 1, 1,-4,-1,-4,-1,-1,-2, 5],
    [-4,-4, 5,-4, 1,-4, 1,-4, 1,-4,-1,-1,-4,-1,-2,-4],
    [-4,-4,-4, 5, 1,-4,-4, 1,-4, 1,-1,-1,-1,-4,-2,-4],
    [-4,-4, 1, 1,-1,-4,-2,-2,-2,-2,-1,-1,-3,-3,-1,-4],
     [1, 1,-4,-4,-4,-1,-2,-2,-2,-2,-3,-3,-1,-1,-1, 1],
     [1,-4, 1,-4,-2,-2,-1,-4,-2,-2,-3,-1,-3,-1,-1,-4],
    [-4, 1,-4, 1,-2,-2,-4,-1,-2,-2,-1,-3,-1,-3,-1, 1],
    [-4, 1, 1,-4,-2,-2,-2,-2,-1,-4,-1,-3,-3,-1,-1, 1],
     [1,-4,-4, 1,-2,-2,-2,-2,-4,-1,-3,-1,-1,-3,-1,-4],
    [-4,-1,-1,-1,-1,-3,-3,-1,-1,-3,-1,-2,-2,-2,-1,-1],
    [-1,-4,-1,-1,-1,-3,-1,-3,-3,-1,-2,-1,-2,-2,-1,-4],
    [-1,-1,-4,-1,-3,-1,-3,-1,-3,-1,-2,-2,-1,-2,-1,-1],
    [-1,-1,-1,-4,-3,-1,-1,-3,-1,-3,-2,-2,-2,-1,-1,-1],
    [-2,-2,-2,-2,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-2],
    [-4, 5,-4,-4,-4, 1,-4, 1, 1,-4,-1,-4,-1,-1,-2, 5]]

EDNAFull = dict()
for i, a in enumerate(EDNAFullLetters):
    for j, b in enumerate(EDNAFullLetters):
        EDNAFull[(a, b)] = EDNAFullScores[i][j]

for aln in pairwise2.align.globalds(seq_1, seq_2, EDNAFull, -10, -0.5):
    print pairwise2.format_alignment(*aln)
ADD REPLY
0
Entering edit mode

I'm using BioPython 1.70.

ADD REPLY
3
Entering edit mode
5.8 years ago
Markus ▴ 320

Sorry for the late answer, I didn't see this before. I would recommend to send such things to Biopython's Github site.

The standard in EMBOSS' Needle is not to penalize end gaps (END GAP PENALTY: FALSE). This means: Gaps at the beginning and the end of the alignment are 'for free'. You must do the same in Biopython's pairwise2 with the keyword parameter penalize_end_gaps=False.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

seq1 = 'GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACGTAGAGCTGCCTTGTAACGTGGGAGTGAAA'
seq2 = 'ACGATCTATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTTAGC'

alignments = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -0.5,
                                      penalize_end_gaps=False)
print(format_alignment(*alignments[0]))

Then you get the same result as in EMBOSS' Needle:

GGCATGTACCTGTCGTGTGCGTTGTATCCCCCACGCTTGATAGACG-----TAGAGCTGCCTTGTAACGTGGGAG--TGAAA------------------------------------
                                           |||     |.|.|||..|||.|.|..|.||||  |.|||                                    
-------------------------------------------ACGATCTATGGGGCTAACTTCTCAGTTAGGAGGCTAAAACCTACAAATAACCCCACAACGACACCCACCCTTAGC
  Score=47.5

You must not change penalize_extend_when_opening, this parameter is important for how a gap opening is penalized: Only gap-open penalty (default, also in EMBOSS), or gap-open penalty + gap-extension penalty (unusual).

ADD COMMENT

Login before adding your answer.

Traffic: 1672 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6