NcbiblastpCommandline alignment results are different from blast webpage
4
0
Entering edit mode
2.7 years ago

Hi I used this code to blast alignment between my fasta files.

seq1 = SeqIO.read(fasta_file1, "fasta")   
seq2 = SeqIO.read(fasta_file2, "fasta")   
output = NcbiblastpCommandline(query=seq1, subject=seq2, outfmt=0)()[0]

But the results are completely different from what I obtain from blast webpage. (below link)

https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&BLAST_SPEC=GlobalAln&LINK_LOC=blasttab

for example the alignment results between A0A023W3H0.fasta and A0A0H2ZGX7.fasta files with webpage is :

Identities =44/756(6%), Positives = 73/756(9%), Gaps = 529/756(69%)

But with my code is

Identities = 6/14 (43%), Positives = 8/14 (57%), Gaps = 0/14 (0%)

What should I do? Thanks

alignment biopython blastp • 3.8k views
ADD COMMENT
0
Entering edit mode

Any another suggestion? Anybody can help me?

ADD REPLY
0
Entering edit mode

So this issue is not fixed? Why did you accept the answer from Mensur? Accepting an answer indicates that your problem has been fixed.

You will need to provide the source files (fasta files you are using) for people to try and reproduce the problem. Use pastebin.com to post sequences.

ADD REPLY
0
Entering edit mode

Unfortunately the problem still exists. Excuse me. I though accepting answer means I read and tried it! However, I posted the link of two of fasta file to easily download them from uniprot site. The links are:

https://www.uniprot.org/uniprot/A0A0E1R3H3.fasta and https://www.uniprot.org/uniprot/A0A0H2UQB5.fasta

I tried all codes with my sequences and no changes happened in the results. Even I tried to change format of sequences in order to fix the problem but I didn't succeed. What should I do? Thanks

ADD REPLY
0
Entering edit mode

I though accepting answer means I read and tried it!

No. Accepting an answer means it actually solved the problem described in original post.

ADD REPLY
4
Entering edit mode
2.7 years ago
Mensur Dlakic ★ 28k

What you are trying to do is fairly simple, and you are complicating it by: 1) not providing your sequences so that someone can reproduce your attempt; 2) giving a result in a form that is impossible to read. Be honest, can you make any sense of the result you posted above?

Here are my two sequences:

::::::::::::::
p13_hs.fas
::::::::::::::
>NP_001018025.1 protein p13 MTCP-1 [Homo sapiens]
MAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQQIQVPLGDAARPSHLLTSQLPLMWQ
LYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD
::::::::::::::
p13_mi.fas
::::::::::::::
>XP_019483182.1 PREDICTED: protein p13 MTCP-1 [Hipposideros armiger]
MSGEDVGPPPDHLWVHQEGIYRDEYQRTWVAVLEEDTNFLRARVQQVQVPLGDAARPSHLLTSQLPLMWQ
LYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD

After I paste them in, I select all the lines I just pasted, and click on the button above that says 101/010 and they get formatted in a way that is legible.

Here is my code, and I will also format it as above so it can be read:

from Bio.Blast.Applications import NcbiblastpCommandline
from Bio import SeqIO

fasta_file1 = 'p13_hs.fas'
fasta_file2 = 'p13_mi.fas'
seq1 = SeqIO.read(fasta_file1, "fasta")
seq2 = SeqIO.read(fasta_file2, "fasta")
output = NcbiblastpCommandline(query=fasta_file1, subject=fasta_file2, outfmt=0)()[0]
print(output)

And finally here is the output from running the above script:

BLASTP 2.11.0+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.



Database: User specified sequence set (Input: p13_mi.fas).
           1 sequences; 107 total letters



Query= NP_001018025.1 protein p13 MTCP-1 [Homo sapiens]

Length=107
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

XP_019483182.1 PREDICTED: protein p13 MTCP-1 [Hipposideros armiger]   210     1e-77


> XP_019483182.1 PREDICTED: protein p13 MTCP-1 [Hipposideros armiger]
Length=107

 Score = 210 bits (535),  Expect = 1e-77, Method: Compositional matrix adjust.
 Identities = 101/107 (94%), Positives = 106/107 (99%), Gaps = 0/107 (0%)

Query  1    MAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQQIQVPLGDAARPSHL  60
            M+GEDVG PPDHLWVHQEGIYRDEYQRTWVAV+EE+T+FLRARVQQ+QVPLGDAARPSHL
Sbjct  1    MSGEDVGPPPDHLWVHQEGIYRDEYQRTWVAVLEEDTNFLRARVQQVQVPLGDAARPSHL  60

Query  61   LTSQLPLMWQLYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD  107
            LTSQLPLMWQLYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD
Sbjct  61   LTSQLPLMWQLYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD  107



Lambda      K        H        a         alpha
   0.321    0.136    0.431    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6

Effective search space used: 9025


  Database: User specified sequence set (Input: p13_mi.fas).
    Posted date:  Unknown
  Number of letters in database: 107
  Number of sequences in database:  1



Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40

So if something like this doesn't work for you, it is either because the sequences are not formatted properly (we don't know, because you never showed us), or maybe your BioPython version is not recent (we don't know, you never told us). Other than that, the script should do what is expected.

ADD COMMENT
0
Entering edit mode

You are right. I use fasta files that | downloaded from "https://www.uniprot.org/uniprot/" and use the address of downloaded files in NcbiblastpCommandline() directly.

For example A0A0E1R3H3.fasta file (downloaded form https://www.uniprot.org/uniprot/A0A0E1R3H3.fasta):

>tr|A0A0E1R3H3|A0A0E1R3H3_LISMN Penicillin-binding protein 3 
OS=Listeria monocytogenes serotype  4b str. LL195 OX=1230340 GN=pbpC PE=3 SV=1 
MKGCIMASYGGKKRNNKKAIIIGSIAAVAVLAGIAIYFFIQNQHKDEKNALAAAETFTSN
IAKEKYDKLGSSVSSESLKKVEFTKKEMEDKYQAVYDGIGAKDIKVKNLKSVYDDKENKF 
NLTYELEMRTSLGKLATQKYKTTISKQDDDWKIDWKPALIFPGMVKTDKVRITEDEAERG 
QIVDRNGNPLATTGQFAEAGVVPSKLGEGDEKVKNIADISKKLEVSTDYINKQLDQKWVQ 
ADSFVPLVTLNKDNLPEATGLTYAQKEMRTYPLNEATSHLIGYVGEVSAEDIEKNPKLSV 
GDVIGKSGLERYYDKQLRGKNGGAIKIINDQTKQEDTLQKIDKKDGEEIKLTIDAAVQKK 
AFDSLGSETGAVTMINPTNGELLALVSTPSYDANQMVLGITSEDYAKYNDDKRLPFLARY 
ANRYAPGSTFKTITATIGLDTGVTKPDKVREISGLQWQKDASWGKYFVTRVHDVPKVNMT 
DALVHSDNIYFAQEGLEIGKDKLTAGLKKFDFDKEYNLPFTMKPAQISNDGLNSEILLAD 
TSYGQGELLMSPIQQAIAYSAIASDGKMPYPKLDTKEKAGKETQATEAASANQVKAALVK 
TVSDPAGTAHALQIQGHSIAAKTGTAELKEKQGEDGLENGFVYAFDADNPNYLMVGMIEN 
VKGRGGSGLVIDKLKPVIESMYK

A0A0H2UQB5.fasta file (downloaded from https://www.uniprot.org/uniprot/A0A0H2UQB5.fasta):

>tr|A0A0H2UQB5|A0A0H2UQB5_STRPN LicC protein 
OS=Streptococcus pneumoniae serotype 4 (strain ATCC BAA-334 / TIGR4)    
OX=170187 GN=licC PE=1 SV=1 
MKAIILAAGLGTRLRPMTENTPKALVQVNQKPLIEYQIEFLKEKGINDIIIIVGYLKEQF 
DYLKEKYGVRLVFNDKYADYNNFYSLYLVKEELANSYVIDADNYLFKNMFRNDLTRSTYF 
SVYREDCTNEWFLVYGDDYKVQDIIVDSKAGRILSGVSFWDAPTAEKIVSFIDKAYVSGE 
FVDLYWDNMVKDNIKELDVYVEELEGNSIYEIDSVQDYRKLEEILKNEN

And code is:

output = NcbiblastpCommandline(query=A0A0E1R3H3_FileAdress, subject=A0A0H2UQB5_fileAdress, outfmt=0)()[0]

and I tried this :

output =NcbiblastpCommandline(query=A0A0E1R3H3_FileAdress, subject=A0A0H2UQB5_fileAdress, ungapped=False)()

However the results with code is:

 Query= tr|A0A023W3H0|A0A023W3H0_KLEPN Class D OXA-48 carbapenemase (Fragment) 
 OS=Klebsiella pneumoniae OX=573 GN=OXA-48 PE=4 SV=1

  Length=145
                                            Score  (Bits)       E Value 
Sequences producing significant alignments:  13.5             6.0                 

 tr|O08355|O08355_PSEFL Mannitol dehydrogenase OS=Pseudomonas fluo...         


 > tr|O08355|O08355_PSEFL Mannitol dehydrogenase OS=Pseudomonas  fluorescens 
OX=294 GN=mtlD PE=1 SV=1 Length=493

 Score = 13.5 bits (23),  Expect = 6.0, Method: Composition-based stats. 
 Identities = 11/31 (35%), Positives = 12/31 (39%), Gaps = 3/31 (10%)

  Query  105  RIVKQAMLTEANGDYIIRAKTGYSTRIEPKI  135
               RIV    LT   G Y I    G      P+I 
  Sbjct  125  RIVS---LTITEGGYCIDDSNGEFMAHLPQI  152


  Score = 13.1 bits (22),  Expect = 6.6, Method: Composition-based stats.
  Identities = 3/6 (50%), Positives = 4/6 (67%), Gaps = 0/6 (0%)

  Query  132  EPKIGW  137
              EP + W Sbjct  259  EPFVQW  264



   Lambda      K        H         a      alpha
   0.320    0.135    0.421    0.792     4.96 

   Gapped Lambda   K        H     a      alpha   sigma
        0.267   0.0410   0.140   1.90     42.6     43.6 

    Effective search space used: 56160
ADD REPLY
0
Entering edit mode
2.7 years ago
sure ▴ 110

There is a length discrepancy between the 2 methods. The online option is suggesting that the sequence is 756 whereas the other one is suggesting 14. I would investigate the seq1 and seq2 objects in python to make sure they are of the correct length and also the input fasta files. Hope this helps.

ADD COMMENT
0
Entering edit mode

Thank you very much for replying. Of course I noticed this difference and tried to fix it but I couldn't. In offline mode it shows the real length of two sequences but unfortunately use a short length for alignment. what can I do? thanks

ADD REPLY
0
Entering edit mode
2.7 years ago
Mensur Dlakic ★ 28k

It appears that your local command is creating an ungapped alignment, while the remote will always create a gapped alignment by default. See if this command helps:

output = NcbiblastpCommandline(query=seq1, subject=seq2, ungapped=False, outfmt=0)()[0]

By the way, you would have made this easier to troubleshoot by posting alignments from both commands.

ADD COMMENT
0
Entering edit mode

Thank you Mensur Dlakic for response. I tried it but the results didn't change properly.

For Example for A0A023W3H0.fasta and A0A0H2UQB5.fasta the Web Results :

Query: tr|A0A023W3H0|A0A023W3H0_KLEPN Class D OXA-48 carbapenemase (Fragment) OS=Klebsiella pneumoniae 
OX=573 GN=OXA-48 PE=4 SV=1 Query ID: lcl|Query_551911 Length: 145


>tr|A0A0H2UQB5|A0A0H2UQB5_STRPN LicC protein 
OS=Streptococcus pneumoniae serotype 4 (strain ATCC BAA-334 / TIGR4) 
OX=170187 GN=licC PE=1 SV=1 Sequence ID: Query_551913 Length: 229 Range 1: 1 to 229

<table class=alnParams> <caption class=hdnHeader> 

  NW Score = -109,Identities = 28/236(12%), Positives = 62/236(26%), Gaps =98/236(41%)

whereas the code result:

 BLASTP 2.13.0+
Query= tr|A0A023W3H0|A0A023W3H0_KLEPN Class D OXA-48 carbapenemase
 (Fragment) OS=Klebsiella pneumoniae OX=573 GN=OXA-48 PE=4 SV=1

> tr|A0A0H2UQB5|A0A0H2UQB5_STRPN LicC protein OS=Streptococcus pneu.. 14.6    1.1  

 > tr|A0A0H2UQB5|A0A0H2UQB5_STRPN LicC protein 
OS=Streptococcus  pneumoniae serotype 4 (strain ATCC BAA-334 / TIGR4) 
OX=170187  GN=licC    PE=1 SV=1 Length=229

  Score = 14.6 bits (26),  Expect = 1.1, Method: Compositional matrix adjust. 
 Identities = 9/23 (39%), Positives = 13/23 (57%), Gaps = 4/23 (17%)

Any other suggestion? Thanks

ADD REPLY
0
Entering edit mode

laila.jafari : Please do not use " button in edit bar when formatting parts of post that needs to stay mono-spaced or aligned. Your post is impossible to format (I tried it) and as posted it is difficult to see what the result looks like. Please edit the post above and use 101010 button to format the text.

ADD REPLY
0
Entering edit mode

Hi, I edited the whole text. Hope the problem is fixed.

ADD REPLY
0
Entering edit mode

Not it is not. Compare your post to the one from Mensur. Even the result that you posted needs to be formatted with 101 button.

ADD REPLY
0
Entering edit mode

Ok, thank you very much. I am beginner and it is my first post in this webpage. so I apologize you for this mistakes.

ADD REPLY
0
Entering edit mode

No problem. I think there is something about the way you copy and pasted the results. For example this sections looks like

> tr|O08355|O08355_PSEFL Mannitol dehydrogenase OS=Pseudomonas  fluorescens OX=294 GN=mtlD PE=1 SV=1 Length=493

 Score = 13.5 bits (23),  Expect = 6.0, Method: Composition-based stats.  Identities = 11/31 (35%), Positives = 12/31 (39%), Gaps = 3/31 (10%)

Query  105  RIVKQAMLTEANGDYIIRAKTGYSTRIEPKI  135
            RIV    LT   G Y I    G      P+I Sbjct  125  RIVS---LTITEGGYCIDDSNGEFMAHLPQI  152


 Score = 13.1 bits (22),  Expect = 6.6, Method: Composition-based stats.  Identities = 3/6 (50%), Positives = 4/6 (67%), Gaps = 0/6 (0%)

Query  132  EPKIGW  137
            EP + W Sbjct  259  EPFVQW  264

Where as it needs to look like this to see the alignments properly.

> tr|O08355|O08355_PSEFL Mannitol dehydrogenase OS=Pseudomonas  fluorescens OX=294 GN=mtlD PE=1 SV=1 Length=493

 Score = 13.5 bits (23),  Expect = 6.0, Method: Composition-based stats.  Identities = 11/31 (35%), Positives = 12/31 (39%), Gaps = 3/31 (10%)

Query  105  RIVKQAMLTEANGDYIIRAKTGYSTRIEPKI  
       135  RIV    LT   G Y I    G      P+I 
Sbjct  125  RIVS---LTITEGGYCIDDSNGEFMAHLPQI  152

 Score = 13.1 bits (22),  Expect = 6.6, Method: Composition-based stats.  Identities = 3/6 (50%), Positives = 4/6 (67%), Gaps = 0/6 (0%)
Query  132  EPKIGW  
       137  EP + W 
Sbjct  259  EPFVQW  264
ADD REPLY
0
Entering edit mode

Thank you for your help. Hope all mistakes fixed.

ADD REPLY
0
Entering edit mode
2.7 years ago
Mensur Dlakic ★ 28k

I admire your persistence, but really: is this so important that two weeks later you still can't let go and move on to the next problem?

But since you are so persistent, here is a solution to your problem. When you want to do 2 sequence alignment with BLAST, this is the link. When you go there, paste your two sequences and increase the E-value to 10, you will get exactly the same alignment as with the local blast version.

What you have linked above is a global alignment using Needleman-Wunsch algorithm, which is different from what BLAST does ("LA" in BLAST stands for local alignment). That is why the results are different because your local BLAST copy is doing a local alignment, and the link above is doing a global alignment.

ADD COMMENT
0
Entering edit mode

Thank you for your appreciate. I persist because I need it's results for my PhD project. I know I can use this link but I need about 3500,000 alignments that needs a code to do it! Even with web-scarping, it takes about one year to get results! So I need offline code. I thought BLAST is global! I need global alignment and I couldn't find proper code that can get me the results of the second link.

ADD REPLY
3
Entering edit mode

I need global alignment and I couldn't find proper code that can get me the results of the second link.

You can use the program needle from EMBOSS suite. There is another program called stretcher in EMBOSS suite that can also be used. You can download EMBOSS suite locally.

ggsearch program from FASTA suite (LINK) (not to be confused with Fasta format) generates global alignments. You can download FASTA suite.

NW-align can be downloaded locally: https://zhanggroup.org/NW-align/

ADD REPLY
0
Entering edit mode

Hi. I couldn't open any download link in EMBOSS suite website! But I could use Fasta suite link. Thank you very much

ADD REPLY

Login before adding your answer.

Traffic: 4433 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6