Hi,
hsp.query
shows the sequence that you are searching for in the BLAST database
hsp.match
shows the matches between the query and the subject
hsp.subject
is the matched part of the sequence in the BLAST database.
For example,
I obtained following portion of the HSP90 protein and BLASTed against Non-redundant protein sequences (nr)
downloaded the result XML and parsed using the code given in the BioPython section you linked (it also shows what other fields are available for alignments).
301 pmgrgtkvil hlkedqteyl eerrikeivk khsqfigypi tlfvekerdk evsddeaeek
361 edkeeekeke ekesedkpei edvgsdeeee kkdgdkkkkk kikekyidqe elnktkpiwt
421 rnpdditnee ygefyksltn dwedhlavkh fsvegqlefr allfvprrap fdlfenrkkk
I additionally printed hsp.sbjct_start
to see or demonstrate this is a local search and it will only show the part they are aligning. Here is the code and its output:
Code:
from Bio.Blast import NCBIXML
with open('J9TJ99ZH015-Alignment.xml') as result_handle:
blast_record = NCBIXML.read(result_handle)
E_VALUE_THRESH = 0.04
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH and 'NP_001017963.2' in alignment.title:
print('****Alignment****')
print('sequence:', alignment.title)
print('length:', alignment.length)
print('e value:', hsp.expect)
print(hsp.query[0:75] + '...')
print(hsp.match[0:75] + '...')
print(hsp.sbjct[0:75] + '...')
print('Subject start', hsp.sbjct_start)
Output:
Gungors-MacBook-Pro:Desktop gungor$ python blast.py
****Alignment****
sequence: gi|153792590|ref|NP_001017963.2| heat shock protein HSP 90-alpha isoform 1 [Homo sapiens]
length: 854
e value: 4.29656e-116
PMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDKEVSDDEAEEKEDKEEEKEKEEKESE...
PMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDKEVSDDEAEEKEDKEEEKEKEEKESE...
PMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDKEVSDDEAEEKEDKEEEKEKEEKESE...
Subject start 301
As you see it shows from the part I obtained from NCBI. It doesn't show the complete sequence.
Up to this point, I hope the concepts are more clear. Now, to get the full sequence of the matched protein, you might get the accession number from the alignment and search any database using that accession number either through an API or you download all, say, human protein sequences and parse file and extract the full sequence from there.
For the API part, you may again use BioPython for Entrez utilities (just strip the ID, and put it to efetch
, see below):
from Bio import Entrez
Entrez.email = "your@email.com"
handle = Entrez.efetch(db="protein", id="NP_001017963.2", rettype="fasta", retmode="text")
print(handle.read())
And the output is a FASTA file with the full sequence:
NP_001017963.2 heat shock protein HSP 90-alpha isoform 1 [Homo sapiens]
MPPCSGGDGSTPPGPSLRDRDCPAQSAEYPRDRLDPRPGSPSEASSPPFLRSRAPVNWYQEKAQVFLWHL
MVSGSTTLLCLWKQPFHVSAFPVTASLAFRQSQGAGQHLYKDLQPFILLRLLMPEETQTQDQPMEEEEVE
TFAFQAEIAQLMSLIINTFYSNKEIFLRELISNSSDALDKIRYESLTDPSKLDSGKELHINLIPNKQDRT
LTIVDTGIGMTKADLINNLGTIAKSGTKAFMEALQAGADISMIGQFGVGFYSAYLVAEKVTVITKHNDDE
QYAWESSAGGSFTVRTDTGEPMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDK
EVSDDEAEEKEDKEEEKEKEEKESEDKPEIEDVGSDEEEEKKDGDKKKKKKIKEKYIDQEELNKTKPIWT
RNPDDITNEEYGEFYKSLTNDWEDHLAVKHFSVEGQLEFRALLFVPRRAPFDLFENRKKKNNIKLYVRRV
FIMDNCEELIPEYLNFIRGVVDSEDLPLNISREMLQQSKILKVIRKNLVKKCLELFTELAEDKENYKKFY
EQFSKNIKLGIHEDSQNRKKLSELLRYYTSASGDEMVSLKDYCTRMKENQKHIYYITGETKDQVANSAFV
ERLRKHGLEVIYMIEPIDEYCVQQLKEFEGKTLVSVTKEGLELPEDEEEKKKQEEKKTKFENLCKIMKDI
LEKKVEKVVVSNRLVTSPCCIVTSTYGWTANMERIMKAQALRDNSTMGYMAAKKHLEINPDHSIIETLRQ
KAEADKNDKSVKDLVILLYETALLSSGFSLEDPQTHANRIYRMIKLGLGIDEDDPTADDTSAAVTEEMPP
LEGDDDTSRMEEVD
Thanks so much dude! Really appreciate the well structured and thorough answer. Have a nice day.