The percentage of matches between two sequences is possibly not what you want; you are very likely interested in the percentage of matches between two aligned sequences. That raises the question of how to align your sequences (which algorithm to use). Consider this case:
seq_1: ATGGATCATTGA
seq_2: ------CATTGA
Is seq_2 100% identical to seq_1? Would you say the same the other way round?
However, here is an example that shows how you may use Biopython to address your problem (of course you need to have Python and Biopython installed):
from __future__ import print_function
from Bio import pairwise2 as pw2
first_seq = 'ATGGATCATTGA'
second_seq = 'CATTGA'
global_align = pw2.align.globalxx(first_seq, second_seq)
print(global_align[0])
The output is:
('ATGGATCATTGA', '------CATTGA', 6.0, 0, 12)
pw2.align.globalxx makes an optimal global alignment between your two sequences, where every match counts 1 point, while mismatches and insertions/deletions cost nothing.
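To see what this scoring scheme means in practice: with 1 point per match and no cost for mismatches or gaps, the optimal global score equals the length of the longest common subsequence of the two sequences. The following is a minimal pure-Python sketch of that idea. The function name globalxx_score is made up for illustration, and this is not how Biopython implements the alignment internally.

```python
from __future__ import print_function


def globalxx_score(seq_a, seq_b):
    """Optimal global alignment score with match = 1 and
    mismatch = gap = 0, computed as the longest common
    subsequence length (hypothetical helper, not a Biopython call)."""
    # prev[j] holds the best score for the previous row of the DP table
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            if a == b:
                # a match extends the best alignment of both prefixes
                curr.append(prev[j - 1] + 1)
            else:
                # otherwise skip a residue in one of the sequences (free gap)
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(globalxx_score('ATGGATCATTGA', 'CATTGA'))  # 6
```

The result agrees with the score of 6.0 in the output shown above.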
print(global_align[0]) prints the first alignment (there may be several different alignments with the same score). The third element of the tuple (the first number) is the number of matching residues; the other two numbers are the beginning and the end of the alignment. So the alignment has a length of 12 (which, in this case, is also the length of the longer sequence, but usually the alignment will be longer than each of the two sequences), there are 6 matching residues, and the shorter sequence has a length of 6. So you may calculate the percentage relative to the shorter sequence as follows:
seq_length = min(len(first_seq), len(second_seq))
matches = global_align[0][2]
percent_match = (matches / seq_length) * 100
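The calculation above divides by the length of the shorter sequence, which is one of several possible choices and ties back to the question of whether seq_2 is 100% identical to seq_1. As a rough sketch, the helper below (a hypothetical percent_identity function, not part of Biopython) takes the two gapped strings of an alignment and lets you pick the denominator explicitly. It assumes '-' is the gap character and that both sequences contain at least one residue.

```python
from __future__ import print_function


def percent_identity(aligned_a, aligned_b, denominator="shorter"):
    """Percent identity of two gapped, equal-length alignment strings.

    denominator: "shorter", "longer" (ungapped sequence lengths)
    or "alignment" (number of alignment columns).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # count columns where both residues agree and neither is a gap
    matches = sum(1 for a, b in zip(aligned_a, aligned_b)
                  if a == b and a != "-")
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    lengths = {
        "shorter": min(len_a, len_b),
        "longer": max(len_a, len_b),
        "alignment": len(aligned_a),
    }
    return 100.0 * matches / lengths[denominator]


a, b = 'ATGGATCATTGA', '------CATTGA'
print(percent_identity(a, b, "shorter"))    # 100.0
print(percent_identity(a, b, "longer"))     # 50.0
print(percent_identity(a, b, "alignment"))  # 50.0
```

Which denominator is appropriate depends on the question you are asking, so it is worth stating it whenever you report a percentage.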
The Distance library for Python also has an implementation of hamming distance and some other metrics which I've found useful in the past.
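For reference, the Hamming distance is simply the number of positions at which two equal-length strings differ, so it only applies to sequences of the same length (or to already aligned sequences). Here is a minimal pure-Python sketch of the metric itself; it does not use or reproduce the Distance library's API, and the function name is chosen for illustration.

```python
from __future__ import print_function


def hamming_distance(seq_a, seq_b):
    """Number of positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


first = 'ATGGATCATTGA'
second = 'ATGCATCATTGT'
dist = hamming_distance(first, second)
print(dist)                                # 2
print(100.0 * (1 - dist / float(len(first))))  # percent of identical positions
```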
I think it could even be better than some of my suggestions.
The 'problem' is that needle is not a Biopython library. Biopython only supports command-line access to an existing EMBOSS installation (which contains needle), so you need to install EMBOSS beforehand.
Dear Medhat, thank you for your response. Does needle from Biopython work for two sequences of different lengths? (My query sequences are around 400 nucleotides and the target sequence is 1500 nucleotides.)
I think in this case you need a local alignment, not a global one. Here is the code:
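As a rough illustration of what a local alignment does when a short query is compared against a longer target, here is a small pure-Python Smith-Waterman sketch. It is not the Biopython or EMBOSS code referred to in this thread; the function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example.

```python
from __future__ import print_function


def smith_waterman(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned part of seq_a, aligned part of seq_b)
    for one optimal local alignment with a linear gap penalty."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # substitution score for seq_a[i-1] against seq_b[j-1]
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # fill the table; scores never drop below zero (local alignment)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # trace back from the best cell until a zero score is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append('-')
            i -= 1
        else:
            out_a.append('-')
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, ''.join(reversed(out_a)), ''.join(reversed(out_b))


print(smith_waterman('ATGGATCATTGA', 'CATTGA'))  # (6, 'CATTGA', 'CATTGA')
```

Only the best-matching region is reported, so the unmatched flanks of the longer sequence do not count against the shorter one.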
Is it the same if I use BLAST with a local database?
In BLAST you are aligning the query sequence to all the sequences in a database, but here, as you can see, we are comparing only two sequences.
This is a great answer. Could you mention the import command that needs to go at the beginning of this code? I think we would need to import Biopython and Needle?
For more please visit https://homolog.us/Biopython/api-reference.html