python: find percentage of match between two sequences
2
1
Entering edit mode
8.3 years ago
Am.A ▴ 20

Hi all,

I want to find percentage of match between two sequences. I try to use get_matching_blocks(), but It does not work for large sequence, for example 1000 characters

x1="A................"
y1="T................"
s = SequenceMatcher(None, x1, y1)

result = s.get_matching_blocks()

is there any alternative method?

sequence • 18k views
ADD COMMENT
3
Entering edit mode
8.3 years ago
Medhat 9.8k

you could use

needle from biopython to calculate similarity like this

def needle_align_code(query_seq, target_seq):
    needle_cline = NeedleCommandline(asequence="asis:" + query_seq,
                                     bsequence="asis:" + target_seq,
                                     aformat="simple",
                                     gapopen=10,
                                     gapextend=0.5,
                                     outfile='stdout'
                                     )
    out_data, err = needle_cline()
    out_split = out_data.split("\n")
    p = re.compile("\((.*)\)")
    return p.search(out_split[25]).group(1).replace("%", "")

also you could calculate the hamming distance "if they are the same length" like

   from itertools import izip
   def hamming_distance(str1, str2):
         assert len(str1) == len(str2)
         return sum(chr1 != chr2 for chr1, chr2 in izip(str1, str2))

Or use levenshtein distance

more implementation here

Good luck :)

ADD COMMENT
1
Entering edit mode

The Distance library for Python also has an implementation of hamming distance and some other metrics which I've found useful in the past.

ADD REPLY
0
Entering edit mode

I think even it could be better than some of my suggestions

ADD REPLY
1
Entering edit mode

The 'problem' is that needle is not a Biopython library, Biopython supports commandline access to an existing EMBOSS installation (which contains needle). Thus you need to install EMBOSS beforehand.

ADD REPLY
0
Entering edit mode

dear medhat thank you for your response. does needle from biopython work for 2 sequences of different length? (my query sequences around 400 nucleotide and the target sequence is 1500 nucleotide)

ADD REPLY
0
Entering edit mode

I think in this case you need local alignment not global, here is the code:

def water_align(query_seq, target_seq):
        needle_cline = WaterCommandline(asequence="asis:" + query_seq,
                                        bsequence="asis:" + target_seq,
                                        aformat="simple",
                                        gapopen=10,
                                        gapextend=0.5,
                                        outfile='stdout'
                                        )
        out_data, err = needle_cline()
        out_split = out_data.split("\n")
        p = re.compile("\((.*)\)")
        return p.search(out_split[25]).group(1).replace("%", "")
ADD REPLY
0
Entering edit mode

Is it the same if I use blast with local database?

ADD REPLY
0
Entering edit mode

in blast you are aligning the query sequence to all the sequence in a db but here as you can see we are comparing only two sequence

ADD REPLY
0
Entering edit mode

This is a great answer. Could you mention the import command that needs to go at the beginning of this code? I think we would need to import Biopython and Needle?

ADD REPLY
1
Entering edit mode
from Bio.Emboss.Applications import NeedleCommandline
from Bio.Emboss.Applications import WaterCommandline

For more please visit https://homolog.us/Biopython/api-reference.html

ADD REPLY
3
Entering edit mode
8.3 years ago
Markus ▴ 320

Percentage of match between two sequences is possible not what you want, you are very likely interested in percentage of macht between two aligned sequences. That opens the question how to align your sequences (which algorithm to use). And consider this case:

seq_1 : ATGGATCATTGA
seq_2:  ------CATTGA

Is seq_2 100% identical to seq_1? Would you say the same the other way round?

However, here is an example that shows how you may use Biopython to address your problem (of course you need to have Python and Biopython installed):

from __future__ import print_function
from Bio import pairwise2 as pw2

first_seq = 'ATGGATCATTGA'
second_seq = 'CATTGA'

global_align = pw2.align.globalxx(first_seq, second_seq)

print(global_align[0])

The output is:

('ATGGATCATTGA', '------CATTGA', 6.0, 0, 12)

pw2.align2.globalxx makes an optimal global alignment between your two sequences, where every match counts 1 point , while mismatches and insertions/deletions cost nothing.

print(global_align[0]) returns the first alignment (there may be several different alignments with the same score). The third list element (the first number) is the number of matching residues, the other two numbers are the beginning and the end of the alignment. So the alignment has a length of 12 (which, in this case, is also the length of the longer sequences, but usually the alignment will be longer than each of the two sequences), there are 6 matching residues, the shorter sequence has a length of 6. So you may calculate the percentage as follows:

seq_length = min(len(first_seq), len(second_seq))
matches = global_align[0][2]

percent_match = (matches / seq_length) * 100
ADD COMMENT

Login before adding your answer.

Traffic: 2072 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6