Question

python: find percentage of match between two sequences

1

Entering edit mode

8.3 years ago

Am.A ▴ 20

Hi all,

I want to find percentage of match between two sequences. I try to use get_matching_blocks(), but It does not work for large sequence, for example 1000 characters

x1="A................"
y1="T................"
s = SequenceMatcher(None, x1, y1)

result = s.get_matching_blocks()

is there any alternative method?

sequence • 18k views

ADD COMMENT • link updated 8.3 years ago by Markus ▴ 320 • written 8.3 years ago by Am.A ▴ 20

score 3 · Answer 1 · 2016-08-24

3

Entering edit mode

8.3 years ago

Medhat 9.8k

you could use

needle from biopython to calculate similarity like this

def needle_align_code(query_seq, target_seq):
    needle_cline = NeedleCommandline(asequence="asis:" + query_seq,
                                     bsequence="asis:" + target_seq,
                                     aformat="simple",
                                     gapopen=10,
                                     gapextend=0.5,
                                     outfile='stdout'
                                     )
    out_data, err = needle_cline()
    out_split = out_data.split("\n")
    p = re.compile("\((.*)\)")
    return p.search(out_split[25]).group(1).replace("%", "")

also you could calculate the hamming distance "if they are the same length" like

   from itertools import izip
   def hamming_distance(str1, str2):
         assert len(str1) == len(str2)
         return sum(chr1 != chr2 for chr1, chr2 in izip(str1, str2))

Or use levenshtein distance

more implementation here

Good luck :)

ADD COMMENT • link 8.3 years ago by Medhat 9.8k

1

Entering edit mode

The Distance library for Python also has an implementation of hamming distance and some other metrics which I've found useful in the past.

ADD REPLY • link 8.3 years ago by keith.hughitt ▴ 280

0

Entering edit mode

I think even it could be better than some of my suggestions

ADD REPLY • link 8.3 years ago by Medhat 9.8k

1

Entering edit mode

The 'problem' is that needle is not a Biopython library, Biopython supports commandline access to an existing EMBOSS installation (which contains needle). Thus you need to install EMBOSS beforehand.

ADD REPLY • link 8.3 years ago by Markus ▴ 320

0

Entering edit mode

dear medhat thank you for your response. does needle from biopython work for 2 sequences of different length? (my query sequences around 400 nucleotide and the target sequence is 1500 nucleotide)

ADD REPLY • link 8.3 years ago by Am.A ▴ 20

0

Entering edit mode

I think in this case you need local alignment not global, here is the code:

def water_align(query_seq, target_seq):
        needle_cline = WaterCommandline(asequence="asis:" + query_seq,
                                        bsequence="asis:" + target_seq,
                                        aformat="simple",
                                        gapopen=10,
                                        gapextend=0.5,
                                        outfile='stdout'
                                        )
        out_data, err = needle_cline()
        out_split = out_data.split("\n")
        p = re.compile("\((.*)\)")
        return p.search(out_split[25]).group(1).replace("%", "")

ADD REPLY • link 8.3 years ago by Medhat 9.8k

0

Entering edit mode

Is it the same if I use blast with local database?

ADD REPLY • link 8.3 years ago by Am.A ▴ 20

0

Entering edit mode

in blast you are aligning the query sequence to all the sequence in a db but here as you can see we are comparing only two sequence

ADD REPLY • link 8.3 years ago by Medhat 9.8k

0

Entering edit mode

This is a great answer. Could you mention the import command that needs to go at the beginning of this code? I think we would need to import Biopython and Needle?

ADD REPLY • link 4.3 years ago by c_u ▴ 520

1

Entering edit mode

from Bio.Emboss.Applications import NeedleCommandline
from Bio.Emboss.Applications import WaterCommandline

For more please visit https://homolog.us/Biopython/api-reference.html

ADD REPLY • link 4.3 years ago by Medhat 9.8k

score 3 · Answer 2 · 2016-08-27

Percentage of match between two sequences is possible not what you want, you are very likely interested in percentage of macht between two aligned sequences. That opens the question how to align your sequences (which algorithm to use). And consider this case:

seq_1 : ATGGATCATTGA
seq_2:  ------CATTGA

Is seq_2 100% identical to seq_1? Would you say the same the other way round?

However, here is an example that shows how you may use Biopython to address your problem (of course you need to have Python and Biopython installed):

from __future__ import print_function
from Bio import pairwise2 as pw2

first_seq = 'ATGGATCATTGA'
second_seq = 'CATTGA'

global_align = pw2.align.globalxx(first_seq, second_seq)

print(global_align[0])

The output is:

('ATGGATCATTGA', '------CATTGA', 6.0, 0, 12)

pw2.align2.globalxx makes an optimal global alignment between your two sequences, where every match counts 1 point , while mismatches and insertions/deletions cost nothing.

print(global_align[0]) returns the first alignment (there may be several different alignments with the same score). The third list element (the first number) is the number of matching residues, the other two numbers are the beginning and the end of the alignment. So the alignment has a length of 12 (which, in this case, is also the length of the longer sequences, but usually the alignment will be longer than each of the two sequences), there are 6 matching residues, the shorter sequence has a length of 6. So you may calculate the percentage as follows:

seq_length = min(len(first_seq), len(second_seq))
matches = global_align[0][2]

percent_match = (matches / seq_length) * 100