Question

Biopython pairwise sequence comparison using IUPAC?

1

Entering edit mode

4.6 years ago

c.doorenweerd ▴ 10

Hi,

I am trying to count the number of haplotypes in a group of sequences using biopython. My sequences have IUPAC ambiguities, which should not count as differences. E.g.:

ACTYG == ACTCG should be: True

I've tried to compare using == but biopython seems to read the sequences as strings for this comparison and does not incorporate the IUPAC codes?

I've also tried to use

hapcount = len(recordlist)
for a, b in itertools.combinations(recordlist, 2):
    if calculator._pairwise(a, b) == 0:
        print("match")
        hapcount -= 1

but the pairwise comparison seems to count IUPAC codes as differences too??

Any ideas on how to do this would be much appreciated. It seems like something simple.

Biopython • 1.9k views

ADD COMMENT • link 4.6 years ago by c.doorenweerd ▴ 10

1

Entering edit mode

The answer to your question is not that trivial, because we can't say the two sequences are "equal", but rather "compatible". For example:

ACT and ACA -> False
ACT and ACN -> True
ACY and ACN -> True
ACY and ACR -> False
ACY and ACM -> Maybe

I don't know the format of your sequences, but maybe you can use this:

from itertools import product
from Bio.Data.IUPACData import ambiguous_dna_values

def ambiguous_dna_list(seq):
    """return list of all possible sequences given an ambiguous DNA input"""
    d = ambiguous_dna_values
    return list(map("".join, product(*map(d.get, seq))))


myseq1 = "ACTYG"
myseq2 = "ACTCG"

seq1_list = ambiguous_dna_list(myseq1)
seq2_list = ambiguous_dna_list(myseq2)

for seq in seq1_list:
    for seq2 in seq2_list:
        if seq == seq2:
            print("match")

In my case, it says there is a match.

ADD REPLY • link 4.6 years ago by lumal29 ▴ 80

0

Entering edit mode

Thank you very much! You made me rethink what I was asking. I was not necessarily looking for sequences that are compatible, but sequences that are confidently 'unique'; so that ambiguities would not count as differences. I managed to write this function that does what I need:

def IUPACdistance_verbose(seq1, seq2):
    ignorelist = ["N","?","-","M","R","W","S","Y","K","V","H","D","B"]
    unamblengthseq1 = len(seq1.translate(str.maketrans('','','N?-MRWSYKVHDB')))
    print("seq1 bp: " + str(unamblengthseq1))
    unamblengthseq2 = len(seq2.translate(str.maketrans('','','N?-MRWSYKVHDB')))
    print("seq2 bp: " + str(unamblengthseq2))
    comparedlength = min(unamblengthseq1, unamblengthseq2)
    print("compared length: " + str(comparedlength))
    difference = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x in ignorelist or y in ignorelist:
            difference += 0
        elif x != y:
            difference += 1
    print("difference:" + str(difference))
    dpairwise = (difference / comparedlength)
    return dpairwise 

seq1 = 'AARCTGACACT'
seq2 = 'TRCCTGN-ACT'

IUPACdistance_verbose(seq1, seq2)

seq1 bp: 10
seq2 bp: 8
compared length: 8
difference:1
Out[14]:
0.125

ADD REPLY • link 4.5 years ago by c.doorenweerd ▴ 10