Entering edit mode
4.6 years ago
c.doorenweerd
▴
10
Hi,
I am trying to count the number of haplotypes in a group of sequences using biopython. My sequences have IUPAC ambiguities, which should not count as differences. E.g.:
ACTYG == ACTCG should be: True
I've tried to compare using == but biopython seems to read the sequences as strings for this comparison and does not incorporate the IUPAC codes?
I've also tried to use
hapcount = len(recordlist)
for a, b in itertools.combinations(recordlist, 2):
if calculator._pairwise(a, b) == 0:
print("match")
hapcount -= 1
but the pairwise comparison seems to count IUPAC codes as differences too??
Any ideas on how to do this would be much appreciated. It seems like something simple.
The answer to your question is not that trivial, because we can't say the two sequences are "equal", but rather "compatible". For example:
ACT and ACA -> False
ACT and ACN -> True
ACY and ACN -> True
ACY and ACR -> False
ACY and ACM -> Maybe
I don't know the format of your sequences, but maybe you can use this:
In my case, it says there is a match.
Thank you very much! You made me rethink what I was asking. I was not necessarily looking for sequences that are compatible, but sequences that are confidently 'unique'; so that ambiguities would not count as differences. I managed to write this function that does what I need: