I'm trying to abuse Biopython for a non-bioinformatics problem: aligning general sentences to find common patterns. To do so, I break each sentence into a list of strings and align the lists with pairwise2.
Example (not real sentences):

    list1 = ["MEEP", "QS", "DPSV", "EPPLS"]
    list2 = ["MEES", "QS", "DISL", "EPPLS"]
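Using globalxs:

    from Bio import pairwise2

    # open-gap penalty -10, extend-gap penalty -0.5; gap_char is a list because the inputs are lists
    alns = pairwise2.align.globalxs(list1, list2, -10, -0.5, gap_char=['-'])
    top_aln = alns[0]
    aln_1, aln_2, score, begin, end = top_aln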
The resulting alignment is not what I had hoped for. Each string has been split into single characters, and the order is reversed:

    ['P', 'E', 'E', 'M', 'S', 'Q', 'V', 'S', 'P', 'D', 'S', 'L', 'P', 'P', 'E']
    ['S', 'E', 'E', 'M', 'S', 'Q', 'L', 'S', 'I', 'D', 'S', 'L', 'P', 'P', 'E']
I tried to debug the source code. I'm not totally fluent in Python, so it was hard to follow all the hidden calls to different functions, but the string comparison seems fine (it happens when __call__ of class identity_match is called, at line 901), so I guess the match matrix is fine. The problem is probably in the traceback that builds the alignment, where I keep landing in this block (line 749):

    elif trace % 4 == 2:  # = match/mismatch of seqA with seqB
        trace -= 2
        row -= 1
        col -= 1
        ali_seqA += sequenceA[row]
        ali_seqB += sequenceB[col]
        col_gap = False

so I get no gaps. It's my first run with Biopython, so I guess it's something in my configuration. What I really don't understand, though, is why the strings get split and reversed.
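To narrow this down, here is a minimal sketch in plain Python (no Biopython) that mimics only the `ali_seqA += sequenceA[row]` step from the block quoted above. The function name `traceback_sketch` and the `keep_tokens` flag are my own, purely for illustration. I am also assuming, without having confirmed it in the source, that the traceback walks from the last element to the first and reverses the result once at the end. Under those assumptions, `list += str` extends the list with the individual characters of the string, which reproduces exactly the output I am seeing:

```python
def traceback_sketch(sequence, keep_tokens=False):
    """Hypothetical stand-in for the match/mismatch branch of the traceback."""
    ali_seq = []
    # Walk from the last element to the first, as a traceback would.
    for row in range(len(sequence) - 1, -1, -1):
        if keep_tokens:
            # A one-element slice is a list, so += appends the whole string.
            ali_seq += sequence[row:row + 1]
        else:
            # sequence[row] is a str, so += extends the list character by character.
            ali_seq += sequence[row]
    # Assumption: the alignment is built back to front and reversed at the end.
    return ali_seq[::-1]


list1 = ["MEEP", "QS", "DPSV", "EPPLS"]
list2 = ["MEES", "QS", "DISL", "EPPLS"]

print(traceback_sketch(list1))
print(traceback_sketch(list2))
print(traceback_sketch(list1, keep_tokens=True))
```

The first two calls print the same character lists as my alignment above, and the third returns the original list of whole strings. So the splitting may come from how a string is added to a list rather than from my configuration, but I don't know whether list input is meant to be handled this way by the library.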