I have an application in which I need to align sequences of words. The application is not related to bioinformatics, but I was hoping to leverage Biopython's support for sequence alignment. However, I ran into a TypeError when I tried to create a sequence of words:
In [1]: from Bio.Seq import Seq
In [2]: seq_1 = Seq(['abc', 'def', 'ghi', 'jkl'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-567b8a62bb5f> in <module>()
----> 1 seq_1 = Seq(['abc', 'def', 'ghi', 'jkl'])
/usr/local/lib/python2.7/dist-packages/Bio/Seq.pyc in __init__(self, data, alphabet)
104 # Enforce string storage
105 if not isinstance(data, basestring):
--> 106 raise TypeError("The sequence data given to a Seq object should "
107 "be a string (not another Seq object etc)")
108 self._data = data
TypeError: The sequence data given to a Seq object should be a string (not another Seq object etc)
I am curious whether there is a reason Biopython requires sequence data to be supplied as a string, which in effect supports alignment only on sequences of characters. Is this because bioinformatics only ever needs to align RNA, DNA and protein sequences, all of which can be encoded as sequences of characters? Or is there a more subtle reason (possibly performance) behind this choice?
Just out of curiosity, what does "align" mean to you in this context? The various sequence aligners are all built on classic string-matching algorithms, which of course necessitate a string-like datatype.
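For illustration, here is a minimal sketch of what aligning word sequences could look like without Biopython, assuming "align" means global alignment in the Needleman-Wunsch sense. The dynamic-programming recurrence only compares elements for equality, so it runs on lists of words as well as on strings. The function name `align_tokens` and the match/mismatch/gap scores are hypothetical choices for this example, not part of Biopython's API.

```python
def align_tokens(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token sequences (Needleman-Wunsch sketch).

    Returns a list of (token_a, token_b) pairs; None marks a gap.
    The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                pairs.append((a[i - 1], b[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))  # gap in b
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # gap in a
            j -= 1
    pairs.reverse()
    return pairs


seq_1 = ['abc', 'def', 'ghi', 'jkl']
seq_2 = ['abc', 'ghi', 'jkl', 'mno']
for left, right in align_tokens(seq_1, seq_2):
    print('%s\t%s' % (left or '-', right or '-'))
```

This runs in O(len(a) * len(b)) time and memory, which is fine for short word sequences but would need a more careful implementation for long ones. Whether exact equality is the right notion of a "match" between two words depends on the application.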