Hello,
I am trying to use sequence alignment tools to align two text strings of length 500,000. I tried the pairwise2 package from Biopython to perform a global alignment (with one_alignment_only=True), but the process takes too much time and memory. I have read that Needleman-Wunsch is O(n^2) in both processing time and memory. This confuses me, as I believe biologists must align DNA sequences much longer than this. If someone could describe how biologists get around this issue and/or suggest common packages (especially ones written in Python) that are used to align such long sequences efficiently, I would be grateful. Any guidance would be greatly appreciated. Thank you.
Thank you very much! My main difficulty with BLAST is that it only seems to accept a restricted alphabet. I am especially interested in tools that allow me to align text containing any ASCII characters.
If you're interested in computing distances between strings that contain any ASCII characters, you may be better off asking this on StackExchange as a generic programming/computer science question. Few, if any, bioinformatics alignment tools will cope with all possible characters; most are restricted to [A,C,T,G,N,-] for DNA, and maybe ~25 amino acid characters if they support some of the more unusual ones.
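On the memory side of the original question: if only the global alignment score is needed, Needleman-Wunsch does not have to hold the full n x n matrix, because each row depends only on the row before it. Below is a minimal sketch in plain Python that works on any characters, ASCII or otherwise. The function name nw_score and the scoring values (match=1, mismatch=-1, gap=-1, linear gap penalty) are assumptions chosen for illustration, not anything taken from pairwise2 or BLAST. Note that the running time is still O(n*m), so pure Python will remain far too slow for two strings of length 500,000; this only shows that the memory cost can drop to O(min(n, m)). Recovering the alignment itself in linear space needs something like Hirschberg's algorithm, which is not shown here.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score of a and b, keeping only two DP rows in memory.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    # The scoring is symmetric, so put the shorter string along the row
    # to keep the two rows as small as possible.
    if len(b) > len(a):
        a, b = b, a

    # Row 0: an empty prefix of a against each prefix of b is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]

    for i, ca in enumerate(a, start=1):
        # Column 0: a prefix of a against an empty prefix of b is all gaps.
        curr = [i * gap] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)  # align ca with cb
            up = prev[j] + gap                                      # gap in b
            left = curr[j - 1] + gap                                # gap in a
            curr[j] = max(diag, up, left)
        prev = curr

    return prev[-1]
```

For example, nw_score("abc", "abd") returns 1 under these scores (two matches and one mismatch). Any two Python strings can be passed in, since the comparison is plain character equality rather than a lookup in a fixed alphabet.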