Hello there, I'm the author of the Python library "pyalign" for computing alignments, and I would be interested in getting feedback from experts on alignment algorithms in terms of the design of the library (since I'm not an expert myself).
In short, I would be interested in learning whether there are any glaring mistakes/omissions in the library's public API.
The library is https://github.com/poke1024/pyalign. It supports global, local and semiglobal alignments and various gap costs, and implements the usual classical algorithms (Smith-Waterman, Needleman-Wunsch, Gotoh for affine gap costs). The computations are done in a C++17 backend, which relies heavily on templates for code optimization. In terms of single-thread CPU performance (i.e. not comparing to more advanced SIMD or GPU implementations) it should be pretty fast.
Note that the library was not built for a bioinformatics context, but for users who need versatile alignment algorithms for other domains.
A more detailed notebook is available at https://mybinder.org/v2/gh/poke1024/pyalign-demo/HEAD?filepath=example.ipynb
In functionality, the library is mostly similar to https://biopython.org/docs/1.75/api/Bio.pairwise2.html
One special feature of pyalign (and the main reason for its existence) is that it can deal with large alphabets, like millions of different letters.
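To illustrate what "large alphabet" means in practice, here is a minimal pure-Python sketch of a Needleman-Wunsch global alignment score in which the pairwise score comes from a function rather than a precomputed matrix, so the letters can be arbitrary tokens such as words. This is not pyalign's API or implementation; the names `global_alignment_score` and `exact_match` are made up for this example, and a linear gap cost is assumed.

```python
def global_alignment_score(a, b, score, gap=-1.0):
    """Needleman-Wunsch score with a linear gap cost.

    a, b  : sequences of arbitrary comparable tokens (the "letters")
    score : function (x, y) -> float, used instead of a substitution matrix
    gap   : score added per gap position (assumed to be <= 0)
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(max(
                prev[j - 1] + score(x, y),  # align x with y
                prev[j] + gap,              # x aligned to a gap
                curr[j - 1] + gap,          # y aligned to a gap
            ))
        prev = curr
    return prev[-1]


def exact_match(x, y):
    # Hypothetical scoring function: +1 for identical tokens, -1 otherwise.
    return 1.0 if x == y else -1.0


# The "letters" here are whole words, so the alphabet is effectively unbounded.
a = "the quick brown fox".split()
b = "the brown fox jumps".split()
print(global_alignment_score(a, b, exact_match))
```

Because the score is computed on demand, memory depends only on the sequence lengths and not on the alphabet size, which is the property a large-alphabet aligner needs.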
Thanks for sharing! There seems to be a problem installing it on Kaggle (a cloud notebook service similar to Colab). I tried several ways: https://www.kaggle.com/code/alexandervc/pyalign-install-problem . Would you be so kind as to suggest a way? It might be of interest to the Kaggle community right now because of the ongoing CAFA5 challenge: https://www.kaggle.com/competitions/cafa-5-protein-function-prediction
Resolved: Bernhard Liebl (author of "pyalign") kindly provided a way to install it on Kaggle: https://www.kaggle.com/code/lieblb/pyalign-example/notebook
Comments from the author:
"installing via pip seems to fail because the xtensor libraries are missing
I managed to get it to run, however unfortunately it takes 8 minutes to compile:"
Out of curiosity: why is 8 minutes a problem when the notebooks are allowed to run for 9 hours?
Glad to hear from you, hope to see your submissions to CAFA5 :) First, it is a quote from the author, not from me. Second, to my mind it is indeed a small inconvenience in general when packages take so long to install. When you are trying to run quick experiments and some idea comes to mind, you open a Kaggle notebook and have to wait 8 minutes... sometimes you will prefer another package... at least that is my feeling.
Does it support substitution matrices, e.g. BLOSUM62, for local protein alignments? Does it support different costs for gap opening and gap extension? (For BLASTP the typical choice is 11,1; see e.g. the NCBI web site.)
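For readers unsure what these two options amount to, here is a minimal pure-Python sketch of a local alignment score with affine gap costs (the Gotoh recurrences with a Smith-Waterman zero floor). It is not pyalign's API. The function names are made up, `toy_score` is only a stand-in for a real substitution matrix such as BLOSUM62, and the sketch assumes that a gap of length k costs gap_open + k * gap_extend (conventions differ between tools).

```python
NEG_INF = float("-inf")


def local_affine_score(a, b, score, gap_open=11.0, gap_extend=1.0):
    """Best local alignment score with affine gap costs.

    Assumption: a gap of length k costs gap_open + k * gap_extend.
    """
    n, m = len(a), len(b)
    # H: best score of a local alignment ending at (i, j)
    # E: best score ending at (i, j) with a gap in a (letter of b unpaired)
    # F: best score ending at (i, j) with a gap in b (letter of a unpaired)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap from H or extend an existing one.
            E[i][j] = max(H[i][j - 1] - gap_open - gap_extend,
                          E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open - gap_extend,
                          F[i - 1][j] - gap_extend)
            # The 0.0 floor lets a local alignment start anywhere.
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          E[i][j],
                          F[i][j])
            best = max(best, H[i][j])
    return best


def toy_score(x, y):
    # Stand-in for a substitution matrix lookup (NOT BLOSUM62 values).
    return 5.0 if x == y else -4.0


# The second sequence is the first with one letter ("I") deleted.
print(local_affine_score("ACDEFGHIKLMNPQ", "ACDEFGHKLMNPQ", toy_score))
```

With a real matrix, `score` would simply look up the pair of residues in a table. Separate open and extend costs make one long gap cheaper than several short ones, which is the reason for choices like 11,1.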