Hello,
I'd like to use a multi sequence alignment algorithm, such as MUSCLE or CrustalW. But I'm not working with genetic data and my sequences are not composed of letters : each element of each sequence is a vector of length n. n is a the number of variables.
Here is an example : Genetic sequence of length 4 : 'ACCG'. Sequence of length 4 (and n=3) in my problem : [[1.5, 2.8, 2.3], [2.5, 2.1, 7.9], [5.2, 2.4, 2.7], [2.4, 2.6, 5.1]].
alignment algorithms use a score system to evaluate the alignment quality at each position. For example, if 'A' meets 'A', the score is +1, and if 'A' meets 'C', score is -1. In my problem, I want to use the euclidian distance to evaluate the score of an alignement (and my goal is to minimize these distances along the alignment).
Example 1 : [1.5, 2.8, 2.3] vs [1.6, 2.7, 2.3] : euclidian distance = 0.14. It's a good score Example 2 : [1.5, 2.8, 2.3] vs [2.9, 2.4, 5.2] : euclidian distance = 3.24. It's not a good score
I'd like to use this algorithm in Python. I know that MUSCLE and CrustalW have been implemented in Python, in the Biopython library for example. But these libraries are generally designed to deal with DNA/protein sequences of letters.
Do you know how I could use one of these algorithms for my problem ? Is it possible to change the scoring system in the Biopython library for example ? Or do I need to rewrite the algorithm myself ? Thanks.
You are right that you wont be able to use established alignment programs for this. CLUSTAL and MUSCLE are not 'implemented in python', rather, the biopython modules are intefaces/wrappers around the normal commandline binaries, so you cannot simply 'hack this'.
This isn't an alignment problem as you have posed it, because all your examples are the same length, i.e. vectors of length 4. There is nothing to align here.
If you need to calculate the euclidean distance between sets of vectors, there are established methods for doing this, particularly in the
numpy
module:https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy
If there is some alignment component to this that the question doesn't make apparent, you can simply use an implementation of Smith-Waterman or Needleman-Wunsch (you haven't said whether you need local or global alignment) for which you should be able to find some python code online quite easily.
Thanks for your answer.
In my example, I show a sequence of length 4, composed of vectors of size n=3 each. I want to align it with several other sequences of different lengths, but all containing vectors of size n=3. So I don't want to align or modify vectors, I want to align sequences of vectors. It's why I explained that my vector of size 3 is the equivalent of a nucleotide, not a sequence ! [[1.5, 2.8, 2.3], [2.5, 2.1, 7.9], [5.2, 2.4, 2.7], [2.4, 2.6, 5.1]] is 1 sequence, not 4 different sequences to align.
Needleman-Wunsch is used for pairwise alignment. And as I said, I want to make multi sequence alignment, not pairwise alignment. Indeed, it's easy to find a python script for this alorithm, and do some modifications to make it work on my problem. I alrealy did that. Now, I'd like to do something similar, but multi-sequence alignment.
Also, I don't understand why you say it's a problem to align sequences of same length.
Ah I see, thanks for clarifying. This is a bit more outside my area of expertise, but I think I'm right in saying that many multiple sequence alignment processes are still just multi-pairwise alignments which are pulled together by building a guide tree (e.g. for neighbour joining), with a some final 'polishing steps'.
Are these scores related in any way to a conventional alignment? If so, it could be possible to reconstitute the aligned status using an approach like I've done in the past to apply the 'moves' from a CIGAR string to the vectors. I can share some code if so.
There is no alignment to be done, they are already aligned. You could score them, but there would be no aligning here if they are already the same length.
I think you are going to have to implement an algorithm from scratch yourself for this, as I'm not aware of any way you could coerce an existing tool to do this. You may need to refer to the original papers to extract the basis for the algorithm.
Yes but it looks much more complex than pairwise alignment ! So it's why I'm trying to adapt already existing scripts, I don't feel able to code everything myself.
I'd like to take into accounts insertions and deletions, like in conventional alignment. But I can't "summarize" my sequence of vectors in a sequence of letters, I would lost to much information.
On the wikipedia page of the needleman-wunsch algorithm, the example given is the alignement of GCATGCU and GATTACA. These 2 sequences have the same length, but it's still possible to do alignment, and the result would be GCATG-CU and G-ATTACA after insertions and deletions, right ? In any case, it's what I want to do for my problem
Thanks again for taking the time to answer me !
That's true actually on the alignment, good point - I must have been getting confused somewhere (long day)...
I'm afraid I can't be much more help on the MSA challenge, but as I said, my suspicion is that you may need to code this from scratch, unless you can approach the task in a totally different way. It might help if you tell us what the actual problem you're trying to solve is?