How To Solve The "Input Order" Problem In Multiple Sequence Alignment?
13.0 years ago
User1029725 ▴ 100

Randomizing the input order of the sequences gives completely different alignments. Is there a way to address this problem?

More info: I am using the MUSCLE algorithm to perform the MSA.


How many sequences are you aligning with MUSCLE?

usually in the range of 15K-30K (funny, but true)

Do you have similar issues with other alignment programs? Have you tried MAFFT? http://mafft.cbrc.jp/alignment/software/

13.0 years ago

Bob Edgar (author of MUSCLE) wrote this blog entry on big alignments:

Consider using UCLUST.


This answer is not related to the question and uclust has nothing to do with multiple alignments.

Thanks for the info! Is there any reference to the effect of input order on alignments (my original query) in Bob's blog, or did I miss something?


The point is, as stated by Bob Edgar, that huge alignments are nonsense. The meaningful thing to do is clustering to bin alignable sequences. Of course, this is not the answer to the misguided question.


If that were true, then all Pfam full alignments would be nonsense. Also of interest here is a new paper from Chris Sander's group (http://www.ncbi.nlm.nih.gov/pubmed/22163331), in which the authors used "big" protein alignments to accurately predict folds using statistical methods, which is not possible with "small" alignments.

13.0 years ago

I found the following code that shuffles the order of sequences in a FASTA file. Its description reads: "perl script randomly shuffles the order of sequences in a fasta file. Upon execution, specify your input file (without .fasta extension) and total no. of sequences." Feed that output into MUSCLE.


That is exactly what the code linked in my response does. It shuffles not the sequences themselves, but their order. So sequences 1,2,3,4,5 will become 4,2,3,5,1, for example. With this randomized ordering of the input sequences, you can test for the "input sequence" bias.
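For readers without the linked Perl script, here is a minimal Python sketch of the same idea: shuffle the order of the records in a FASTA file while leaving each sequence untouched. It is not the script referred to above; the function names (`parse_fasta`, `format_fasta`, `shuffle_records`) and the `seed` parameter are my own, chosen for illustration.

```python
import random


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(records, width=60):
    """Write records back as FASTA text, wrapping sequences at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


def shuffle_records(records, seed=None):
    """Return a new list with the record order shuffled.

    The sequences themselves are not modified. Passing a seed makes the
    shuffle reproducible, which helps when comparing alignment runs.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled
```

Typical use would be to read the input file, call `shuffle_records` with a few different seeds, write each result with `format_fasta`, and run the aligner on every shuffled file to see how much the output varies.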


Sorry if my query was not clear. I am looking for ways to remove the input-order bias. In other words, if I change the input order, I obtain a completely different alignment. I want to know whether there is any way to always obtain a similar (if not identical) alignment, no matter what the input order is.

13.0 years ago
Andreas ★ 2.5k

I think this "problem" arises because at some stage an asymmetric pairwise distance measure is computed, i.e. the result depends on the ordering. However, I'm not sure where exactly this happens in MUSCLE. The first distance used there (k-mer distance) should be symmetric. Does -maxiter 1 always give the same result? A manual way to get rid of this would be to sort the sequences first according to some criterion (e.g. length), but there is of course no guarantee that this would give better alignments.

Andreas
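As a rough sketch of the sorting idea, the snippet below puts the records into a canonical order before they are handed to the aligner, so that every permutation of the same input produces the same file. The sort key (longest first, then sequence, then header) is just one possible choice and an assumption on my part; `canonical_order` is a hypothetical helper, not part of any aligner.

```python
import itertools


def canonical_order(records):
    """Sort (header, sequence) records by a key independent of input order.

    Key: length descending, then the sequence string, then the header,
    so that ties in length are also resolved deterministically.
    """
    return sorted(records, key=lambda rec: (-len(rec[1]), rec[1], rec[0]))


# Small demonstration: every permutation of the input sorts to one order.
records = [("s1", "ACGT"), ("s2", "ACGTAA"), ("s3", "TTGA"), ("s4", "AC")]
orders = {
    tuple(canonical_order(list(perm)))
    for perm in itertools.permutations(records)
}
```

This makes the input reproducible, and with it the aligner's output if the aligner itself is deterministic. It does not by itself make the resulting alignment any better.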


Yes, I was also thinking along the lines of sorting. Until now I was using -maxiter 5; -maxiter 1 should give the same result, but let me run a test dataset to confirm.

Even -maxiter 1 doesn't give the same alignment for a randomized input order. I think this might happen when more than two sequences have the same pairwise k-mer score: in that case either of them may be aligned before the other, giving a different alignment each time.
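To illustrate the tie hypothesis, here is a toy Python sketch. It does not reproduce MUSCLE's actual distance or clustering code; the Jaccard-style k-mer distance and both "closest pair" functions are assumptions made for the demonstration. It shows that a symmetric distance can still give order-dependent results when ties are resolved by scan order, and that an explicit tie-break removes the dependence.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Toy symmetric distance: 1 - Jaccard similarity of the k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def closest_pair_naive(records):
    """First pair with minimal distance, in scan order (ties keep the first)."""
    best, best_d = None, float("inf")
    for (n1, s1), (n2, s2) in combinations(records, 2):
        d = kmer_distance(s1, s2)
        if d < best_d:
            best, best_d = tuple(sorted((n1, n2))), d
    return best


def closest_pair_stable(records):
    """Minimal-distance pair, with ties broken by the sorted names."""
    return min(
        (kmer_distance(s1, s2), tuple(sorted((n1, n2))))
        for (n1, s1), (n2, s2) in combinations(records, 2)
    )[1]


# A-B and C-D are equally close, so there is a tie for the first merge.
records = [("A", "ACGT"), ("B", "ACGA"), ("C", "TTGG"), ("D", "TTGC")]
reordered = [records[2], records[3], records[0], records[1]]

naive_original = closest_pair_naive(records)
naive_reordered = closest_pair_naive(reordered)
stable_original = closest_pair_stable(records)
stable_reordered = closest_pair_stable(reordered)
```

Here the naive version picks a different pair for the two input orders, while the version with an explicit tie-break picks the same pair for both. Whether this is what actually happens inside MUSCLE is not confirmed in this thread.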


Then I'd go for sorting. You might also want to try MAFFT or, in the case of protein sequences, Clustal Omega (which has an internal switch to sort sequences first).
