Question

How To Solve The "Input Sequence" Problem In Multiple Sequence Alignment?

4

13.0 years ago

User1029725 ▴ 100

Randomizing the input order of the sequences gives completely different alignments. Is there a way to address this problem?

More info: I am using the MUSCLE algorithm to perform the MSA.

multiple random • 7.9k views

ADD COMMENT • link updated 12.7 years ago by Andreas ★ 2.5k • written 13.0 years ago by User1029725 ▴ 100

0

How many sequences are you aligning with MUSCLE?

ADD REPLY • link 13.0 years ago by Larry_Parnell 16k

0

usually in the range of 15K-30K (funny, but true)

ADD REPLY • link 13.0 years ago by User1029725 ▴ 100

0

Do you have similar issues in other alignment programs? Have you tried MAFFT? http://mafft.cbrc.jp/alignment/software/

ADD REPLY • link 12.9 years ago by Steve Moss 2.3k

score 1 · Answer 1 · 2011-12-17

1

13.0 years ago

Martin A Hansen 3.0k

Bob Edgar (Author of Muscle) wrote this blog entry on big alignments:

Consider using Uclust.

ADD COMMENT • link 13.0 years ago by Martin A Hansen 3.0k

3

This answer is not related to the question and uclust has nothing to do with multiple alignments.

ADD REPLY • link 13.0 years ago by Andreas ★ 2.5k

1

The point is, as stated by Bob Edgar, that huge alignments are nonsense. The meaningful thing to do is clustering to bin alignable sequences.

ADD REPLY • link 13.0 years ago by Martin A Hansen 3.0k

0

Thanks for the info! Is there any reference in Bob's blog to the effect of input order on alignments (my original query), or did I miss something?

ADD REPLY • link 13.0 years ago by User1029725 ▴ 100

0

The point is, as stated by Bob Edgar, that huge alignments are nonsense. The meaningful thing to do is clustering to bin alignable sequences. Of course this is not the answer to the misguided question.

ADD REPLY • link 13.0 years ago by Martin A Hansen 3.0k

0

If that were true, then all Pfam full alignments would be nonsense. Of interest here is also a new paper from Chris Sander's group (http://www.ncbi.nlm.nih.gov/pubmed/22163331), where the authors used "big" protein alignments to accurately predict folds using statistical methods, which is not possible with "small" alignments.

ADD REPLY • link 12.9 years ago by Andreas ★ 2.5k

score 0 · Answer 2 · 2011-12-17

0

13.0 years ago

Larry_Parnell 16k

I found the following code that shuffles the order of sequences in fasta format. The "perl script randomly shuffles the order of sequences in a fasta file. Upon execution, specify your input file (without .fasta extension) and total no. of sequences." Feed that output into MUSCLE.
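The Perl script itself is not reproduced here. As a rough stand-in, here is a minimal Python sketch of the same idea (shuffling the record order, not the residues). The function names `read_fasta`, `write_fasta` and `shuffle_fasta` are made up for this illustration, and the optional `seed` is an addition that makes a given shuffle reproducible.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records, width=60):
    """Format (header, sequence) tuples as FASTA text with wrapped lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


def shuffle_fasta(text, seed=None):
    """Return the same records in random order; the sequences are untouched."""
    records = read_fasta(text)
    random.Random(seed).shuffle(records)
    return write_fasta(records)
```

Running `shuffle_fasta` with a few different seeds gives several input orders to feed to the aligner, each of which can be regenerated later from its seed.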

ADD COMMENT • link 13.0 years ago by Larry_Parnell 16k

1

That is exactly what the code linked in my response does. It shuffles not the sequences themselves but their order, so sequences 1,2,3,4,5 become, for example, 4,2,3,5,1. With this randomized ordering of the input sequences, you can test for the "input sequence" bias.

ADD REPLY • link 12.9 years ago by Larry_Parnell 16k

0

Sorry if my query was not clear. I am looking for ways to remove the input-order bias. In other words, if I change the input order, I obtain a completely different alignment. I want to know whether there is any way to always obtain a similar (if not identical) alignment, no matter what the input order is.
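To put a number on "similar (if not identical)", one rough option is to compare the gapped rows of two output alignments by sequence ID, so that the order of rows in the output files does not matter. The sketch below is only an illustration: `read_aligned` and `row_agreement` are hypothetical helper names, and it assumes both alignments are in aligned FASTA format with unique headers.

```python
def read_aligned(text):
    """Parse aligned FASTA text into a dict mapping header -> gapped row."""
    rows, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            rows[name] = ""
        elif line and name is not None:
            rows[name] += line
    return rows


def row_agreement(fasta_a, fasta_b):
    """Fraction of sequences whose gapped rows are identical in both alignments."""
    a, b = read_aligned(fasta_a), read_aligned(fasta_b)
    if a.keys() != b.keys():
        raise ValueError("alignments contain different sequence IDs")
    if not a:
        return 1.0
    same = sum(1 for name in a if a[name] == b[name])
    return same / len(a)
```

A value of 1.0 means the two runs produced the same alignment. This measure is strict: a row counts as different even if only one gap moved, so it is better suited to detecting any change than to grading how large the change is.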

ADD REPLY • link 13.0 years ago by User1029725 ▴ 100

score 0 · Answer 3 · 2011-12-17

0

13.0 years ago

Andreas ★ 2.5k

I think this "problem" arises because at some stage an asymmetric pairwise distance measure is computed, i.e. the result depends on ordering. However, I'm not sure where exactly this happens in MUSCLE. The first distance used there (k-mer distance) should be symmetric. Does running with -maxiter 1 always give the same result? A manual way to get rid of this would be to sort the sequences first according to some criterion (e.g. length), but there's of course no guarantee that this would give better alignments.

Andreas
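The sorting idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: `read_fasta` and `canonical_order` are hypothetical names, and the choice of keys (length, then sequence, then header) is one arbitrary option. The extra keys matter because Python's sort is stable, so sorting on length alone would leave equal-length sequences in their input order.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def canonical_order(records):
    """Sort longest first; break ties by sequence, then header.

    Every key depends only on record content, so any permutation of the
    same records yields the same output order.
    """
    return sorted(records, key=lambda r: (-len(r[1]), r[1], r[0]))
```

Writing the records back out in this order before running the aligner means the aligner always sees the same input, whatever order the file arrived in. It makes the result reproducible, not necessarily better.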

ADD COMMENT • link 13.0 years ago by Andreas ★ 2.5k

0

Yes, I was also thinking along the lines of sorting.

Until now, I was using -maxiter 5. Let me see what -maxiter 1 gives.

ADD REPLY • link 13.0 years ago by User1029725 ▴ 100

0

Yes, I was also thinking along the lines of sorting. Yeah, -maxiter 1 must give the same result, but let me run a test dataset and confirm!

ADD REPLY • link 13.0 years ago by User1029725 ▴ 100

0

I am surprised: even -maxiter 1 doesn't give the same alignment! Any idea why?

ADD REPLY • link 13.0 years ago by User1029725 ▴ 100

0

Even -maxiter 1 doesn't give the same alignment for randomized input order. I think this might happen if several pairs of sequences have the same pairwise k-mer score. In that case, either pair may be aligned before the other, resulting in a different alignment each time.
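The tie hypothesis can be illustrated with a toy example. The code below is not MUSCLE's implementation: the distance (1 minus the Jaccard similarity of k-mer sets) and the "keep the first best pair seen" rule are assumptions chosen only to show how a perfectly symmetric distance can still give order-dependent results when distances tie.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Toy distance: 1 - Jaccard similarity of k-mer sets (symmetric)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def first_pair_to_join(names, seqs, k=3):
    """Closest pair under the toy distance; on ties the first pair seen wins."""
    best = None
    for i, j in combinations(range(len(names)), 2):
        d = kmer_distance(seqs[i], seqs[j], k)
        if best is None or d < best[0]:
            best = (d, names[i], names[j])
    return frozenset(best[1:])


# Three sequences whose pairwise distances are all equal (each pair shares only "AAA").
seqs = {"x": "AAAACCCC", "y": "AAAAGGGG", "z": "AAAATTTT"}
order1 = ["x", "y", "z"]
order2 = ["z", "y", "x"]
pair1 = first_pair_to_join(order1, [seqs[n] for n in order1])
pair2 = first_pair_to_join(order2, [seqs[n] for n in order2])
print(sorted(pair1), sorted(pair2))
```

With the first order the pair chosen is x and y; with the reversed order it is y and z. If a guide tree is built this way, the two orders start from different first joins, which would be enough to change the final alignment.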

ADD REPLY • link 13.0 years ago by User1029725 ▴ 100

0

Then I'd go for sorting. You might also want to try MAFFT or, in the case of protein sequences, Clustal Omega (which has an internal switch to sort sequences first).

ADD REPLY • link 13.0 years ago by Andreas ★ 2.5k