Question

Comparing Multiple-Sequence Alignment Pipelines

12.7 years ago

Nathan Harmston ★ 1.1k

Hi,

So I'm trying to generate a multiple sequence alignment from a number of species (a task I've never attempted before). However, there are several ways to do this. Currently I'm running a blastz and multiz pipeline, and I am planning on trying out other aligners such as TBA and PECAN.

My question is: why should I prefer an alignment generated by pipeline A over an alignment generated by pipeline B (apart from my results looking prettier)? Is it possible to do this quantitatively, or at least in such a way that I can justify my choice of pipeline? What pipeline do you use for MSA, and why do you think it's better than a different one?

Hope this makes sense, many thanks in advance.

EDIT: I'll provide some more information - I'm trying to look for highly conserved elements between human, mouse, tetraodon and cow, as well as looking for large areas of conserved synteny between the species.

alignment comparative • 6.0k views

updated 12.7 years ago by Javier Herrero ▴ 300 • written 12.7 years ago by Nathan Harmston ★ 1.1k

score 8 · Answer 1 · 2012-04-28

Hi Nathan

It is a very good question, and people have discussed and will continue to discuss this for a long time. As a general rule, there is no perfect aligner and you should aim for an alignment that is good enough for what you want to achieve.

If you want to know more about how MultiZ, TBA and PECAN compare, here are a few comments that I hope you'll find useful. First, MultiZ is not really considered a multiple aligner, as it works by simply stacking or projecting pairwise alignments onto a reference species. TBA will get all the possible all-vs-all pairwise alignments and combine them into a single multiple alignment that is (I believe) refined a posteriori. I would strongly suggest you use TBA instead of MultiZ if at all possible. Pecan works similarly, except that it is a global aligner and it uses the concept of consistency (it tries to make the alignment between sequences A and C consistent with the alignments between A and B, and B and C) to combine all pairwise alignments into the final multiple one.
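To make the idea of consistency a bit more concrete, here is a minimal Python sketch. It is not how Pecan is implemented; it only illustrates the A-B, B-C, A-C relationship described above. The representation (each pairwise alignment as a dict from positions in one sequence to positions in the other) and the function names `compose` and `consistency` are assumptions made up for this example.

```python
def compose(a_to_b, b_to_c):
    """Derive the A->C mapping implied by chaining A->B and B->C."""
    return {a: b_to_c[b] for a, b in a_to_b.items() if b in b_to_c}


def consistency(a_to_c, a_to_b, b_to_c):
    """Fraction of A positions where the direct A->C alignment agrees
    with the A->C alignment implied by going through B.

    Only positions of A present in both mappings are compared.
    """
    implied = compose(a_to_b, b_to_c)
    shared = [a for a in a_to_c if a in implied]
    if not shared:
        return 0.0
    agree = sum(1 for a in shared if a_to_c[a] == implied[a])
    return agree / len(shared)


# Toy pairwise alignments (hypothetical data): position in one sequence
# mapped to the position it is aligned to in the other sequence.
a_to_b = {0: 0, 1: 1, 2: 2, 3: 4}
b_to_c = {0: 1, 1: 2, 2: 3, 4: 5}
a_to_c = {0: 1, 1: 2, 2: 4, 3: 5}

score = consistency(a_to_c, a_to_b, b_to_c)
print(f"consistency of A-C with A-B-C: {score:.2f}")
```

In this toy case position 2 of A is aligned to position 4 of C directly but to position 3 via B, so the three pairwise alignments are not fully consistent. A consistency-based aligner uses this kind of information to favour pairings that are supported by the other sequences.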

A global aligner is an aligner that will align the sequences from start to end, assuming there are no rearrangements in the sequence. When applied to whole genome sequences, it requires you to define the blocks of collinear sequences you want to align. You can use software like Enredo or Mercator for this. TBA is a local aligner and won't require this additional step.

Coming back to which software produces the best alignments, Kim and Sinha ( http://www.biomedcentral.com/1471-2105/11/54 ) showed in 2010 that PECAN outperformed other aligners (ClustalW, DIALIGN-TX, MAFFT, MAVID, MLAGAN). The same year, Chen and Tompa ( http://www.nature.com/nbt/journal/v28/n6/abs/nbt.1637.html ) used a different method to compare aligners. Again, PECAN seemed to outperform other aligners, namely MLAGAN, MAVID and TBA, especially at longer evolutionary distances (human to a non-placental mammalian species).
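If you just want a quick number for how much two pipelines disagree on the same sequences, one simple option is to compare the sets of residue pairs each alignment puts in the same column. The sketch below is not the method used in either paper above; it is a plain Jaccard-style agreement score, and it assumes both alignments contain the same sequences in the same order, given as gapped strings with `-` for gaps. The names `aligned_pairs` and `agreement` are made up for this example.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Return the set of residue pairs placed in the same column.

    Each residue is identified as (sequence index, ungapped position).
    """
    counters = [0] * len(rows)  # next ungapped position per sequence
    pairs = set()
    for column in zip(*rows):
        residues = []
        for i, ch in enumerate(column):
            if ch != "-":
                residues.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def agreement(rows_a, rows_b):
    """Shared aligned pairs divided by all aligned pairs (Jaccard index)."""
    pairs_a = aligned_pairs(rows_a)
    pairs_b = aligned_pairs(rows_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


# Two toy alignments of the same three sequences (hypothetical data).
aln_a = ["ACG-T", "AC-GT", "ACGGT"]
aln_b = ["ACG-T", "ACG-T", "ACGGT"]

print(f"agreement: {agreement(aln_a, aln_b):.3f}")
```

Note that this only tells you how different two alignments are, not which one is right. To argue that one is better you still need some external reference, for example simulated data or known homologous features.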

In short, my recommendation would be to use PECAN, but I am certainly biased (it is the aligner we use in Ensembl and I work in Ensembl). You also want to consider whether you prefer to use a local aligner like TBA that doesn't require you to pre-define the collinear blocks to be aligned or you are happy to work on getting a good homology map first. This is perfectly doable, but will require extra work.

score 1 · Answer 2 · 2012-04-20

It's difficult to advise on why you should prefer one aligner over another because you haven't given much information. Judging from your use of blastz, are you aligning whole genomes of human and mouse? The 'best' aligner depends on how closely related your sequences are and how large they are, but also on which trade-off you prefer: very few mistakes, each of which is big when it does happen, or somewhat more mistakes, each of which is smaller.

I myself tend to use ClustalX (or ClustalW), which does quite well on easy-to-align sequences. For sequences that are more difficult to align I go for Praline. However, I have never used this for whole-genome alignment, and I don't know what you are planning to use it for.