Question

Multiple Sequence Alignment Of Thousands Of Proteins

9

13.6 years ago

Dror ▴ 280

I want to track the evolution of several domains, and to do so I need to align and cluster thousands of sequences. Is this possible, and what is the best software to use for it? Eventually I want to identify the most "basal" sequence, which might lead me to the most ancient protein containing this sequence.

alignment clustering evolution domain • 7.9k views

updated 13.6 years ago by Andreas ★ 2.5k • written 13.6 years ago by Dror ▴ 280

score 12 · Answer 1 · 2011-04-24

13.6 years ago

2184687-1231-83- ★ 5.1k

"mafft --auto" is stable for up to hundreds of thousands of proteins and produces reasonable alignments: http://mafft.cbrc.jp/alignment/software/
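A minimal sketch of how that command might be driven from a script, assuming `mafft` is installed and on the PATH. The function names and any file names passed to them are hypothetical; `--auto` is the flag named above, and `--thread` is MAFFT's option for setting the number of threads. MAFFT writes the alignment to standard output, so the sketch redirects it to a file.

```python
import shutil
import subprocess
from typing import List, Optional


def build_mafft_command(input_fasta: str, threads: Optional[int] = None) -> List[str]:
    """Return the argument list for `mafft --auto` on one FASTA file."""
    cmd = ["mafft", "--auto"]
    if threads is not None:
        # Optional: ask MAFFT to use this many threads.
        cmd += ["--thread", str(threads)]
    cmd.append(input_fasta)
    return cmd


def run_mafft(input_fasta: str, output_fasta: str, threads: Optional[int] = None) -> None:
    """Run MAFFT and save the alignment it prints on stdout to output_fasta."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft not found on PATH")
    with open(output_fasta, "w") as out:
        # check=True raises CalledProcessError if MAFFT exits with an error.
        subprocess.run(build_mafft_command(input_fasta, threads), stdout=out, check=True)
```

For example, `run_mafft("domains.fasta", "domains.aln.fasta", threads=4)` would be equivalent to running `mafft --auto --thread 4 domains.fasta > domains.aln.fasta` in a shell (both file names are placeholders).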

2

Just as an example, the 10 biggest alignments in the Ensembl Families contain roughly 50000, 20000, 14000, 12000, 10000, 9200, 9100, 7800, 7500 and 6800 sequences, all aligned with "mafft --auto".

13.6 years ago by 2184687-1231-83- ★ 5.1k

0

Just as an example, the biggest Ensembl Families are aligned with mafft auto and they are big:

+----------+-----------+
| count(*) | family_id |
+----------+-----------+
|    54909 |         1 |
|    19735 |         2 |
|    14461 |         4 |
|    12625 |         5 |
|    10452 |         3 |
|     9223 |         6 |
|     9178 |        57 |
|     7842 |         9 |
|     7568 |         7 |
|     6810 |         8 |
+----------+-----------+

13.6 years ago by 2184687-1231-83- ★ 5.1k

score 4 · Answer 2 · 2011-04-25

13.6 years ago

Liam Thompson ▴ 140

Have you tried MUSCLE? I've only used it for hundreds of sequences, and it produced a good alignment in good time. I think it would probably work nicely with a cluster or a beefy desktop.

score 3 · Answer 3 · 2011-04-24

Hi Dror,

I am not aware of any application that accepts thousands of sequences and aligns them with high accuracy. Fast Statistical Alignment (http://fsa.sourceforge.net/) seems to accept a few hundred sequences, though I am not sure exactly how many or whether it will produce an accurate alignment. But if you really want to align that many sequences, why not partition the dataset, align the partitions separately and then combine the alignments? I guess that will give you better alignments and will be less time-consuming. I will let you know if I find any application that meets your requirement.
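The partition step could be sketched roughly as follows. This is an illustration only: it splits the sequences into fixed-size groups in input order, whereas grouping similar sequences together would probably be more sensible in practice. All function names are hypothetical, and the combine step is not shown because it depends on the aligner used.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def partition(records: List[Record], size: int) -> Iterator[List[Record]]:
    """Yield consecutive groups of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield records[start:start + size]


def to_fasta(records: List[Record]) -> str:
    """Serialise records back to FASTA text, one sequence per line."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Each group returned by `partition` can be written out with `to_fasta` and aligned on its own before the resulting alignments are combined.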

Cheers, Kartik

score 3 · Answer 4 · 2011-04-28

Even though this might be considered shameless advertisement:

The new version of Clustal (Clustal Omega) is able to cope with this number of sequences (and many more) when using the --mbed flag. See the announcement on the Clustal Homepage. It's currently a protein-only, command-line-only, Unix-only, pre-publication beta version :)

Andreas