Question

Large Scale Protein Alignment

2

11.5 years ago

jzabilansky ▴ 60

I am analysing a data set of over 25,000 sequences and want to align them. Is there a way to do this efficiently, without the alignment program crashing because of the size of the data?

protein multiple alignment • 4.1k views

updated 7.0 years ago by Biostar 20 • written 11.5 years ago by jzabilansky ▴ 60

1

Which programs have you used? For example, have you tried Clustal Omega (http://www.clustal.org/omega/)?

11.5 years ago by Niallhaslam 2.3k

0

Can you tell us a bit more about your 25,000 sequences? Are they all for the same gene, or for a gene family? Do you want to do global alignments or assemble them?

11.5 years ago by Eric Normandeau 11k

0

They are all for the same gene and I wish to do global alignments.

11.5 years ago by jzabilansky ▴ 60

0

25,000 sequences for the same gene sounds like an awful lot. Have you considered trimming the set a bit, perhaps by extracting just the N most informative sequences? I know this can be done with T-Coffee, but I'm not sure whether it is suitable for such a big data set.
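One cheap way to start trimming, before reaching for anything like T-Coffee's selection of informative sequences, is to drop exact duplicate sequences. This is a simpler, swapped-in technique and not what T-Coffee does; the function names `read_fasta` and `drop_exact_duplicates` are made up for this sketch, which assumes plain FASTA input.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    parts: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def drop_exact_duplicates(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique
```

With a file on disk this could be used as `drop_exact_duplicates(read_fasta(open("seqs.fasta")))`, where `seqs.fasta` is a placeholder name. How much it shrinks the set depends entirely on how redundant the data is.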

11.5 years ago by David Westergaard ★ 1.5k

score 4 · Answer 1 · 2013-07-16

Assuming that these are protein sequences, then as Niallhaslam suggests, Clustal Omega sounds like the best option. As noted in "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega", Clustal Omega has been tested with alignments of up to 200,000 sequences.

However, if your sequences are DNA or RNA, I would suggest you look at MAFFT or Kalign instead, since the method used in Clustal Omega does not perform as well on nucleotide alignments (this is being worked on).

If your sequences are short and very similar, then other multiple sequence alignment programs, such as MUSCLE and T-Coffee, might work, although the alignment may still require a lot of memory to complete successfully.
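For the Clustal Omega route, here is a minimal sketch of driving the command-line tool from Python. It assumes the `clustalo` binary is installed and on the PATH; the file names, the thread count, and the helper names `clustalo_command` and `run_clustalo` are placeholders chosen for illustration.

```python
import shutil
import subprocess
from typing import List


def clustalo_command(in_fasta: str, out_fasta: str, threads: int = 4) -> List[str]:
    """Build a Clustal Omega command line that writes a FASTA alignment."""
    return [
        "clustalo",
        "-i", in_fasta,          # unaligned input sequences
        "-o", out_fasta,         # aligned output
        "--outfmt=fasta",        # write the alignment as FASTA
        f"--threads={threads}",  # number of threads to use
        "--force",               # overwrite the output file if it exists
    ]


def run_clustalo(in_fasta: str, out_fasta: str, threads: int = 4) -> None:
    """Run Clustal Omega; raises if the binary is missing or exits non-zero."""
    if shutil.which("clustalo") is None:
        raise FileNotFoundError("clustalo not found on PATH")
    subprocess.run(clustalo_command(in_fasta, out_fasta, threads), check=True)
```

Calling `run_clustalo("seqs.fasta", "aligned.fasta", threads=8)` would then start the alignment. Run time and memory use for 25,000 sequences will depend on sequence length and the machine, so it is worth trying a subset first.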

Ram · Answer 2 · 2013-07-15

1

11.5 years ago

jomaco ▴ 200

If you wish to align those proteins to a reference assembly, you could use the protein2genome model of exonerate (http://www.ebi.ac.uk/~guy/exonerate/), which models introns. I used this when I wanted to align proteins from the TAIR10 database to our reference genome. You would probably also want to split the file into considerably smaller chunks, so that many faster individual alignments can be carried out before the results are merged - this way the alignment as a whole will be much quicker.

Edit: I assumed the proteins were being aligned to a reference sequence rather than to each other (in which case this solution would not be appropriate).
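The chunking idea above could be sketched roughly as follows. This is only an illustration, not the script actually used for the TAIR10 run: the helper names, chunk size, and file naming are hypothetical, and the records are assumed to be (header, sequence) pairs already parsed from FASTA. The exonerate command uses the `--model`, `--query` and `--target` options; how the chunks are dispatched (cluster jobs, a process pool, etc.) and how results are merged is left open.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def chunk_records(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Yield successive lists of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[Record] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly smaller, chunk
        yield chunk


def write_chunks(records: Iterable[Record], chunk_size: int,
                 out_dir: str, prefix: str = "chunk") -> List[Path]:
    """Write each chunk to its own FASTA file and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunk_records(records, chunk_size), start=1):
        path = out / f"{prefix}_{i:04d}.fasta"
        path.write_text("".join(f">{h}\n{s}\n" for h, s in chunk))
        paths.append(path)
    return paths


def exonerate_command(query_fasta: str, genome_fasta: str) -> List[str]:
    """Build an exonerate protein2genome command for one protein chunk."""
    return ["exonerate", "--model", "protein2genome",
            "--query", query_fasta, "--target", genome_fasta]
```

Each path returned by `write_chunks` can be passed to `exonerate_command` together with the genome file, and the resulting commands run independently of one another.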

11.4 years ago by jomaco ▴ 200

1

I'm glad you made the wrong assumption, as this is exactly what I wanted! In the spirit of Stack Exchange, perhaps I should write a specific question for you to answer? Hey! I just did: How to align a protein set to a genome?

updated 5.1 years ago by Ram 44k • written 9.1 years ago by Dan ▴ 540

score 0 · Answer 3 · 2013-07-16

0

11.5 years ago

Abhiman ▴ 130

Kalign is a fast alignment program that I have used to align large numbers of sequences (~50,000). It is available here: http://msa.sbc.su.se/downloads/kalign/current.tar.gz