Question

How can I run multiple sequence alignment for a large number of proteins (~10k)?

2.6 years ago

O.rka ▴ 750

What's the preferred method for running multiple sequence alignment (MSA) on such a large number of protein sequences? I'm trying something fairly experimental, and an MSA would be really helpful for the approach.

I usually use MUSCLE and noticed it has a super5 command meant for this: https://drive5.com/muscle5/manual/cmd_super5.html

How can I adjust the parameters to avoid running out of memory? Alternatively, is there another tool that's better suited for this? Basically, I want the MSA in FASTA format as output.

muscle msa protein multiple alignment sequence • 2.1k views

ADD COMMENT • link updated 2.6 years ago by Mensur Dlakic ★ 29k • written 2.6 years ago by O.rka ▴ 750

Hi, take a look at MAFFT.

ADD REPLY • link 2.6 years ago by mohammadhassanj ▴ 260

I should have mentioned that some of the sequences are long. There are a few that are ~70k. I've trimmed them out and it's working now, but I'll keep MAFFT in the back of my mind in case this fails.
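Removing the very long sequences before alignment, as described above, could be sketched roughly like this. It is a minimal standard-library example, not the poster's actual script; the `MAX_LENGTH` cutoff, the function names, and the demo records are all hypothetical and only there for illustration.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)

MAX_LENGTH = 10_000  # hypothetical cutoff; choose one that suits your data


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], max_length: int) -> Iterator[Record]:
    """Keep only records whose sequence is at most max_length long."""
    for header, seq in records:
        if len(seq) <= max_length:
            yield header, seq


def write_fasta(records: Iterable[Record], handle: TextIO, width: int = 60) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Small in-memory demo with made-up records; swap in open("proteins.fasta") for real data.
demo = io.StringIO(">short\nMKVLLA\n>long\n" + "A" * 20_000 + "\n")
out = io.StringIO()
write_fasta(filter_by_length(read_fasta(demo), MAX_LENGTH), out)
print(out.getvalue())
```

The filtered file can then be passed to whichever aligner is being used.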

ADD REPLY • link 2.6 years ago by O.rka ▴ 750

score 2 · Accepted Answer · 2023-01-12

2.6 years ago

Mensur Dlakic ★ 29k

As advertised in their paper, FAMSA is meant specifically for aligning huge protein families. Clustal Omega should work as well.

ADD COMMENT • link 2.6 years ago by Mensur Dlakic ★ 29k

Wow, FAMSA is really fast AND memory efficient. Nice find, thank you! Do you usually use single linkage, UPGMA, or NJ for the guide tree?

ADD REPLY • link 2.6 years ago by O.rka ▴ 750

I have always used single linkage but with proteins that were not as long as yours. If speed and memory are not problematic for your computer setup, that should be the best choice.
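For reference, here is a hedged sketch of how the guide-tree choice discussed above might be passed to FAMSA from Python. It assumes the `-gt` (guide tree: `sl`, `upgma`, or `nj`) and `-t` (thread count) options as I understand them from the FAMSA README; check the help output of your installed version before relying on them. The helper name and file names are hypothetical.

```python
import shlex
from typing import List

# Guide-tree methods assumed to be accepted by FAMSA's -gt option.
GUIDE_TREES = ("sl", "upgma", "nj")


def build_famsa_command(input_fasta: str, output_fasta: str,
                        guide_tree: str = "sl", threads: int = 0) -> List[str]:
    """Build an argument list for a FAMSA run (hypothetical helper).

    threads=0 is assumed to mean "use all available cores".
    """
    if guide_tree not in GUIDE_TREES:
        raise ValueError(f"guide_tree must be one of {GUIDE_TREES}")
    return ["famsa", "-gt", guide_tree, "-t", str(threads),
            input_fasta, output_fasta]


# Hypothetical file names; this only prints the command, it does not run it.
cmd = build_famsa_command("proteins.fasta", "proteins.aln.fasta", guide_tree="sl")
print(shlex.join(cmd))
```

To actually run the alignment, the list can be handed to `subprocess.run(cmd, check=True)`, provided `famsa` is on your `PATH`.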

ADD REPLY • link 2.6 years ago by Mensur Dlakic ★ 29k