Question

Multiple Sequence Alignment Tool for Large Dataset

4.0 years ago

pfee418 ▴ 10

Hi everyone, I'm currently planning to run MSA on all human coronaviruses. I have 8,586 complete genomes/sequences and the FASTA file is about 256 MB.

Can MAFFT accept this many sequences as input? I have heard of MUSCLE and T-Coffee, and their accuracy is quite okay, but I'm not sure whether they can handle a dataset this large. Are there any well-known MSA tools that can handle this much data? My requirements for the MSA tool are moderate accuracy (high accuracy would be even better), short computation time (as short as possible) and, of course, the ability to handle my large dataset :'D

I found that there are MSA tools designed for handling large datasets, but I have never heard of any of them. If possible, could you point me to some MSA tools that are widely used for large datasets and have proven to be useful and reliable?

Thank you in advance for all the suggestions and explanations. I will appreciate all the responses. :)))

alignment genome sequence coronavirus • 2.8k views

updated 4.0 years ago by Mensur Dlakic ★ 28k • written 4.0 years ago by pfee418 ▴ 10

score 1 · Answer 1 · 2020-12-04

4.0 years ago

Mensur Dlakic ★ 28k

The short answer is no. There is no common alignment tool that is meant for aligning tens of thousands of nucleotide sequences that are ~30 kb long. That said, because of the unique demand for COVID-19 sequences, the MAFFT authors have an experimental service that might be what you want. I think it works only for very closely related sequences.

If these were protein sequences, Clustal-Omega could be worth trying as well.
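To put rough numbers on the scale involved, here is a minimal Python sketch. It assumes a plain FASTA input and assumes that an aligner needs some form of all-against-all comparison to build its guide tree; that is a simplification, since some tools and modes avoid a full distance matrix. The helper names `fasta_stats` and `pairwise_comparisons` are made up for illustration and are not part of any alignment tool.

```python
from io import StringIO


def fasta_stats(handle):
    """Return (number of sequences, total residues) for a FASTA stream."""
    n_seqs = 0
    total_bases = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_seqs += 1  # each header line starts a new record
        else:
            total_bases += len(line)  # sequence lines may be wrapped
    return n_seqs, total_bases


def pairwise_comparisons(n):
    """Number of unordered sequence pairs (n choose 2)."""
    return n * (n - 1) // 2


# Tiny made-up input, only to show the helpers running.
demo = StringIO(">a\nACGT\nAC\n>b\nGGTTAA\n")
n, bases = fasta_stats(demo)
print(n, bases, pairwise_comparisons(n))

# Figures from the question: 8,586 genomes of roughly 30 kb each.
print(pairwise_comparisons(8586))   # pairs in an all-against-all comparison
print(8586 * 30_000 / 1e6)          # approximate total size in megabases
```

With the figures from the question, the pair count is in the tens of millions and the total is about 258 megabases, which is consistent with the ~256 MB file. To run it on a real file, pass an open file handle to `fasta_stats` instead of the `StringIO` demo.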

Oh I see, thank you for the suggestions and explanations. Unfortunately, the special option provided by MAFFT only applies to SARS-CoV-2 genomes that are very closely related (~95% identity). My sequences contain other human coronaviruses as well, so this special option is not suitable for my data.

4.0 years ago by pfee418 ▴ 10