I want to use MAFFT to align a large protein dataset containing over 100,000 sequences with an average length of 1000 residues. I guess I need to use a supercomputer.
Does anyone know how many CPU cores and how much memory are needed to run the alignment smoothly?
Can MAFFT itself estimate how long an alignment will take? And how long would the alignment above take, given enough computational resources?
Have you tried any configuration? Say, 32 GB RAM + 12 cores? How quickly you see results should tell you whether you need more resources. You could also start with a wall time of 48-72 hours and tweak it from there.
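One way to act on this suggestion without committing a full 48-72 hour job is to time the aligner on small random subsets first. Below is a minimal sketch, assuming the input is plain FASTA text; the helper names `read_fasta`, `subsample`, and `to_fasta` are hypothetical and not part of MAFFT or any library.

```python
import random

def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def subsample(records, k, seed=0):
    """Return up to k records drawn at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the pilot set is reproducible
    return rng.sample(records, min(k, len(records)))

def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Writing out pilot sets of, say, 1,000, 2,000, and 4,000 sequences and timing the aligner on each gives a rough feel for how runtime and memory grow. Runtime typically grows faster than linearly with the number of sequences, so any extrapolation to 100,000 should be treated as a loose guess.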
Out of curiosity: how did you get a dataset of 100,000 proteins?
Does it need to be done with MAFFT, or is any other aligner also fine?
Any is fine. And what if we assume the similarity is very high, so the alignment should not be too hard? Can I do it on my notebook with 16 GB of memory, given that my virtual Linux system has only 8 GB?
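A back-of-envelope calculation can help frame the memory question. The sketch below is not a model of how MAFFT or any specific aligner uses memory; it only assumes, for illustration, that a method stores one 4-byte distance for every pair of sequences. The function name `scale_estimate` and the `bytes_per_distance` parameter are hypothetical.

```python
def scale_estimate(n_seqs, mean_len, bytes_per_distance=4):
    """Rough size figures for a dataset of n_seqs sequences.

    Assumption: a full pairwise distance matrix is stored with
    bytes_per_distance bytes per pair (upper triangle only).
    """
    total_residues = n_seqs * mean_len
    n_pairs = n_seqs * (n_seqs - 1) // 2
    matrix_bytes = n_pairs * bytes_per_distance
    return total_residues, n_pairs, matrix_bytes

residues, pairs, matrix_bytes = scale_estimate(100_000, 1000)
print(f"residues in input: {residues:,}")
print(f"sequence pairs:    {pairs:,}")
print(f"distance matrix:   {matrix_bytes / 1e9:.1f} GB")
```

Under these assumptions the input itself is only about 100 million residues, but an all-against-all distance matrix would need roughly 20 GB, which already exceeds both 8 GB and 16 GB. Whether a laptop is enough therefore depends largely on whether the chosen aligner and settings avoid computing every pairwise distance.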