I want to track the evolution of several domains, and to do so I need to align and cluster thousands of sequences. Is this possible, and what is the best software to use for it? Eventually I want to work out which sequence is the most "basal", which might lead me to the most ancient protein containing this domain.
Just as an example, the 10 biggest alignments in the Ensembl Families contain roughly 50000, 20000, 14000, 12000, 10000, 9200, 9100, 7800, 7500 and 6800 sequences, all aligned with mafft --auto.
Just as an example, the biggest Ensembl Families are aligned with mafft --auto, and they are big:
+----------+-----------+
| count(*) | family_id |
+----------+-----------+
|    54909 |         1 |
|    19735 |         2 |
|    14461 |         4 |
|    12625 |         5 |
|    10452 |         3 |
|     9223 |         6 |
|     9178 |        57 |
|     7842 |         9 |
|     7568 |         7 |
|     6810 |         8 |
+----------+-----------+
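
If you want to try the same approach on your own domain sequences, here is a minimal sketch of how a mafft --auto run could be scripted from Python. It assumes the `mafft` executable is installed and on your PATH; the file names, the thread count and the helper names `build_mafft_command` and `align` are hypothetical, chosen only for illustration. This is not how Ensembl runs its pipeline, just one way to call the same tool.

```python
import shutil
import subprocess


def build_mafft_command(fasta_path, threads=1):
    """Return the argument list for a MAFFT run in --auto mode.

    --auto lets MAFFT choose a strategy based on the size of the input;
    --thread sets the number of threads to use.
    """
    return ["mafft", "--auto", "--thread", str(threads), str(fasta_path)]


def align(fasta_path, out_path, threads=1):
    """Run MAFFT and write the aligned FASTA to out_path.

    MAFFT prints the alignment on stdout and progress on stderr,
    so stdout is redirected into the output file.
    """
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft executable not found on PATH")
    with open(out_path, "w") as out:
        subprocess.run(
            build_mafft_command(fasta_path, threads),
            stdout=out,
            check=True,  # raise if MAFFT exits with an error
        )


# Example call (hypothetical file names):
# align("domains.fasta", "domains.aln.fasta", threads=4)
```

For inputs in the tens of thousands of sequences, expect the run time and memory use to grow substantially, so it is worth testing on a subset first.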