I'm looking for the most suitable pipeline to perform a whole-genome multiple sequence alignment (MSA) of around 1000 vertebrate species. The goal is to identify conserved elements (and yes, I need this number of species).
As far as I know, the largest publicly available whole-genome MSA of vertebrates covers 100 species (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=multiz100way).
I'm new to MSA, so I would like to know which pipeline would be the most efficient in my case and how much time and computational resources I may need (I have a Slurm cluster, so parallelization is preferred).
Would it be OK to perform MSA in batches of 100 sequences and then concatenate the results somehow?
Check out Cactus (Progressive Cactus, a multiple-genome aligner built for this scale): https://www.nature.com/articles/s41586-020-2871-y
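To give an idea of what the input looks like: as far as I understand, Cactus takes a plain-text "seqFile" with a Newick guide tree on the first line, followed by one `name path` line per genome. Below is a minimal sketch of generating such a file for many species. The helper `build_seqfile`, the species names, the tree, and the FASTA paths are all made up for illustration, so please check the Cactus documentation for the exact format before relying on it.

```python
def build_seqfile(newick_tree, genome_paths):
    """Return seqFile text: guide tree first, then one 'name path' line per genome.

    newick_tree  -- guide tree as a Newick string (hypothetical example below)
    genome_paths -- dict mapping genome name -> path to its FASTA file
    """
    lines = [newick_tree.strip()]
    for name, path in genome_paths.items():
        # Names are whitespace-separated from paths, so they must not contain spaces.
        if any(ch.isspace() for ch in name):
            raise ValueError(f"genome name contains whitespace: {name!r}")
        # Each genome listed here should also appear as a leaf in the tree.
        if name not in newick_tree:
            raise ValueError(f"genome {name!r} is missing from the guide tree")
        lines.append(f"{name} {path}")
    return "\n".join(lines) + "\n"


# Hypothetical three-species example; a real run would list all ~1000 genomes.
example_tree = "((human:0.1,chimp:0.1):0.2,mouse:0.3);"
example_paths = {
    "human": "/data/genomes/human.fa",
    "chimp": "/data/genomes/chimp.fa",
    "mouse": "/data/genomes/mouse.fa",
}

if __name__ == "__main__":
    with open("vertebrates.seqfile.txt", "w") as handle:
        handle.write(build_seqfile(example_tree, example_paths))
```

The resulting file is then passed to Cactus roughly as `cactus <jobStore> vertebrates.seqfile.txt <output.hal>`. Cactus runs on Toil, which can submit jobs to Slurm, but the exact options depend on the version you install. Because the alignment is progressive along the guide tree, it is split into subproblems for you, so there should be no need to align batches of 100 by hand.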
Thanks a lot! That's what I needed!