Dear members,
I want to align all the SARS-CoV-2 complete genome sequences that are available to date (444,424 in NCBI Virus). I have tried command-line MUSCLE, but it is unable to handle such a large number of sequences. I would appreciate it if anyone could let us know how to align this many sequences.
A large number of those assemblies are identical to one another or are subsequences of longer assemblies. Quite a few also include degenerate bases, which you would have to deal with somehow. My suggestion is that you first remove all redundancy from the data set by clustering.
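To illustrate the redundancy removal suggested above, here is a minimal Python sketch that collapses exact duplicates and then drops sequences that are exact substrings of a longer one. The function names (`parse_fasta`, `collapse_identical`, `drop_contained`) are made up for this example, and degenerate bases are compared literally, so this is only a rough first pass. The containment step is quadratic in the number of unique sequences, so for the full data set a dedicated clustering tool (for example CD-HIT or MMseqs2) would likely be the more realistic choice.

```python
from collections import defaultdict


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Upper-case so that 'acgt' and 'ACGT' compare as equal.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(records):
    """Group record IDs by exact sequence: {sequence: [ids]}."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[seq].append(header)
    return dict(groups)


def drop_contained(groups):
    """Keep only sequences that are not exact substrings of a longer kept one.

    Quadratic in the number of unique sequences; meant as an illustration.
    IDs of the dropped sequences are discarded here.
    """
    kept = []
    for seq in sorted(groups, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return {seq: groups[seq] for seq in kept}


# Toy input; with real data you would pass an open file handle instead.
toy_fasta = [">a", "ACGTACGT", ">b", "acgtacgt", ">c", "GTAC", ">d", "TTTT"]
unique = collapse_identical(parse_fasta(toy_fasta))
representatives = drop_contained(unique)
print(len(representatives), "representative sequences")
```

Keeping the list of IDs per unique sequence means the alignment of the representatives can later be expanded back to every original record.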
In addition, there may be a poly-A tail at the end, and its length may vary; for many genomes the sequences end like this:
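One way to handle such tails before clustering or aligning is to trim the trailing run of A's. The sketch below is only an illustration: `trim_poly_a` and its `min_run` threshold are hypothetical names and values chosen for this example, and the function cannot tell genuine terminal A's from an artificial tail, so the threshold would need to be chosen with care.

```python
def trim_poly_a(seq, min_run=5):
    """Remove a trailing run of A's if it is at least min_run bases long.

    min_run is an illustrative threshold, not a recommended value.
    """
    stripped = seq.rstrip("Aa")
    run_length = len(seq) - len(stripped)
    return stripped if run_length >= min_run else seq


# A long trailing run is removed; a short one is left untouched.
print(trim_poly_a("ACGTAAAAAAAA"))
print(trim_poly_a("ACGTAA"))
```

Trimming before the redundancy step means that genomes differing only in tail length collapse into one record.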
Thank you for your reply. We intend to do that, but only after the alignment process is over.
Use a high BLOSUM matrix for the alignment.
Thank you for the suggestion. We will certainly try a high BLOSUM matrix.