Hello, everyone!
I want to build a phylogeny of the serotypes of a virus. I retrieved the sequences from NCBI, and this is what I have:
S1: 200 sequences
S2: 87 sequences
S3: 549 sequences
S4: 8 sequences
S5: 17 sequences
Should I align all these sequences to create a tree, or should I choose one representative of each serotype?
Moreover, I chose another virus of the same family to be the outgroup. Should I align it together with the other sequences?
Would you say that using the same number of sequences for each serotype is important?
Not necessarily. It all depends on the complexity within a given serotype. If the serotype is very conserved for the genes you're looking at, then you can get away with fewer within any given predicted clade, since it's more or less certain they'll all group together.
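One rough way to judge how conserved a serotype is: compute the mean pairwise p-distance (the proportion of differing sites) across its sequences. The sketch below is only an illustration and assumes the sequences are already aligned to the same length; the function names `p_distance` and `mean_pairwise_distance` are made up for this example, and columns containing a gap or an N are simply skipped.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Assumption: columns where either sequence has a gap ('-') or 'N'
    are ignored rather than counted as differences.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def mean_pairwise_distance(seqs):
    """Average p-distance over all pairs of sequences in one serotype."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)
```

A value close to zero suggests the serotype is highly conserved for that gene, so fewer sequences should be enough to represent it. A dedicated tool with a proper substitution model will give better distance estimates than this raw proportion.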
Your tree might look better with a roughly equal number between serotypes, but it's easy to collapse nodes and clusters after the fact and do all the aesthetic tweaks in whatever tool you use to draw the tree. It's much more difficult to go back and add data in.
The main point is that you need sufficient numbers of sequences within any cluster you expect to see to be confident that it's a real cluster.
I would perhaps aim for ~15 sequences per clade from your dataset. If the sequences themselves aren't very long, they won't take too long to align. You will want to look at the sequence diversity within any given clade first, though: you'll need fewer sequences in a conserved clade to truly represent its diversity, so you can perhaps scale the number of sequences you use accordingly.
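As a minimal sketch of that subsampling step, something like the following would pick up to 15 sequences per serotype and keep everything when a serotype has fewer than that. The function name `subsample_by_group` and the `(sequence_id, serotype)` input format are assumptions for illustration only. It samples uniformly at random, which is a simplification: it does not scale the count by within-clade diversity.

```python
import random


def subsample_by_group(records, per_group=15, seed=0):
    """Pick up to `per_group` sequence IDs at random from each serotype.

    records: iterable of (sequence_id, serotype) pairs (hypothetical format).
    Returns a dict mapping serotype -> sorted list of chosen IDs.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    # Group the sequence IDs by serotype.
    groups = {}
    for seq_id, serotype in records:
        groups.setdefault(serotype, []).append(seq_id)

    # Sample without replacement; small groups are kept whole.
    chosen = {}
    for serotype, ids in groups.items():
        k = min(per_group, len(ids))
        chosen[serotype] = sorted(rng.sample(ids, k))
    return chosen


# Placeholder IDs matching the counts in the question.
counts = {"S1": 200, "S2": 87, "S3": 549, "S4": 8, "S5": 17}
records = [(f"{s}_{i}", s) for s, n in counts.items() for i in range(n)]
picked = subsample_by_group(records)
```

With the counts from the question this keeps 15 sequences each for S1, S2, S3 and S5, and all 8 for S4. The chosen IDs can then be used to filter the FASTA file before aligning.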