Hello all, I am currently trying to construct a phylogenetic tree for a large number of bacterial assemblies (hundreds to thousands) derived from the same bacterial species. I am quite willing to lose a lot of the data and to use only the 16S sequences or a small subset of genes. Unfortunately, it seems many phylogenetic and phylogenomic methods cannot handle such a large number of sequences. Does anyone know of a method that might be able to construct the tree I am looking for? We have a large cluster available and can spare up to several weeks for the construction.
Many thanks, Yair
Can you quantify "large" by providing some ballpark numbers? If you are only going to use a small subset of genes, you could remove redundancy and use one representative sequence for each group of assemblies whose sequences are identical. Programs like MAFFT should be able to handle a large number of sequences.
Thank you for your reply. Ideally, we are looking at around 5,000 assemblies. If that is not possible, we have a smaller set of about 1,500 assemblies. I really like the idea of removing the redundancy! But do you think there is any way to do it with a set of several dozen well-conserved genes that, taken together, would ideally not have much redundancy between the assemblies?
Since these assemblies are all from the same species, there should be plenty of redundancy. Have you done any preliminary exploration?
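For what it's worth, a first pass at that exploration could look something like the minimal Python sketch below. It assumes you have already extracted one copy of a given gene per assembly into a single FASTA file, with the assembly ID as the record header; the names `read_fasta`, `group_identical` and the toy `asm1`..`asm3` records are made up for illustration. It simply groups assemblies whose sequences are exactly identical and picks one representative per group.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse FASTA-formatted text into a dict {record_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Upper-case so that soft-masked sequences compare equal.
            records[name].append(line.upper())
    return {rec_id: "".join(parts) for rec_id, parts in records.items()}


def group_identical(records):
    """Group record IDs that share an exactly identical sequence.

    Returns {representative_id: [all member IDs]}, where the
    representative is the alphabetically first member of each group.
    """
    by_sequence = defaultdict(list)
    for rec_id, seq in records.items():
        by_sequence[seq].append(rec_id)
    return {sorted(ids)[0]: sorted(ids) for ids in by_sequence.values()}


# Toy input (hypothetical assembly IDs): asm1 and asm2 carry the same gene.
example = """>asm1
ACGTACGT
>asm2
acgtacgt
>asm3
ACGTACGA
"""

groups = group_identical(read_fasta(example))
for representative, members in sorted(groups.items()):
    print(representative, len(members), ",".join(members))
```

If the number of groups is much smaller than the number of assemblies, you only need to align and build the tree on the representatives. For a set of several genes, one option would be to concatenate the per-assembly sequences in a fixed gene order before grouping, so that only assemblies identical across all chosen genes are collapsed.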
I haven't done preliminary exploration yet, as I haven't determined the subset of genes I will use to perform the analysis. It seems I might need to include a large number of genes for this analysis, since I understand that the redundancy could be a problem. 16S is definitely not possible at this resolution.
Yairgat,
As genomax said, you should remove redundant genomes. Use pyani or a similar method to reduce the dataset, then use bcgTree for the phylogenetic analysis.
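To illustrate the reduction step, here is a rough sketch of greedy dereplication in Python. It is not the interface of any particular ANI tool: it assumes you have already computed pairwise identities with whatever tool you choose and loaded them into a plain dict keyed by genome pairs. The function name `dereplicate`, the genome IDs and the 0.999 threshold are all placeholders; the right cutoff depends on how much within-species resolution you want to keep.

```python
def dereplicate(names, identity, threshold=0.999):
    """Greedy dereplication from precomputed pairwise identities.

    names     -- genome IDs, in the order they should be considered
    identity  -- dict {(id_a, id_b): fraction identity}; a pair may be
                 stored in either order, missing pairs count as 0.0
    threshold -- illustrative cutoff; genomes at or above it are merged

    Returns (representatives, assignment), where assignment maps every
    genome to the representative it was collapsed into.
    """
    representatives = []
    assignment = {}
    for name in names:
        for rep in representatives:
            # Look the pair up in whichever order it was stored.
            pair = (name, rep) if (name, rep) in identity else (rep, name)
            if identity.get(pair, 0.0) >= threshold:
                assignment[name] = rep
                break
        else:
            # No existing representative is close enough: keep this genome.
            representatives.append(name)
            assignment[name] = name
    return representatives, assignment


# Toy input (hypothetical values): g1 and g2 are near-identical.
names = ["g1", "g2", "g3"]
identity = {("g1", "g2"): 0.9995, ("g1", "g3"): 0.98, ("g2", "g3"): 0.981}

reps, assignment = dereplicate(names, identity)
print(reps)
print(assignment)
```

The result depends on the input order, so it can help to sort `names` by assembly quality first so that the best assembly in each cluster becomes the representative. Only the representatives then need to go into the phylogenetic analysis.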
Many thanks, I was not familiar with these methods!