Phylogenetic tree for very large numbers of bacterial assemblies
4.9 years ago
yairgatt ▴ 10

Hello all, I am currently trying to construct a phylogenetic tree for a large number of bacterial assemblies (hundreds to thousands) derived from the same bacterial species. I am quite willing to lose a lot of the data and to use only the 16S sequences or a small subset of genes. Unfortunately, it seems many phylogenetic and phylogenomic methods are not capable of handling such a large number of sequences. Does anyone know of a method that might be able to construct the tree I am looking for? We have a large cluster available and can spare up to several weeks for the construction.

Many thanks, Yair

Assembly alignment SNP phylogenetics • 1.8k views

Can you quantify "large" by providing some ballpark numbers? If you are only going to use a small subset of genes, then you could remove redundancy and use one representative sequence for each group of assemblies (if the sequence is identical). Programs like MAFFT should be able to handle a large number of sequences.
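As a rough illustration of the redundancy removal suggested here, the sketch below collapses assemblies whose sequence for a given gene is identical and keeps one representative per group. The function name `collapse_identical` and the toy assembly IDs are hypothetical; a real workflow would read the sequences from FASTA files first.

```python
from collections import defaultdict

def collapse_identical(seqs):
    """Group assembly IDs by identical sequence.

    seqs maps assembly ID -> gene sequence (hypothetical toy input).
    Returns {representative_id: [all IDs sharing that sequence]}.
    """
    groups = defaultdict(list)
    for asm_id, seq in seqs.items():
        # Compare case-insensitively so soft-masked copies still match.
        groups[seq.upper()].append(asm_id)
    # The first ID seen in each group serves as its representative.
    return {members[0]: members for members in groups.values()}

example = {
    "asm1": "ATGGCGTTA",
    "asm2": "atggcgtta",  # same sequence as asm1, different case
    "asm3": "ATGGCGTTT",
}
clusters = collapse_identical(example)
```

Only the representatives would then need to be aligned, and the group membership can be used afterwards to place the collapsed assemblies back on the tree.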


Thank you for your reply. Ideally, we are looking in the ballpark of 5,000 assemblies. If that is not possible we have a smaller set of about 1,500 assemblies. I really like the idea about removing the redundancy! But do you think there is any way to do it with a set of several dozen well-conserved genes that would ideally together not have much redundancy between the assemblies?


Since these are same species assemblies there should be plenty of redundancy. Have you done any preliminary exploration?


I haven't done preliminary exploration yet, as I haven't determined the subset of genes I will use to perform the analysis. It seems I might need to include a large number of genes for this analysis, since I understand that the redundancy could be a problem. 16S is definitely not possible at this resolution.


Yairgat,

As genomax said, you should remove redundant genomes. Use anipy or similar methods to reduce the dataset, then use bcgTree for the phylogenetic analysis.
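One simple way to reduce a dataset once pairwise genome distances are available is greedy dereplication: keep a genome as a new representative only if it is farther than some threshold from every representative chosen so far. The sketch below is a generic illustration, not the procedure of any tool named above; the `distance` callable, the `max_dist` threshold, and the toy values are all assumptions.

```python
def greedy_dereplicate(ids, distance, max_dist=0.001):
    """Pick representatives so each genome lies within max_dist of one.

    distance is any callable returning a pairwise distance
    (for example 1 - ANI); it is a placeholder here.
    Returns (representatives, {genome_id: representative_id}).
    """
    representatives = []
    assignment = {}
    for genome in ids:
        for rep in representatives:
            if distance(genome, rep) <= max_dist:
                assignment[genome] = rep
                break
        else:
            # No existing representative is close enough: start a new group.
            representatives.append(genome)
            assignment[genome] = genome
    return representatives, assignment

# Toy distances for three hypothetical genomes.
toy = {
    frozenset(("a", "b")): 0.0005,
    frozenset(("a", "c")): 0.02,
    frozenset(("b", "c")): 0.02,
}
reps, assigned = greedy_dereplicate(
    ["a", "b", "c"], lambda x, y: toy[frozenset((x, y))]
)
```

The result depends on input order, so sorting genomes by assembly quality first (best assemblies become representatives) is a common design choice.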


Many thanks, I was not familiar with these methods!

4.9 years ago
Mensur Dlakic ★ 28k

ezTree will do what you want, though I do question the information one could get from building trees for thousands of closely related assemblies. It is very likely that many of them will be identical, so removing redundancy should help. Even those that are non-identical will be > 99% identical, and at that point trees are unlikely to provide fine enough resolution to meaningfully separate your (sub)species.


Thank you for the helpful comment! I am hoping to use the constructed tree as a null hypothesis of sorts to compare to the phylogenetic profiles of several sequences. Do you think that using a core set of conserved genes will not have enough resolution to clearly separate the different strains to a few clades? Is there any other way I could construct a phylogenetic tree to sufficiently separate such close strains? I am afraid using programs like kSNP would not be possible with this number of assemblies.


Do you think that using a core set of conserved genes will not have enough resolution to clearly separate the different strains to a few clades?

Several things should be pointed out here. I assume that with that many assemblies it is unlikely that they will all be complete. That is both good and bad. If they were all complete, you would have thousands of shared genes to concatenate, which would make tree construction very difficult. On the other hand, the presence or absence of "core" proteins in various strains will likely be determined by randomness or sequencing rather than by their true conservation across different strains. Assuming that is the case, ezTree might end up with a random collection of "core" proteins that may or may not be informative.

I don't know enough about the setup of your experiment and the differences between strains to make an educated guess as to whether you will be able to clearly separate strains. Regardless, if the strains are closely related, that tree in my estimation can't be anything more than a convenient way to catalog your strains. By the way, have you ever looked at a tree with thousands of branches? I've done my share of looking at hundreds of branches, and can't imagine that I would ever want to sift through thousands of branches.

Lastly, these two programs may give you some indication of the distance between your strains, and they will be much faster than tree building.

https://github.com/marbl/Mash

https://github.com/dib-lab/sourmash
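To give a feel for the kind of distance these tools report, the sketch below converts a k-mer Jaccard index j into the Mash distance, D = -(1/k) ln(2j / (1 + j)). It is a simplification under stated assumptions: it uses exact k-mer sets on toy strings, whereas the real tools estimate j from MinHash sketches and use canonical k-mers. The function names are hypothetical.

```python
import math

def kmers(seq, k):
    """Return the set of all k-mers in seq (no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_style_distance(seq_a, seq_b, k=21):
    """Mash distance formula applied to an exact k-mer Jaccard index."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("sequences are shorter than k")
    j = len(a & b) / len(union)
    if j == 0:
        return 1.0  # no shared k-mers: distance is capped at 1
    # max() guards against returning -0.0 when j == 1.
    return max(0.0, -math.log(2 * j / (1 + j)) / k)

d_same = mash_style_distance("ACGTACGTAC", "ACGTACGTAC", k=4)
d_near = mash_style_distance("ACGTACGTAC", "ACGTACGTTC", k=4)
d_far = mash_style_distance("AAAAAA", "CCCCCC", k=3)
```

Identical inputs give a distance of 0, fully disjoint inputs give 1, and a single substitution lands in between.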


Regarding the experimental bias, it is definitely true: the difference in quality between assemblies is tremendous. It is possible that using any "core" genes will require filtering for only those assemblies that have a contig with a hit covering the complete length of the gene, or something along those lines.
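A filter of that kind could be sketched as follows, assuming the search hits have already been parsed into (assembly, gene, aligned length) tuples. The record layout, the 95% coverage cutoff, and the gene names and lengths are illustrative assumptions, not values from this thread.

```python
def assemblies_with_full_hits(hits, gene_lengths, min_cov=0.95):
    """Return assemblies with a near-full-length hit for every gene.

    hits: iterable of (assembly_id, gene, aligned_length) tuples.
    gene_lengths: {gene: reference length}; all listed genes are required.
    """
    covered = {}
    for asm, gene, aln_len in hits:
        if aln_len / gene_lengths[gene] >= min_cov:
            covered.setdefault(asm, set()).add(gene)
    required = set(gene_lengths)
    # Keep only assemblies whose covered genes include every required gene.
    return sorted(asm for asm, genes in covered.items() if genes >= required)

gene_lengths = {"rpoB": 4000, "gyrA": 2600}  # toy lengths
hits = [
    ("asm1", "rpoB", 4000),
    ("asm1", "gyrA", 2590),
    ("asm2", "rpoB", 2000),  # fragmented hit, only 50% of the gene
    ("asm2", "gyrA", 2600),
]
kept = assemblies_with_full_hits(hits, gene_lengths)
```

Here `asm2` is dropped because its `rpoB` hit covers only half of the gene.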

Regarding the visual inspection, it won't be necessary, since we are hoping to utilize the resulting tree in a computational pipeline regarding conservation patterns.

Regarding Mash, I love it and it was also my first choice (creating a distance matrix using Mash and using it with something like kSNP). Unfortunately, judging from my previous tests, running 5000*5000 Mash distances would take too long.
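For scale: a distance matrix is symmetric, so only n(n-1)/2 unique pairs need computing rather than n*n. The sketch below counts those pairs and splits them into batches that could be spread across cluster jobs. The helper names and the batch size are hypothetical, and whether this makes the run time acceptable depends on the per-pair cost.

```python
from itertools import combinations, islice

def n_unique_pairs(n):
    """Number of unordered pairs among n assemblies."""
    return n * (n - 1) // 2

def pair_batches(ids, batch_size):
    """Yield lists of up to batch_size unordered ID pairs."""
    pairs = combinations(ids, 2)
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch

full_set = n_unique_pairs(5000)    # 12,497,500 pairs
small_set = n_unique_pairs(1500)   # 1,124,250 pairs
batch_sizes = [len(b) for b in pair_batches(["a", "b", "c", "d"], 4)]
```

Each batch is independent of the others, so the batches can be run as separate jobs and their results merged into one matrix afterwards.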

Thanks again for the helpful comments!
