Dear Friends,
I am trying to perform phylogenetic analysis on an assembled genome. From a BLASTn search I got 250 genomes. I would like to generate a phylogenetic tree using these genomes and the assembled genome. However, all the programs I have tried so far (Mugsy, Clustal Omega, MAFFT) failed to produce an alignment for these genomes. The genomes are about 80,000 to 100,000 bases long. Can you please let me know how I can perform phylogenetic analysis using these genomes?
I have performed phylogeny analysis at gene level using the terminase subunit of these genomes, and got the results too, but I am interested in performing genome level phylogeny analysis. Thanks!
What if one of the genomes is assembled poorly, will that not influence the tree?
All validations have been done for the assembled genome and it is accurate. Regarding the other genomes from the BLASTn hits, I cannot say, but the sequence identity and query coverage are > 92% for most of the BLAST hits. Also, that is not my concern at the moment. Could you please let me know if you have any suggestions on the question I asked? Thanks!
Yes it is, otherwise the tree is useless. If the genomes are from the same organism you could do it based on the SNPs in the exons or ORFs. But that requires some work. For evolutionary distance, using marker genes like 16S, COI, ITS etc. is the easiest way, I guess.
Sounds like these are phages, so finding 'markers' like 16S will likely not be an option.
There will be some reasonably well-conserved proteins (potentially like the terminase OP mentioned -- I don't know), but it depends how wide you want to cast the net. Things get weird quick when you study viruses.
Ah, thanks for the heads up. Indeed, especially with phages and the replication rate. Would it be an option for the OP to find some genes that are present in all the genomes (so more than just the one terminase), paste them together as one sequence, and use that to make an alignment?
That's one option I think, yeah, but it would need some good literature backup, ideally with experimental proof that those genes are decently conserved and don't recombine too much. Even then, that probably only gets you out to the Family level if you're lucky. If you're comparing between phage families, life is really hard (mainly because phage families have historically been determined morphologically, rather than genetically/evolutionarily).
There are tools available to simply cluster all of the orthologues in the genome though, so that should give you maximal evolutionary signal, and you might as well use as many genes as possible rather than trying to cherry pick a few and risk making dodgy assumptions.
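To make the "use as many genes as possible" idea a bit more concrete, here is a rough Python sketch of picking out the shared single-copy clusters once a clustering tool has run. The (cluster_id, genome_id, gene_id) tuples are a hypothetical flattened format that I made up for illustration, not the output of any particular tool, and the min_fraction cut-off is likewise an assumption.

```python
from collections import defaultdict


def shared_clusters(assignments, genomes, min_fraction=1.0):
    """Return IDs of single-copy clusters found in enough genomes.

    assignments: iterable of (cluster_id, genome_id, gene_id) tuples
                 (hypothetical format, adapt to your clustering output).
    genomes: all genome IDs in the analysis.
    min_fraction: fraction of genomes that must carry the cluster
                  (1.0 means a strict core gene).
    """
    members = defaultdict(lambda: defaultdict(list))
    for cluster_id, genome_id, gene_id in assignments:
        members[cluster_id][genome_id].append(gene_id)

    wanted = set(genomes)
    keep = []
    for cluster_id, per_genome in members.items():
        # Skip clusters with paralogues: more than one gene in any genome.
        single_copy = all(len(genes) == 1 for genes in per_genome.values())
        present = len(set(per_genome) & wanted)
        if single_copy and present / len(wanted) >= min_fraction:
            keep.append(cluster_id)
    return sorted(keep)


if __name__ == "__main__":
    toy = [
        ("c1", "A", "a1"), ("c1", "B", "b1"),                      # shared, single copy
        ("c2", "A", "a2"),                                        # missing from B
        ("c3", "A", "a3"), ("c3", "B", "b3"), ("c3", "B", "b4"),  # duplicated in B
    ]
    print(shared_clusters(toy, ["A", "B"]))
```

With 250 phage genomes a strict core (min_fraction=1.0) may well come back empty, so lowering the threshold and accepting some missing data is probably the pragmatic choice.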
The trick is not to use whole genomes. It's impossible to fully align (at all, let alone accurately) that many genomes of even that comparatively small size.
I would suggest you look at using concatenated orthologue alignments (i.e., cluster all the genes, do multiple alignments with them, then concatenate the alignments and use that to calculate a tree), or use mash distances as a surrogate for sequence identity and draw a tree using those as your sequence metric. What you're currently trying to do is simply never going to work.
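For the concatenation step, a minimal Python sketch could look like the following. It assumes each gene has already been aligned separately (with MAFFT or similar) and that the FASTA headers are the genome IDs; both are assumptions on my part, and the function names are just for illustration. Genomes missing a gene are padded with gaps so every row of the supermatrix ends up the same length.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}, using the first word of each header."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def concatenate_alignments(alignments, genomes):
    """Join per-gene alignments column-wise into one supermatrix.

    alignments: list of {genome_id: aligned_sequence}, one dict per gene.
    genomes: genome IDs to include, in output order.
    """
    parts = {g: [] for g in genomes}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for g in genomes:
            # A genome lacking this gene gets an all-gap block.
            parts[g].append(aln.get(g, "-" * width))
    return {g: "".join(blocks) for g, blocks in parts.items()}


if __name__ == "__main__":
    gene1 = read_fasta(">A\nAC-G\n>B\nACTG\n")
    gene2 = read_fasta(">A\nTT\n")  # gene absent from genome B
    for genome, seq in concatenate_alignments([gene1, gene2], ["A", "B"]).items():
        print(f">{genome}\n{seq}")
```

The resulting FASTA can then go into whichever tree-building program you prefer.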
Thanks! I have performed a gene-level analysis (using the terminase subunit of the genomes) and generated a tree from that. Do you think that analysis is good enough to predict the family or order the assembled genome belongs to? I am working on phages.
Could you please guide me on how to do this? Thanks!
I would suggest that a tree from a single gene, in something as highly recombinant as a phage, is unlikely to be enough. You probably need to do this a couple of times with other genes. If there is good evidence in the literature that that terminase is a reliable marker gene (akin to a 'housekeeping' gene in bacteria) then you might be fine (but I don't know for your specific case).
The best thing to do would be to find several genes (ideally as many as possible) and then try to reconcile their trees from each gene alignment.
For using mash distances, it's pretty trivial, take a look at: https://github.com/lskatz/mashtree
Note, mash distances are not true evolutionary distances - they are more like an approximation. This would probably be sufficient to give you a reasonably accurate topology, given sufficient data, but you may not want to read too much into branch lengths etc.
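To show why these are an approximation, here is a small Python sketch of the distance Mash reports, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of the two k-mer sets. Note the simplifications: this computes the Jaccard index exactly from all k-mers rather than estimating it from a MinHash sketch, and it ignores reverse complements, so it will not reproduce the real tool's numbers. The function names are made up for illustration.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_like_distance(seq_a, seq_b, k=21):
    """Mash-style distance computed from an exact k-mer Jaccard index."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a or not b:
        raise ValueError("both sequences must be at least k bases long")
    j = len(a & b) / len(a | b)
    if j == 0:
        return 1.0  # no shared k-mers: report the maximum distance
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def distance_matrix(seqs, k=21):
    """Pairwise distances for {name: sequence}; returns (names, matrix)."""
    names = sorted(seqs)
    matrix = [[mash_like_distance(seqs[x], seqs[y], k) for y in names] for x in names]
    return names, matrix


if __name__ == "__main__":
    toy = {"g1": "ACGTACGTAC", "g2": "ACGTTCGTAC", "g3": "GGGGGGGGGG"}
    names, matrix = distance_matrix(toy, k=4)
    for name, row in zip(names, matrix):
        print(name, [round(d, 3) for d in row])
```

A matrix like this is what a distance-based method such as neighbor joining turns into a tree. The distance depends only on the fraction of shared k-mers, which is why the topology can be reasonable while the branch lengths should be treated with caution.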