Question

Automatically Generating Robust Phylogenetic Trees

1


11.3 years ago

pld 5.1k

I am interested in automating the generation of phylogenetic trees from subsets of orthologous genes from a large ortholog dataset. The orthologs were generated to annotate a transcriptome, so everything is cDNA.

I would like for the trees/MSA to be robust enough to be publishable, but I would like to avoid "overengineering". That is to say, I realize that the strongest trees would come from something like running MrBayes after the appropriate substitution model selection, but that seems like something non-trivial to automate and is computationally expensive. I'm also not sure if it is "overkill" for generating trees from cDNA transcript sequence.

On the other hand, something like ClustalW or ClustalOmega is trivial to automate and requires fewer CPU hours than the MrBayes pipeline.

Does anyone have experience in automating phylogenetic tree generation? Is something like ClustalW/Omega sufficiently powerful for what I'm interested in, or is it worth going the extra mile and dealing with the added complexity and CPU hours MrBayes would require?

alignment phylogenetics phylogeny • 3.4k views

ADD COMMENT • link updated 3.9 years ago by Ram 45k • written 11.3 years ago by pld 5.1k

Ram · Answer 1 · 2014-04-21

1


11.3 years ago

Vitis ★ 2.6k

I've faced a similar problem: building gene trees in batches for CDS from a set of species. I think I wrapped everything in Perl: pulling the CDS together, translating them into protein sequences, aligning those with ClustalW/T-Coffee, converting the alignment back to nucleotide sequences, then calling PhyML through a Perl wrapper. The automation took away two good practices in phylogeny reconstruction: I couldn't inspect the alignments to check the homology of sites (although aligning proteins and converting back to nucleotides partially covered this), and I couldn't tune the site evolution model for each gene/gene family. In the end, it worked pretty well, but I should add the disclaimer that I worked within a single plant family, so tuning the tree-building parameters may not have been as important as it would be over greater evolutionary distances, where I would expect much greater rate heterogeneity across genomes. Finally, to save time and computational power, I didn't do bootstrap analysis for each gene/gene family, although that is possible with the PhyML wrapper.
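The "align as protein, convert back to nucleotides" step can be sketched in a few lines of Python. This is only an illustration of the idea, not the Perl wrapper described above: the function name `back_translate` and the dict-based input format are assumptions, and the protein alignment is expected to come from whichever aligner you already run.

```python
def back_translate(protein_aln, cds):
    """Thread unaligned CDS onto an aligned protein alignment (codon alignment).

    protein_aln: dict of name -> aligned protein string, gaps written as '-'
    cds:         dict of name -> unaligned coding sequence for the same names
    (Both input formats are assumptions made for this sketch.)
    """
    codon_aln = {}
    for name, aligned in protein_aln.items():
        seq = cds[name]
        n_residues = sum(1 for c in aligned if c != "-")
        # Tolerate a trailing stop codon that has no residue in the protein.
        if len(seq) == 3 * (n_residues + 1):
            seq = seq[:-3]
        if len(seq) != 3 * n_residues:
            raise ValueError(f"{name}: CDS length does not match protein length")
        out, pos = [], 0
        for c in aligned:
            if c == "-":
                out.append("---")           # one protein gap = one codon gap
            else:
                out.append(seq[pos:pos + 3])
                pos += 3
        codon_aln[name] = "".join(out)
    return codon_aln
```

The length check is worth keeping in an automated run: a frameshifted or partial transcript fails loudly instead of silently producing a misaligned codon alignment.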

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by Vitis ★ 2.6k

0


I am working with rather large distances: my species are spread out among the three major mammalian super families. Most are from Laurasiatheria, which is a hot topic in systematics. Would this impact anything?

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by pld 5.1k

0


Not sure about this. Maybe you could do a small experiment: take 10 to 20 ortholog groups and, for each, run a Clustal alignment with manual inspection, substitution model selection and MrBayes, then run Clustal and PhyML in a wrapped script, and see how different the end results are. It's always nice to start with a small set of genes before automating for the entire transcriptome.
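One way to put a number on "how different" the two sets of trees are is the Robinson-Foulds distance, which counts the splits found in one tree but not the other. The sketch below is a minimal pure-Python version, assuming plain Newick strings with unquoted taxon names and identical taxon sets in both trees; the function names are made up for this example, and a dedicated tree library would be the more robust choice in practice.

```python
import re

def clades(newick):
    """Return (taxa, clades): all leaf names and the leaf set under each '(...)'."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, found, taxa, prev = [], set(), set(), None
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            continue
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.add(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in ",;" and prev != ")":
            # A leaf: drop any ':branch_length' suffix.
            name = tok.split(":")[0].strip()
            taxa.add(name)
            stack[-1].add(name)
        # Tokens right after ')' are support values / branch lengths: ignored.
        prev = tok
    return taxa, found

def splits(newick):
    """Non-trivial bipartitions, each stored as the side without a reference taxon."""
    taxa, found = clades(newick)
    ref = min(taxa)
    result = set()
    for clade in found:
        side = clade if ref not in clade else frozenset(taxa - clade)
        if 2 <= len(side) <= len(taxa) - 2:
            result.add(side)
    return taxa, result

def rf_distance(newick_a, newick_b):
    """Unrooted Robinson-Foulds distance between two trees on the same taxa."""
    taxa_a, splits_a = splits(newick_a)
    taxa_b, splits_b = splits(newick_b)
    if taxa_a != taxa_b:
        raise ValueError("trees must contain the same taxa")
    return len(splits_a ^ splits_b)
```

A distance of 0 for most ortholog groups in the pilot set would suggest the quick pipeline recovers the same topologies as the slow one; consistently large distances would argue for the more careful approach.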

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by Vitis ★ 2.6k

0


It will probably be a more targeted approach than doing this for the whole transcriptome; however, it will be enough genes that doing it manually each time would be prohibitive. Following up on your latest comment, do you think it would be worth just going with MrBayes?

What exactly am I looking for during manual inspection? I know gap removal is one of the things, but can't I just remove those with a gap removal tool/script?
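For reference, the kind of gap removal script meant here can be very small. The following is a rough Python sketch, assuming the alignment is already loaded as a dict of equal-length strings; the function name and the `max_gap_fraction` parameter are illustrative, not taken from any existing tool.

```python
def strip_gappy_columns(aln, max_gap_fraction=0.0):
    """Drop alignment columns whose fraction of '-' exceeds max_gap_fraction.

    aln: dict of name -> aligned sequence (all the same length).
    With the default of 0.0, any column containing a gap is removed.
    """
    names = list(aln)
    if not names:
        return {}
    length = len(aln[names[0]])
    if any(len(aln[n]) != length for n in names):
        raise ValueError("sequences are not all the same length")
    keep = [
        i for i in range(length)
        if sum(aln[n][i] == "-" for n in names) / len(names) <= max_gap_fraction
    ]
    return {n: "".join(aln[n][i] for i in keep) for n in names}
```

Note that this works column by column; for a codon alignment you would probably want to drop whole codons instead so the reading frame is preserved.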

ADD REPLY • link 11.3 years ago by pld 5.1k

0


When inspecting the alignments you may make some 'informed decisions' to remove regions where homology is poor or hard to tell, and/or make small adjustments to assign the correct homology, all of which is purely subjective. So I was never a big fan of manual inspection of multiple sequence alignments. I think an automated gap remover would do just as well, because at least it's not biased. As for MrBayes, you still need to choose the best site substitution model for each locus before running it (is there any good way to automate that, maybe a shell pipe?). And I don't think there is a Perl wrapper for MrBayes, but I think it's fairly easy to script with shell pipes.
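On automating the model choice: once some external program has reported a log-likelihood for each candidate model on a locus, picking the model is just arithmetic, for example with the Akaike information criterion, AIC = 2k - 2 lnL, where k is the number of free parameters. A minimal sketch follows; the function name, the input format and the numbers in the usage lines are all made up for illustration, and the likelihoods themselves would have to come from whatever model-testing tool you run.

```python
def best_model_by_aic(fits):
    """Pick the model with the lowest AIC.

    fits: dict of model name -> (log_likelihood, n_free_parameters).
    Returns (best_model_name, dict of model name -> AIC).
    """
    if not fits:
        raise ValueError("no model fits supplied")
    aic = {model: 2 * k - 2 * lnl for model, (lnl, k) in fits.items()}
    best = min(aic, key=aic.get)
    return best, aic

# Hypothetical log-likelihoods for one locus; the parameter counts cover the
# substitution model only (branch lengths are shared by all three models).
example = {
    "JC69": (-1250.0, 0),
    "HKY85": (-1200.0, 4),
    "GTR": (-1199.0, 8),
}
best, scores = best_model_by_aic(example)
print(best, scores[best])
```

Looping this over every locus gives a per-gene model table that can then be written into each MrBayes input file by the same script.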

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by Vitis ★ 2.6k

0


A more interesting problem, although probably not directly related to your initial question, might be how you deal with the conflicting phylogenetic signal from the large number of ortholog groups. And I'm sure there will be interesting patterns when comparing reconciled trees made from individual loci with trees made from a concatenated multi-locus dataset.
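Building the concatenated multi-locus dataset is easy to script. Here is a small Python sketch, assuming each locus alignment is a dict of name -> aligned sequence and that a taxon missing from a locus should be padded with gaps (some programs expect '?' instead); the function name and return format are illustrative only.

```python
def concatenate(alignments):
    """Join per-locus alignments into one supermatrix.

    alignments: list of dicts, each mapping taxon name -> aligned sequence.
    Returns (supermatrix, boundaries) where boundaries holds the 1-based
    (start, end) column range of each locus, useful for partition files.
    """
    taxa = sorted({name for aln in alignments for name in aln})
    parts = {t: [] for t in taxa}
    boundaries = []
    start = 1
    for aln in alignments:
        lengths = {len(s) for s in aln.values()}
        if len(lengths) != 1:
            raise ValueError("each locus needs sequences of one common length")
        length = lengths.pop()
        for t in taxa:
            # Taxa absent from this locus are filled with gap characters.
            parts[t].append(aln.get(t, "-" * length))
        boundaries.append((start, start + length - 1))
        start += length
    return {t: "".join(p) for t, p in parts.items()}, boundaries
```

Keeping the locus boundaries makes it possible to assign a separate substitution model to each partition later, rather than forcing one model on the whole matrix.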

ADD REPLY • link 11.3 years ago by Vitis ★ 2.6k

0


This is part of the plan. I wanted to take a set of orthologous groups that all share the same GO term or terms (or some other annotation) and estimate a phylogeny from these. As for dealing with conflicts, I have no idea.

ADD REPLY • link 11.3 years ago by pld 5.1k