I am interested in automating the generation of phylogenetic trees from subsets of orthologous genes from a large ortholog dataset. The orthologs were generated to annotate a transcriptome, so everything is cDNA.
I would like for the trees/MSA to be robust enough to be publishable, but I would like to avoid "overengineering". That is to say, I realize that the strongest trees would come from something like running MrBayes after the appropriate substitution model selection, but that seems like something non-trivial to automate and is computationally expensive. I'm also not sure if it is "overkill" for generating trees from cDNA transcript sequence.
On the other hand, something like ClustalW or Clustal Omega is trivial to automate and requires far fewer CPU hours than a MrBayes pipeline.
Does anyone have experience in automating phylogenetic tree generation? Is something like ClustalW/Omega sufficiently powerful for what I'm interested in, or is it worth going the extra mile and dealing with the added complexity and CPU hours MrBayes would require?
I am working with rather large evolutionary distances: my species are spread out among the three major mammalian superorders. Most are from Laurasiatheria, which is a hot topic in systematics. Would this impact anything?
I'm not sure about this. Maybe you could do a small experiment: take 10-20 ortholog groups and run them through both pipelines, (a) Clustal alignment with manual inspection, substitution model selection, and MrBayes, versus (b) Clustal and PhyML in a wrapper script, to see how different the end results are. It's always nice to start with a small set of genes before automating for the entire transcriptome.
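One way to put a number on "how different" the two pipelines are is the Robinson-Foulds distance between the trees they produce for the same ortholog group. The following is only a minimal sketch in plain Python, assuming both trees are Newick strings over the same set of taxa; the function names are made up for illustration, and a dedicated phylogenetics library would handle edge cases (quoted labels, comments) that this ignores.

```python
import re


def parse_newick(text):
    """Parse a Newick string into nested tuples of leaf names.

    Branch lengths and internal labels/support values are discarded.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    special = {"(", ")", ",", ";"}
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while tokens[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # consume ")"
            # Skip an optional internal label and/or branch length.
            if pos < len(tokens) and tokens[pos] not in special:
                pos += 1
            return tuple(children)
        label = tokens[pos]
        pos += 1
        return label.split(":")[0].strip()

    return node()


def leaf_set(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def bipartitions(tree):
    """Return the non-trivial splits of the taxa induced by internal branches."""
    everything = leaf_set(tree)
    splits = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        stack.extend(node)
        clade = leaf_set(node)
        rest = everything - clade
        if len(clade) > 1 and len(rest) > 1:
            # Store both sides so the split does not depend on rooting.
            splits.add(frozenset((clade, rest)))
    return splits


def robinson_foulds(newick_a, newick_b):
    """Count the splits found in one tree but not in the other."""
    tree_a, tree_b = parse_newick(newick_a), parse_newick(newick_b)
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))
```

A distance of 0 means the two pipelines agree on the unrooted topology for that gene; summarising the distances over the 10-20 test groups would show whether the cheaper pipeline gives up much.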
It will probably be a more targeted approach than doing this for the whole transcriptome. However, it will still be enough genes that doing it manually each time would be prohibitive. Following up on your latest comment, do you think it would be worth just going with MrBayes?
What exactly am I looking for during manual inspection? I know gap removal is one of the things, but can't I just remove those with a gap removal tool/script?
When inspecting the alignments, you make 'informed decisions' to remove regions where the homology is poor or hard to judge, and/or make small adjustments to assign what looks like the correct homology. All of that is purely subjective, so I was never a big fan of manual inspection of multiple sequence alignments. I think an automated gap remover would do just as well, because at least it is not biased. As for MrBayes, you still need to choose the best substitution model for each locus before running it (is there a good way to automate that, maybe a shell pipeline?). I don't think there is a Perl wrapper for MrBayes, but I think it would be fairly easy to script with shell pipes.
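To make the 'automated gap remover' idea concrete, here is a minimal sketch in plain Python. It assumes the alignment is already loaded as a dict of taxon name to aligned sequence; the function name, the 50% default threshold, and the gap characters are illustrative choices, not anything prescribed by a particular tool.

```python
def strip_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-?"):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Returns a new dict with the same taxa and the surviving columns.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("sequences differ in length; is this an alignment?")

    # zip(*seqs) walks the alignment column by column.
    kept = [
        column
        for column in zip(*seqs)
        if sum(char in gap_chars for char in column) / len(column) <= max_gap_fraction
    ]
    return {name: "".join(column[i] for column in kept) for i, name in enumerate(names)}


example = {"cow": "ATG-CA", "pig": "AT--CA", "bat": "A-G-CT"}
trimmed = strip_gappy_columns(example)  # removes only the all-gap fourth column
```

Because the same threshold is applied to every locus, the trimming is at least reproducible. One caveat for cDNA: dropping single columns can break the reading frame, so a codon-aware variant would be needed if the alignment is later analysed by codon position.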
A more interesting problem, although probably not directly related to your initial question, might be how you deal with conflicting phylogenetic signal from the large number of ortholog groups. I'm sure there will be interesting patterns when you compare reconciled trees made from individual loci with trees made from a concatenated multi-locus dataset.
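For the concatenated side of that comparison, the per-locus alignments have to be joined into one supermatrix, with taxa that lack a locus padded out. A minimal sketch in plain Python follows, assuming each locus is a dict of taxon name to aligned sequence; the function name, the input layout, and the use of "?" for missing data are assumptions made for illustration.

```python
def concatenate_loci(loci, missing="?"):
    """Join per-locus alignments into one supermatrix.

    loci: dict mapping locus name -> {taxon name -> aligned sequence}.
    Returns (supermatrix, partitions), where supermatrix maps each taxon to
    its concatenated sequence and partitions lists (locus, start, end) with
    1-based inclusive coordinates.
    """
    taxa = sorted({taxon for aln in loci.values() for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1

    for locus, aln in loci.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {locus!r} is empty or not aligned")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa without this locus get a run of missing-data characters.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((locus, start, start + length - 1))
        start += length

    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions
```

The partition coordinates are kept so that each locus can still be given its own substitution model in the concatenated analysis, rather than forcing a single model across all genes.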
This is part of the plan. I wanted to take a set of orthologous groups that all share the same GO term or terms (or some other annotation) and estimate a phylogeny from them. As for dealing with conflicts, I have no idea.
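The selection step in that plan could be as simple as the sketch below. It assumes the annotations are available as a dict of ortholog group ID to GO terms; the group IDs, the GO IDs, and the function name are placeholders for illustration, not taken from the actual dataset.

```python
def groups_with_terms(annotations, required_terms):
    """Return the IDs of ortholog groups annotated with every required GO term.

    annotations: dict mapping ortholog group ID -> iterable of GO term IDs.
    """
    required = set(required_terms)
    return sorted(
        group for group, terms in annotations.items() if required <= set(terms)
    )


# Placeholder annotations; real ones would come from the transcriptome annotation.
annotations = {
    "OG0001": {"GO:0006955", "GO:0008152"},
    "OG0002": {"GO:0008152"},
    "OG0003": {"GO:0006955"},
}
selected = groups_with_terms(annotations, ["GO:0006955"])
```

The returned list of group IDs is what would then be fed, one group at a time, into the alignment and tree-building steps.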