Hi all,
I have constructed a phylogenetic tree for a set of 427 species using aligned and concatenated sequences for two genes (18S and COI). The species set is taxonomically very diverse and spans multiple phyla.
My workflow is as follows:
- Align sequences with MUSCLE
- Trim alignments with trimAl (using heuristic determination of best trimming method)
- Concatenate gene alignments with geneStitcher.py
- Perform maximum likelihood tree inference + parametric bootstrap with iqtree2
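For readers unfamiliar with the concatenation step, here is a minimal sketch of what building a two-gene supermatrix involves. It is not geneStitcher.py's actual implementation; the function name `concatenate` and the choice to pad missing taxa with gaps are assumptions made for illustration.

```python
def concatenate(alignments):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of (gene_name, {taxon: aligned_sequence}) pairs.
    Returns (matrix, partitions), where partitions holds
    (gene_name, start, end) with 1-based inclusive coordinates.
    Taxa missing from a gene are padded with gaps (an assumption here).
    """
    taxa = sorted({taxon for _, aln in alignments for taxon in aln})
    matrix = {taxon: "" for taxon in taxa}
    partitions = []
    position = 0
    for gene, aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {gene} differ in length")
        width = lengths.pop()
        for taxon in taxa:
            matrix[taxon] += aln.get(taxon, "-" * width)
        partitions.append((gene, position + 1, position + width))
        position += width
    return matrix, partitions


# Toy example: sp2 lacks COI and sp3 lacks 18S.
genes = [
    ("18S", {"sp1": "ACGT", "sp2": "AC-T"}),
    ("COI", {"sp1": "GGG", "sp3": "GGA"}),
]
matrix, partitions = concatenate(genes)
```

The partition coordinates matter because they allow each gene to be given its own substitution model during tree inference. Rows that are mostly padding are also worth watching, since taxa with only one of the two genes contribute little signal to the other partition.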
The topology of the tree generally looks okay, but the node bootstrap values are extremely low (1-4%) across the entire tree, which is concerning. I know alignment quality can affect bootstrap support, but I had hoped that running the alignments through trimAl would reduce any issues, especially as the dataset is too large to edit manually.
Are there other potential sources of error that would be causing such low bootstrap values?
There could be multiple reasons, for example that the two genes convey conflicting phylogenetic information. Why did you choose exactly those two (a ribosomal RNA gene plus a mitochondrial protein-coding gene)?
Some more ideas:
In conclusion, I think the choice or combination of genes is the problem here. If you want to do a multi-gene phylogeny, I would start from a protein-level alignment of single-copy orthologs or stick with the 18S sequence only, or both.
Thanks for the advice! I decided to use both 18S and COI to provide resolution among both closely and more distantly related species, since there is considerable taxonomic diversity in my species set.
The substitution model for each of the gene sub-alignments was selected by iqtree2, which uses ModelFinder. I inspected the COI and 18S alignments individually and ran the phylogenetic reconstruction for each gene separately. In both respects (alignment and tree), COI looks better than 18S: the COI tree has bootstrap values of around 12%, while the 18S bootstraps are around 1.5%.
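One way to put numbers on statements like "around 12%" is to pull the support values out of each tree file and summarize their distribution. The following is a minimal sketch, assuming the tree is a Newick string with a single numeric support label after each closing parenthesis. The names `support_values` and `summarize` and the 70% threshold are illustrative choices, not part of any tool in the workflow above.

```python
import re
import statistics


def support_values(newick):
    """Return the numeric labels that directly follow a ')' in a Newick string.

    Assumes one plain number per internal node and no ')' inside quoted
    taxon names; combined labels such as '85.3/97' are not handled.
    """
    return [float(x) for x in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]


def summarize(newick, threshold=70.0):
    """Summarize the support values found in a Newick string."""
    values = support_values(newick)
    if not values:
        raise ValueError("no support values found in tree")
    return {
        "n_nodes": len(values),
        "median": statistics.median(values),
        "max": max(values),
        "n_at_or_above": sum(v >= threshold for v in values),
    }


# Toy tree with one well-supported and one poorly supported clade.
tree = "((A:0.1,B:0.2)95:0.05,(C:0.1,D:0.1)12:0.02,E:0.3);"
summary = summarize(tree)
```

A median near zero with a handful of high values would suggest that some clades are recoverable even if the backbone is not, which is a different situation from uniformly low support.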
OK, so now I'd say things have slightly improved, but it also becomes harder to give more advice without seeing the actual tree and input data. For example, do all nodes have low support values, or are there some with good support as well? You could also experiment with many different tools and parameters, but it is easy to get absorbed by the many options and combinations. Some more ideas:
To make this easier, I would create a smaller subset of, say, 10-50 species (including some from well-supported and some from poorly supported branches) to experiment with. A subset that size also makes it possible to inspect the alignments visually.
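A subset like this can be carved out of an existing alignment with a few lines of code. This is only a sketch under stated assumptions: the alignment is already loaded as a dictionary of equal-length sequences, `subset_alignment` is a hypothetical helper name, and "-" is the only gap character. Realigning the subset from the unaligned sequences may be preferable to reusing the large alignment.

```python
def subset_alignment(alignment, keep):
    """Keep only the taxa in `keep` and drop columns that are all gaps.

    alignment: {taxon: aligned_sequence}, all sequences the same length.
    keep: ordered list of taxon names to retain.
    """
    missing = [taxon for taxon in keep if taxon not in alignment]
    if missing:
        raise KeyError(f"taxa not in alignment: {missing}")
    rows = [alignment[taxon] for taxon in keep]
    # A column is retained if at least one kept taxon has a residue there.
    columns = [
        i for i in range(len(rows[0]))
        if any(row[i] != "-" for row in rows)
    ]
    return {
        taxon: "".join(alignment[taxon][i] for i in columns)
        for taxon in keep
    }


# Toy example: removing sp3 leaves the third column gap-only, so it is dropped.
full = {"sp1": "AC-GT", "sp2": "A--GT", "sp3": "ACTGT"}
small = subset_alignment(full, ["sp1", "sp2"])
```

Dropping the gap-only columns keeps the subset compact enough to view in an alignment viewer.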