Hello, I am building my first phylogeny. It is shallow: I am trying to see the relationships among multiple populations of the same species. I am using the phyluce pipeline and have tried no-trim, edge-trim, and internal-trim alignments. Every run results in trees with low bootstrap values (ranging from 20 to 60) and exceptionally long branches. I read that misalignments can lead to such results, but how do I fix this? What are the ways to rectify misaligned sequences? I also tried MUSCLE, but the results weren't great. Any help is greatly appreciated, thanks!
Thanks for your reply. I am using a dataset of UCE loci obtained from multiple individuals from many populations of the same species, plus an outgroup. We sequenced our own data, and I aligned it with MAFFT using the no-trim option. When I viewed the alignment, there seemed to be a lot of missing data. Do you have any suggestions for handling long branch attraction?
Long branch attraction (LBA) is a complex problem, and it is not something a short online answer can fully resolve. Generally speaking, sequencing UCEs to differentiate populations is tricky - you might not have enough signal in the data. If you want to be able to justify your pipeline, I recommend diving into the literature and also exploring and understanding what's in your data. All of this is very normal for phylogenomic analysis, so don't be discouraged.
Short answer: filter for missing data, use partitioned analysis, remove individuals if needed.
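To make the missing-data filtering concrete, here is a minimal Python sketch. It assumes your alignment is a FASTA file with equal-length sequences and that gaps and ambiguous bases are coded as `-`, `?` or `N`. The function names and the 0.5 threshold are made up for illustration; they are not part of phyluce or any other tool, so treat the numbers as a starting point to explore rather than a recommendation.

```python
MISSING = set("-?Nn")  # characters treated as missing data (an assumption)


def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, keeping input order."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def missing_fraction_per_taxon(aln):
    """Fraction of missing characters in each sequence."""
    return {
        name: sum(c in MISSING for c in seq) / len(seq)
        for name, seq in aln.items()
    }


def drop_sparse_columns(aln, max_missing=0.5):
    """Keep only columns where at most max_missing of the taxa are missing."""
    names = list(aln)
    length = len(aln[names[0]])
    if any(len(aln[n]) != length for n in names):
        raise ValueError("sequences are not aligned to the same length")
    keep = [
        i for i in range(length)
        if sum(aln[n][i] in MISSING for n in names) / len(names) <= max_missing
    ]
    return {n: "".join(aln[n][i] for i in keep) for n in names}


# Toy example with three hypothetical samples
aln = parse_fasta(">pop1\nACGT\n>pop2\nA--T\n>pop3\nN--T\n")
per_taxon = missing_fraction_per_taxon(aln)
filtered = drop_sparse_columns(aln, max_missing=0.5)
```

Individuals with a very high missing fraction in `per_taxon` are the first candidates to remove and re-run without, to see whether the long branches and low support persist.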
Are the different loci concatenated together? This is something to avoid, because different loci can have different substitution rates. Instead, check whether specific loci are prone to missing data. You can build trees for different loci or different combinations of taxa and see whether specific populations or loci are prone to LBA, or whether removing one taxon changes the overall topology. Otherwise, filtering for missing data to remove unreliably aligned regions or sites that evolve faster than others can help, as can carefully considering the substitution model used in inference. For the latter, I recommend IQ-TREE 2 (iqtree2). It's quite fast, the documentation is good, and it runs a number of site-composition and model-selection tests that are very informative. A partitioned analysis may give you more robust results than concatenating all sites together.
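For the partitioned analysis, the inference program needs to know which alignment columns belong to which locus. As a rough sketch, the Python below writes a RAxML-style partition file from a list of locus names and alignment lengths, in the order the loci appear in the combined alignment. The locus names and lengths are hypothetical, and the helper function is my own illustration, not something phyluce or IQ-TREE provides.

```python
def raxml_style_partitions(locus_lengths):
    """Build RAxML-style partition lines from (name, length) pairs.

    Loci are assumed to sit one after another in the combined alignment,
    in the same order as the list. Coordinates are 1-based and inclusive.
    """
    lines = []
    start = 1
    for name, length in locus_lengths:
        if length <= 0:
            raise ValueError(f"locus {name!r} has non-positive length")
        end = start + length - 1
        lines.append(f"DNA, {name} = {start}-{end}")
        start = end + 1
    return "\n".join(lines) + "\n"


# Hypothetical loci and lengths, for illustration only
partitions = raxml_style_partitions([("uce_1", 100), ("uce_2", 284)])
print(partitions, end="")
```

As far as I know, iqtree2 accepts a file in this format through its `-p` option alongside `-s` for the alignment, but check the IQ-TREE documentation for the partition options before relying on it.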
Thank you so much for your time! Your suggestions are giving me a sense of direction :)