Hi everyone,
I am new to phylogenetics. I am trying to construct a phylogenetic tree of a gene family, including sequences from 3 related species. I started by isolating and aligning the amino acid sequences of a conserved motif present in these genes (so, not the entire protein sequence). I edited the alignment by removing columns with less than 50% coverage and used RAxML (100 bootstrap replicates) to get my tree. These genes are mainly organized in tandem gene arrays, and I aim to combine phylogenetic and syntenic data later on. So I need a high-quality tree, but this tandem organisation (tandem duplication) may suggest a high recombination rate, and recombination is known to have a negative impact on tree inference. So I was thinking of making an alignment with cDNA/genomic sequences, pre-processing this alignment to remove recombinant regions (using the GARD algorithm), and then using the output alignment to construct the tree. Does this strategy make sense? What do you think?
Thanks,
gentiaco
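For readers following along, the column-filtering step mentioned in the post above (dropping alignment columns with less than 50% coverage) can be sketched roughly as below. This is only an illustration, not the tool the poster used: the function name `filter_columns_by_coverage`, the dictionary input format and the choice of `-` and `?` as gap characters are all assumptions.

```python
def filter_columns_by_coverage(alignment, min_coverage=0.5, gap_chars="-?"):
    """Keep only alignment columns where at least `min_coverage` of the
    sequences have a residue (i.e. not a gap character).

    `alignment` is assumed to be a dict mapping sequence name to an
    aligned sequence string; all strings must have the same length.
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")

    n_seqs = len(names)
    kept_columns = []
    for col in range(length):
        # Count sequences that have a real residue in this column.
        non_gap = sum(1 for name in names if alignment[name][col] not in gap_chars)
        if non_gap / n_seqs >= min_coverage:
            kept_columns.append(col)

    # Rebuild every sequence from the retained columns only.
    return {
        name: "".join(alignment[name][col] for col in kept_columns)
        for name in names
    }


# Toy example (made-up motif sequences, not real data).
toy = {"geneA": "MK-L", "geneB": "M--L", "geneC": "MR-L", "geneD": "M--V"}
filtered = filter_columns_by_coverage(toy, min_coverage=0.5)
```

In the toy example the third column is all gaps and is dropped, while the second column sits exactly at 50% coverage and is kept, matching the "less than 50%" rule described in the post.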
If I’m interpreting this right, you just want a ‘reference’ tree that you trust to reflect (as accurately as possible) the true phylogenetic history of the taxa?
To this, you then intend to apply syntenic information to infer recombinants?
Hi,
Thank you for your reply.
That's right! As I said, I am new to phylogenetics, and I have only read that recombination may have a negative impact on the tree, leading to false-positive outcomes. I also assumed that these genes, since they are located in tandem gene array loci, are probably characterized by a high intergenic and interchromosomal recombination rate. So I was thinking of a strategy to minimize this and improve the quality of my current tree (which is now based only on the alignment of a specific motif): use either genomic DNA or cDNA sequences, align them (with PRANK or MUSCLE), and predict the high- and low-recombination regions of the sequences (using the GARD software). Then I will re-align the non-recombinant / low-recombination regions and use RAxML to build the tree. I will then use that tree and the syntenic analysis data to investigate the evolution of these loci and how evolution acted on them. Does this strategy to improve the quality of the tree make sense? Or should I rely only on a complete DNA alignment rather than following this protocol? "The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing" reports that this GARD-mediated processing of a DNA alignment can be a good idea if you want to go on to investigate how selection acted on the sequences of interest. However, I have not seen this approach applied to phylogenetic trees.
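One step of the protocol described above, cutting the alignment into putatively non-recombinant segments once breakpoints are known, might look roughly like the sketch below. It assumes the breakpoints have already been obtained (for example from a GARD run) and converted by hand into 0-based column indices, each marking the first column of a new segment; how GARD reports them is not covered here, and the function name `split_at_breakpoints` and the dictionary input format are hypothetical.

```python
def split_at_breakpoints(alignment, breakpoints):
    """Split an alignment into consecutive column segments.

    `alignment` is assumed to be a dict mapping sequence name to an
    aligned sequence string. `breakpoints` is assumed to be a list of
    0-based column indices; each index is the first column of a new
    segment. Returns a list of alignments (dicts), one per segment.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("alignment must be non-empty with equal-length sequences")
    length = lengths.pop()

    cuts = sorted(set(breakpoints))
    if any(b <= 0 or b >= length for b in cuts):
        raise ValueError("breakpoints must fall strictly inside the alignment")

    # Segment boundaries: start of alignment, each breakpoint, end of alignment.
    bounds = [0] + cuts + [length]
    segments = []
    for start, end in zip(bounds, bounds[1:]):
        segments.append({name: seq[start:end] for name, seq in alignment.items()})
    return segments


# Toy example (made-up sequences and breakpoints, not real data).
toy = {"gene1": "ATGCCGTA", "gene2": "ATGACGTT"}
parts = split_at_breakpoints(toy, [3, 6])
```

Each returned segment could then be re-aligned and passed to a tree-building program separately, or only the segments judged to be low in recombination could be kept, which is the choice the post above is asking about.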
I’m not really an expert, but that sounds reasonable to me.
I think it would be sufficient to use concatenated core genome genes to get an accurate phylogeny. Even if there are some core genes which are subject to recombination, the signal/noise ratio from the static core genes should be plenty good enough to get an accurate tree.
I’m not familiar with GARD specifically though, so I can’t really comment on that aspect.
Thank you for your answer! I forgot to specify that I aim to construct the phylogenetic tree of this gene family, not the tree of these 3 species. That’s why I’m worried about recombination. Anyway, if you say that it is reasonable, I’ll try it.
Ah ok I think I understand now - ignore my previous comment.
Rather than aligning all the genes individually, why not align the whole tandem array instead? You could use that as your ‘parent tree’. I would guess that it might hold enough of the signal from the ‘genome’, except of course if the whole tandem array has recombined.
Then align all the genes together to investigate any specific differences between the genes if that’s of interest too.
Unpicking recombination in phylogenies is still a very difficult and actively studied problem.
Thanks, I think that the idea of directly aligning the tandem gene arrays is good, but also challenging because of:
a) the length of these sequences
b) the high number of these arrays
c) the existence of genes of this family that are not found in these arrays
d) the occasional presence of long species-specific insertions that split the arrays, and the unconserved intronic patterns.
I will probably have a look at the arrays later by using the CoGe tool and quota alignment.
For the next step, I will probably use DNA sequences instead of protein and see if there is a way to minimize the effect of recombination.