Hello
I am a beginner in selection analysis. I am studying a large protein family that includes 540 sequences from 139 plant species. The phylogenetic tree is divided into several subgroups (clades), each containing sequences from different species. I characterized each group in terms of its final product and important motifs, and then did an expansion analysis. Now I want to run a positive selection analysis on these sequences using PAML, but I am confused: I do not know whether I should examine each subgroup of the phylogenetic tree separately or examine each species separately. For example, Arabidopsis has 11 copies of this gene in its genome, but they are scattered across different clusters. Which strategy is better: looking at species that have experienced expansion, like Arabidopsis, or at clades that have different characteristics?
Another point: my computer cannot analyze large data sets quickly (even data sets of up to 50 sequences are slow). Can I split large subgroups into smaller ones, or choose just a few species and analyze them separately, for example by comparing the Arabidopsis genes with each other? Would that negatively affect the results? I expected to find many positively selected sites because of the large number of duplicated genes and expansions in this family. But when I checked the clusters separately, I could not find any positively selected sites under the site-specific models. Is something wrong?
I would be grateful if you could guide me.
Hi,
I think focusing on the specific clades/lineages you are interested in is a good approach. This paper and this one may be helpful to look at.
Also, it seems to me that the full set of plant species may be too broadly distributed for a PAML positive selection analysis.
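On the site-model question: sites reported by codeml are usually only interpreted when a likelihood ratio test (LRT) between a null and an alternative site model is significant, for example M1a vs M2a or M7 vs M8, each compared against a chi-square distribution with 2 degrees of freedom. Below is a minimal sketch of that test in Python. The function name and the log-likelihood values are hypothetical placeholders, not results from your data; you would substitute the lnL values reported in your own codeml output.

```python
import math


def site_model_lrt(lnl_null, lnl_alt):
    """Likelihood ratio test for two nested codeml site models.

    Assumes the models differ by 2 free parameters (as in M1a vs M2a
    or M7 vs M8), so the statistic is compared to chi-square with df=2.
    """
    # LRT statistic: twice the log-likelihood difference, floored at 0
    # because optimisation noise can make it slightly negative.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # For df=2 the chi-square survival function is exactly exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical lnL values, for illustration only.
stat, p_value = site_model_lrt(lnl_null=-12345.67, lnl_alt=-12340.12)
print(f"2*deltaLnL = {stat:.2f}, p = {p_value:.4g}")
```

If the test is not significant for a clade, finding no positively selected sites under the site models is the expected outcome for that clade, not necessarily a sign that something went wrong.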
Thanks a lot. Your advice was very helpful.