Hi, I want to test one of my target genes for positive selection using codeml in PAML. At the moment I have 156 sequences of this gene, but they collapse into only 15 haplotypes. Should I run the positive selection analysis on all 156 sequences, or only on the 15 unique haplotype sequences? Thanks for helping me with this question!
I have tried both approaches, using all sequences and using only the unique haplotypes, and the results were the same in both cases. Using fewer sequences is certainly less computationally expensive, so I would advise working with the 15 haplotypes.
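If it helps, here is a minimal sketch of collapsing an alignment to its unique haplotypes before running codeml. It assumes the sequences are already aligned and in FASTA format; the function names and the `hapN_nM` labels are made up for illustration, and gaps or ambiguity codes are compared as literal characters, so two sequences differing only by an `N` count as different haplotypes.

```python
from collections import OrderedDict


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def collapse_haplotypes(records):
    """Group sequence names by identical (case-insensitive) sequence."""
    haplotypes = OrderedDict()
    for name, seq in records:
        haplotypes.setdefault(seq.upper(), []).append(name)
    return haplotypes


def write_haplotype_fasta(haplotypes):
    """Return FASTA text with one record per unique haplotype."""
    lines = []
    for i, (seq, names) in enumerate(haplotypes.items(), start=1):
        # Hypothetical label: haplotype index plus number of carriers.
        lines.append(f">hap{i}_n{len(names)}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Toy alignment: three individuals, two haplotypes.
example = ">ind1\nATGGCCAAA\n>ind2\nATGGCCAAA\n>ind3\nATGGCTAAA\n"
haps = collapse_haplotypes(read_fasta(example))
print(write_haplotype_fasta(haps), end="")
```

Keeping the list of individual names per haplotype makes it easy to map results back to the original 156 samples afterwards.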
It sounds like you are using data from many closely related individuals if that many sequences share haplotypes. Using a dataset with currently segregating polymorphisms (as from population-level sampling) will inflate your estimates of omega (dN/dS). codeml works best with fixed differences among divergent groups: the 'power' to detect selection increases with the distance among sequences (i.e., longer branches = more power). To convince yourself of this, consider whether you could detect selection using a tree that has little resolution vs. a tree that has lots of structure.
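One rough way to see how little divergence a population-level dataset contains is to look at the pairwise distances among the sequences. The sketch below computes a simple uncorrected p-distance (proportion of differing sites); the function names are made up for illustration, and skipping `-`, `N` and `?` is my own assumption about how to treat missing data. It is only a sanity check, not a substitute for the branch lengths codeml estimates.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or missing-data character
    are skipped (an assumption made for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N", "?"}
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in skip or y in skip:
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else 0.0


def mean_pairwise_distance(seqs):
    """Average p-distance over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy haplotypes; real input would be the aligned haplotype sequences.
haplotypes = ["ATGGCCAAA", "ATGGCTAAA", "ATGGCTAAG"]
print(mean_pairwise_distance(haplotypes))
```

If the mean distance is tiny, most branches in the tree will be close to zero length, which is the low-power situation described above.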
If you absolutely cannot get around these restrictions and still need to use codeml, I would recommend using only the unique haplotypes. The branches between individuals with similar or identical haplotypes will be very short and will not add any power to the analysis.
You can estimate a tree using a variety of approaches and then load that tree into codeml, either as a starting point or with the branch lengths fixed (after converting them to units of substitutions per codon!). You could also visualize the tree at this point.
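For the unit conversion, a tree estimated under a nucleotide model usually has branch lengths in substitutions per nucleotide site, so a common approximation is to multiply each length by 3 to get substitutions per codon. Here is a minimal sketch that rescales the branch lengths in a Newick string; the function name is made up, the factor of 3 is that approximation rather than an exact correction, and the regular expression assumes plain numeric branch lengths with no colons inside quoted labels. As far as I know, the control-file option that decides whether codeml treats the tree's lengths as initial values or as fixed is `fix_blength`, but check the PAML manual for your version.

```python
import re

# Matches ":<number>" where the number is a non-negative decimal,
# optionally in scientific notation.
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def scale_branch_lengths(newick, factor=3.0):
    """Multiply every branch length in a Newick string by `factor`."""
    def _scale(match):
        return ":" + format(float(match.group(1)) * factor, ".6f")
    return _BRANCH_LENGTH.sub(_scale, newick)


# Toy tree with lengths in substitutions per nucleotide site.
nucleotide_tree = "((hap1:0.01,hap2:0.02):0.005,hap3:0.1);"
print(scale_branch_lengths(nucleotide_tree))
```

Node support values written directly after a closing parenthesis are left alone, because only numbers that follow a colon are rescaled.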