I am working on coding sequences of two plant species and have identified their orthologs using reciprocal best BLAST hits. To detect derived sites in the species of interest, I am treating the other species as carrying the ancestral sequence; comparing all sites between the two then lets me look at the derived sites in the species of interest.
I need to separate the synonymous and non-synonymous derived sites so that I can compute their summary statistics separately. I would like to understand the significance of classifying them differently, and I would also welcome suggestions on how this can be done.
1) With only two species, you cannot determine which state is ancestral and which is derived. You need a minimum of 3 OTUs with a clear outgroup to determine ancestral states, and even then you can be misled.
2) Synonymous and non-synonymous sites are treated differently because they are subject to different selective regimes, and therefore their rates differ drastically from one another. Synonymous rates are more similar across genes than non-synonymous rates. See Wen-Hsiung Li's textbook for detailed consideration of this classic topic in molecular evolution.
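To make point 1 concrete, here is a minimal Python sketch of outgroup polarization under a simple parsimony rule: the state shared with the outgroup is taken as ancestral. The function names and the three-sequence input are assumptions for illustration only. As noted above, this rule can mislead, for example when the outgroup itself has changed at the site.

```python
def polarize_site(focal, sister, outgroup):
    """Say which ingroup lineage carries the derived state at one aligned site.

    Parsimony assumption: the state that matches the outgroup is ancestral.
    """
    bases = {"A", "C", "G", "T"}
    if not {focal, sister, outgroup} <= bases:
        return "missing"            # gap or ambiguity code in at least one sequence
    if focal == sister:
        return "no_difference"      # the two ingroup taxa agree, nothing to polarize
    if sister == outgroup:
        return "derived_in_focal"
    if focal == outgroup:
        return "derived_in_sister"
    return "ambiguous"              # all three states differ


def polarize_alignment(focal_seq, sister_seq, outgroup_seq):
    """Apply polarize_site to every column of three aligned, equal-length sequences."""
    if not (len(focal_seq) == len(sister_seq) == len(outgroup_seq)):
        raise ValueError("sequences must be aligned to the same length")
    columns = zip(focal_seq.upper(), sister_seq.upper(), outgroup_seq.upper())
    return [polarize_site(f, s, o) for f, s, o in columns]
```

With only two sequences there is no third state to compare against, so every difference would fall into the equivalent of the "ambiguous" case.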
Without an outgroup, you can't really say which sites are ancestral and which are derived. There are tons of different methods to calculate Ka/Ks, if that's what you're trying to do. For a start, you can take a look at Nei and Gojobori 1986 to see what those parameters mean and how they are calculated. More modern methods use a phylogenetic tree and codon-based models to estimate Ka/Ks site by site and lineage by lineage, as implemented in HyPhy.
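As a rough illustration of the synonymous versus non-synonymous distinction, the Python sketch below counts codon differences between two aligned, in-frame coding sequences. It is not the Nei and Gojobori method, which also counts synonymous and non-synonymous sites and averages over mutational pathways. It assumes the standard genetic code, and the function name and the result keys are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codon_differences(seq_a, seq_b):
    """Count codon differences between two aligned, in-frame coding sequences."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, equal length, and a multiple of 3")
    counts = {"synonymous": 0, "nonsynonymous": 0, "multiple_changes": 0, "skipped": 0}
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            counts["skipped"] += 1            # gaps or ambiguity codes
            continue
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff == 0:
            continue
        if n_diff > 1:
            counts["multiple_changes"] += 1   # pathway-dependent, left unclassified here
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            counts["synonymous"] += 1
        else:
            counts["nonsynonymous"] += 1
    return counts
```

Codons that differ at more than one position are left unclassified because the order of the changes affects the count, which is exactly the case the pathway averaging in published methods is meant to handle.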
No, you didn't understand. You need at least three taxa to infer ancestral and derived states, and only if you're pretty sure that one of the three falls outside the other two can you use it as an outgroup. You can never figure out ancestral and derived states from two taxa alone. As for Tajima's D, if I understand right, it's a statistic used at the population level. It's an indicator of the frequency of rare alleles, which means you need at least 3 entries/sequences. By the way, identifying orthologs is not as straightforward as you think; usually BLAST is not enough.
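For reference, Tajima's D compares two estimates of diversity: the mean number of pairwise differences and the number of segregating sites scaled by a harmonic-number term. The Python sketch below follows the standard formula and assumes a gap-free alignment of sequences sampled from one population. The function name is an assumption for this example. Note that the variance term in this formula is zero for fewer than four sequences, so the sketch requires at least four.

```python
from collections import Counter
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for a list of aligned, gap-free sequences from one population.

    Returns None when there are no segregating sites.
    """
    n = len(sequences)
    if n < 4:
        raise ValueError("need at least 4 sequences; the variance term is zero below that")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pi = 0.0  # mean number of pairwise differences, summed over sites
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            differing_pairs = (n * n - sum(c * c for c in counts.values())) / 2
            pi += differing_pairs / n_pairs
    if seg_sites == 0:
        return None

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

To compute the statistic separately for synonymous and non-synonymous variation, one option is to build two alignments, each containing only the sites of one class, and pass each to the function.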
I have 50 accessions of one species, one of them being the reference sequence. Basically, I am running BLAST between the reference sequence and the other species and keeping genes with a bidirectional best hit, as Khader has pointed out in this thread.
When you say BLAST is not enough, are there any other (better) tools that can be used?
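The bidirectional best-hit step described here can be sketched in Python as follows. The sketch does not run BLAST. It assumes the hits from each search direction have already been parsed into (query, subject, bit score) tuples, and the function names are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    On ties, the first hit encountered is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits_a_vs_b, hits_b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(hits_a_vs_b)
    best_ba = best_hits(hits_b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```

A pair that passes this filter is only a candidate ortholog. Recent duplications or gene loss in one lineage can still produce reciprocal best hits between paralogs.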
written 13.3 years ago by ngsgene, updated 5.0 years ago by Ram
I am studying a single species, and the other is therefore being treated as the outgroup. As Bergman pointed out, I need to look into that.
Initially I want to calculate Tajima's D, and I am trying to work out the separation into synonymous and non-synonymous. What exactly needs to be separated if I am computing Tajima's D on genes, given that whether a change is synonymous is decided at the amino acid level?
BLAST-based ortholog identification is one type of approach; the other major type uses some kind of phylogenetic tree to guide the ortholog identification. There is a very good summary of this here. Personally, I'd prefer tree-based approaches.
written 13.3 years ago by Vitis, updated 5.0 years ago by Ram
Also, for 50 accessions (individuals from different populations?) within one species, if they're very close (check ~10 genes?), it'd be straightforward to find the orthologs, because genetic divergence within one species is usually very small. You can probably use the read mapping results to reconstruct the orthologs directly.
That's a good point. I had been considering one of the two as an outgroup and deriving information based on that.