I want to map UniProt proteins (main isoform) to Ensembl coding sequences in order to estimate Ka/Ks between very closely related species.
I went to UniProt and downloaded the protein sequences together with their Ensembl transcript IDs (ENST). I converted the ENSTs to ENSGs because, if I understood correctly, an ENSG represents a physical location in the genome and the ENSTs are variants. Up to this point everything was OK. But then I tried to get the corresponding coding sequences. I went to Ensembl and downloaded the sequences for each ENSG. For each ENSG I looked for the ENST that codes for the corresponding protein. A large fraction of the ENSGs (~50%) have no transcript whose translation exactly matches the protein sequence.
I had more success running exonerate against the CDS sequences (from Ensembl): only 6% of the protein/DNA pairs show differences (mismatches, insertions or deletions). This is clearly better, but:
Is the exonerate approach a good way to do this?
Why do so many UniProt proteins fail to match an Ensembl coding sequence?
I think you misunderstood the Ensembl annotations: ENSTs are not variants. An ENSG denotes a gene and an ENST denotes a transcript. Both have genomic coordinates, and protein-coding transcripts have translations, which are often associated with a UniProt accession if the protein is represented in UniProt. Now, if you have UniProt accessions for one species that you want to map to Ensembl proteins for another species, you could download the protein sequences from the Ensembl FTP site and use them for mapping with your tool of choice. However, keep in mind that even closely related species will have differences at the protein level; the number of differences will depend on how close or distant the species are.
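To make the exact-match check from the question concrete, here is a minimal Python sketch. It translates each candidate CDS of a gene and reports which transcripts encode exactly the given protein. It assumes the standard genetic code, CDS sequences that start in frame at the first codon, and a UniProt sequence without post-processing. The function names and the transcript IDs (`T1`, `T2`, `T3`) are hypothetical placeholders, not Ensembl or UniProt identifiers.

```python
# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate(cds):
    """Translate an in-frame CDS; unknown codons become 'X'.

    Trailing stop codons are stripped and an incomplete final codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )
    return protein.rstrip("*")


def find_matching_transcripts(protein, cds_by_transcript):
    """Return the IDs of transcripts whose translation equals `protein` exactly."""
    target = protein.upper().rstrip("*")
    return [
        transcript_id
        for transcript_id, cds in cds_by_transcript.items()
        if translate(cds) == target
    ]


# Toy data: three hypothetical transcripts of one gene.
example_cds = {
    "T1": "ATGGCCAAATAA",  # Met-Ala-Lys-stop
    "T2": "ATGGCCAAGTGA",  # synonymous codon for Lys, same protein
    "T3": "ATGGCCTAA",     # shorter isoform: Met-Ala-stop
}
matches = find_matching_transcripts("MAK", example_cds)
print(matches)
```

A gene with an empty result here is a candidate for the alignment-based route (for example exonerate), since an exact string comparison cannot distinguish a single substitution from a completely different isoform.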
Thank you for your answer. It is helpful. I had trouble understanding clearly what an ENST was.