Hello! I wonder if there is a way to convert the CDS coordinates of a gene to amino-acid-based coordinates. See the example:
five_prime_utr1 1 495
exon1 496 568
CDS1 496 568
intron1 569 698
exon2 699 968
CDS2 699 968
The protein for this gene will start at nucleotide position 496 (CDS1), so 496 will be position 1 of the resulting protein. What would the end position (568) be? What about the next domain, the one whose CDS starts at 699 (CDS2)?
What I need is to identify the subsequence of my full protein that was encoded by each CDS, e.g.:
ABC DEF GHIJKL MNOPQ RST (full protein sequence, amino acid domains belonging to each CDS in bold)
------ CDS1 ----------- CDS2
------- 3-5---------------13-17
I've found similar questions such as: Python Framework For Converting Genomic To Protein Coordinates
Is there any solution in place already for this problem that seems quite common?
I am not sure I fully understand your question, but I think the layout would look like this: 5' UTR -- CDS domain 1 -- intron sequence -- CDS domain 2. Assuming the coordinates are 1-based and inclusive (as in GTF), CDS1 spans 568 - 496 + 1 = 73 nucleotides and CDS2 spans 968 - 699 + 1 = 270 nucleotides. 73 is not a multiple of 3, so CDS1 encodes 24 complete codons plus the first nucleotide of codon 25. When this gene is processed, the intron will be clipped off and the exons (CDS domains) will be merged together. So CDS1 covers amino acids 1 to 24 in full, amino acid 25 is split across the exon boundary (one nucleotide from CDS1, two from CDS2), and the 26th amino acid is the first one encoded entirely by the second exon.
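To make the arithmetic concrete, here is a minimal sketch of the coordinate conversion. It assumes the CDS intervals are 1-based and inclusive (as in GTF), lie on the plus strand, and are listed in 5' to 3' order starting at the first coding nucleotide. The function name `cds_to_protein_ranges` is made up for illustration.

```python
def cds_to_protein_ranges(cds_intervals):
    """Map genomic CDS intervals to amino acid ranges.

    cds_intervals: list of (start, end) tuples, 1-based inclusive,
    plus strand, in 5'->3' order (assumptions, see text above).
    Returns one (first_aa, last_aa) tuple per CDS, 1-based inclusive.
    A codon split across two CDSs is reported in both ranges.
    """
    ranges = []
    offset = 0  # coding nucleotides seen before the current CDS
    for start, end in cds_intervals:
        length = end - start + 1
        first_aa = offset // 3 + 1                 # codon holding the first base
        last_aa = (offset + length - 1) // 3 + 1   # codon holding the last base
        ranges.append((first_aa, last_aa))
        offset += length
    return ranges


cds = [(496, 568), (699, 968)]
print(cds_to_protein_ranges(cds))  # [(1, 25), (25, 115)]
```

Amino acid 25 shows up in both ranges because its codon straddles the intron. For a minus-strand gene the intervals would need to be walked from the highest coordinate down, which this sketch does not handle.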
Hi! Sorry if I didn't explain it clearly. In a nutshell, what I need is "simply" to extract the amino acid sequence corresponding to exon 2 and exon 3 of a given gene. What I have is a FASTA file with the gene sequence, another FASTA file with the protein sequence, and a GTF with the annotation of the gene.
Ah, so you already have a gene sequence where all exons are merged together, and you want to "separate" those exons and then translate them into their individual amino acid sequences? I think this should be straightforward if you know the annotation: how many exons you have, their order, and the length of each one. Then just "splice" the gene sequence according to this information; a short for-loop would probably do the trick. Once you have the separated exons, "translate" each one into its amino acid sequence, which should be simple with BioPython.
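A rough sketch of that loop, under some assumptions: the gene sequence is an uppercase plus-strand string whose first character is position 1, and the CDS intervals are 1-based inclusive and in 5' to 3' order. To keep it dependency-free it builds the standard codon table by hand; Biopython's `Seq.translate` could be used instead. The helper names `translate` and `peptides_per_cds` and the toy sequence are invented for the example. Each CDS is trimmed to the codons it contains completely, so a codon split by an intron is not assigned to either piece.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate complete codons only; a trailing partial codon is dropped."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


def peptides_per_cds(gene_seq, cds_intervals):
    """Return the peptide encoded entirely within each CDS."""
    peptides = []
    offset = 0  # coding nucleotides seen before the current CDS
    for start, end in cds_intervals:
        piece = gene_seq[start - 1:end]   # 1-based inclusive -> Python slice
        skip = (-offset) % 3              # bases completing the previous codon
        peptides.append(translate(piece[skip:]))
        offset += len(piece)
    return peptides


# Toy gene: 2 nt UTR, 7 nt CDS1, 6 nt intron, 8 nt CDS2.
gene = "CC" + "ATGGCCA" + "GTAAGT" + "AAGGATAA"
print(peptides_per_cds(gene, [(3, 9), (16, 23)]))  # ['MA', 'G*']
```

The spliced toy CDS translates to MAKG*; the K is missing from both pieces because its codon (A from CDS1, AA from CDS2) spans the intron. Since you already have the protein FASTA, slicing the protein by the amino acid coordinates of each CDS would give the same result without translating anything.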