I have an Ensembl transcript, ENST00000029410, with a C>T mutation at 1-based position 808. Mapping this position to the protein is very simple, because the transcript position I have is its position within the coding sequence, so the mutation's position on the protein is simply ceil(808/3) = 270.
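To make the arithmetic explicit, here is a minimal sketch of the codon calculation. The function names `codon_number` and `base_in_codon` are made up for illustration, and it assumes the input is a 1-based position within the coding sequence, as described above.

```python
def codon_number(cds_pos: int) -> int:
    """Return the 1-based codon (amino acid) index for a 1-based CDS position."""
    if cds_pos < 1:
        raise ValueError("CDS positions are 1-based")
    # Integer form of ceil(cds_pos / 3); avoids floating point.
    return (cds_pos + 2) // 3


def base_in_codon(cds_pos: int) -> int:
    """Return which base of the codon (1, 2 or 3) the position falls on."""
    if cds_pos < 1:
        raise ValueError("CDS positions are 1-based")
    return (cds_pos - 1) % 3 + 1


if __name__ == "__main__":
    # Position 808 from the question.
    print(codon_number(808), base_in_codon(808))
```

For position 808 this gives codon 270, first base of the codon.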
This transcript also maps to three RefSeq transcripts (according to Ensembl's BioMart): XM_006714816, XM_006714815, and XM_005265805. I assumed a RefSeq transcript (XM or NM) represents the entire Ensembl transcript that maps to it, so I expected position 808 on ENST00000029410 to also map to position 808 on each of the three RefSeq transcripts. Instead, it mapped to three different positions: 833, 1044 and 1319, respectively. Where are these positions coming from? And how can they be used to find the mutation's position on the resulting protein? Clearly, dividing these positions by 3 does not give position 270 on the protein.
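One possible explanation, which is not confirmed anywhere in this thread, is that the RefSeq positions are counted from the start of the full transcript (including any 5' UTR) rather than from the start of the coding sequence. Under that assumption, a transcript position has to be shifted by the CDS start before dividing by 3. The sketch below illustrates this; `protein_position` is a made-up helper, and the `cds_start` value of 26 is hypothetical, chosen only so the arithmetic lands back on 808, not looked up from the actual record.

```python
def cds_position(transcript_pos: int, cds_start: int) -> int:
    """Convert a 1-based transcript position to a 1-based CDS position.

    cds_start is the 1-based transcript position of the first base of the
    start codon (assumed to be known, e.g. from the transcript's annotation).
    """
    pos = transcript_pos - cds_start + 1
    if pos < 1:
        raise ValueError("position lies upstream of the CDS (5' UTR)")
    return pos


def protein_position(transcript_pos: int, cds_start: int) -> int:
    """Return the 1-based amino acid index for a transcript position."""
    # Integer form of ceil(cds_pos / 3).
    return (cds_position(transcript_pos, cds_start) + 2) // 3


if __name__ == "__main__":
    # HYPOTHETICAL cds_start: 26 is an illustrative value, not real annotation.
    print(cds_position(833, 26), protein_position(833, 26))
```

If the real CDS start of each RefSeq transcript gives a CDS position of 808, the offsets are just UTR length. If not, the transcripts differ in exon structure and the positions need to be mapped through genomic coordinates instead.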
Hmm, interesting. So NMs are not always perfect matches to an ENST that maps to them? I'm basically just trying to figure out how to connect RefSeq transcripts to my node graph of other identifiers (ENST, ENSG, ENSP, GRCh38 chromosome, UniProt, RefSeq protein). But in order to connect a RefSeq transcript to a node, I have to be confident not only in my ID conversions but also in my position conversions. Perhaps I could just connect it to the GRCh38 chromosome instead of to the ENST? Would you happen to know of a file that maps RefSeq transcripts to their chromosomal positions? I have been looking through RefSeq's database to no avail.
There can be cases where the NMs aren't a perfect match to an ENST. You can get the genomic coordinates for the NMs from this GFF3 file from NCBI. You can also get the start and end coordinates of the RefSeq transcripts using the Ensembl REST API. See this example.
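As a rough illustration of the GFF3 route, here is a minimal sketch that collects exon coordinates per transcript from GFF3 lines. It relies only on the standard nine tab-separated GFF3 columns and assumes the exon rows carry a `transcript_id` attribute; check that against the file you download. The sample rows, including the `XM_EXAMPLE.1` accession and `chrTest` sequence name, are made up for the demonstration and are not real annotation.

```python
from collections import defaultdict


def parse_attributes(field: str) -> dict:
    """Parse the GFF3 attributes column ("key=value;key=value") into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def exon_intervals(gff3_lines, feature_type="exon"):
    """Map transcript_id -> list of (seqid, start, end, strand), sorted by start.

    Coordinates are 1-based and inclusive, as in the GFF3 specification.
    Assumes each exon row has a 'transcript_id' attribute (verify in your file).
    """
    exons = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        transcript_id = parse_attributes(cols[8]).get("transcript_id")
        if transcript_id is None:
            continue
        exons[transcript_id].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return {tid: sorted(rows, key=lambda r: r[1]) for tid, rows in exons.items()}


# MADE-UP sample rows for illustration only; not real RefSeq annotation.
SAMPLE = [
    "##gff-version 3",
    "\t".join(["chrTest", "Gnomon", "exon", "500", "650", ".", "+", ".",
               "ID=exon-2;Parent=rna-1;transcript_id=XM_EXAMPLE.1"]),
    "\t".join(["chrTest", "Gnomon", "exon", "100", "250", ".", "+", ".",
               "ID=exon-1;Parent=rna-1;transcript_id=XM_EXAMPLE.1"]),
    "\t".join(["chrTest", "Gnomon", "mRNA", "100", "650", ".", "+", ".",
               "ID=rna-1;transcript_id=XM_EXAMPLE.1"]),
]

if __name__ == "__main__":
    for tid, rows in exon_intervals(SAMPLE).items():
        print(tid, rows)
```

For a real file you would pass an open file handle instead of `SAMPLE`, for example `exon_intervals(open(path))`. With the exon intervals in hand, a transcript position can be walked along the exons to get a chromosomal coordinate, which can then serve as the common node between RefSeq and Ensembl identifiers.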