Question

Can new versions of the same transcript ID have different exon coordinates?

0

5 months ago

ManuelDB ▴ 110

I am uncertain if I understand the following statement correctly:

"The RefSeq transcript ID refers to the sequence of the transcript, represented by the NM_xxxxx.y accession number. The version is indicated by the number after the dot. Different versions of RefSeq transcripts have distinct sequences; for instance, additional sequences may be added to the UTRs or even the CDS. As a result, the transcript coordinates can change from one version to the next. This is why it is important to report the version number of the transcript, e.g., report NM_012309.4 instead of NM_012309."

https://genome-euro.ucsc.edu/FAQ/FAQgenes.html#:~:text=Different%20RefSeq%20transcript%20versions%20have,for%20readers%2C%20e.g.%20report%20NM_012309.

Does this mean that the start and end positions of the exons may change between versions of the same transcript ID? If so, why? In my view, these changes should warrant a new and independent ID.

RefSeq • 528 views

ADD COMMENT • link updated 5 months ago by i.sudbery 20k • written 5 months ago by ManuelDB ▴ 110

1

Yes, exon coordinates can change between versions. Why this happens is probably best answered by people who do genome annotation extensively, but my understanding (as someone who has worked with transcript identifiers) is that assigning a new version to the transcript, rather than a different ID altogether, is easier to handle. Even if an exon boundary shifts by 2-3 bases, keeping a link between the old and new transcripts makes things easier to track across versions.
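To make the accession.version idea concrete, here is a minimal Python sketch. It assumes accessions of the form `NM_012309.4` (two-letter prefix, underscore, digits, optional dot and version). The helper names `parse_accession` and `changed_exons` and the exon coordinates are made up for illustration; they are not taken from any real RefSeq record.

```python
import re
from itertools import zip_longest
from typing import List, NamedTuple, Optional, Tuple


class VersionedAccession(NamedTuple):
    accession: str           # e.g. "NM_012309"
    version: Optional[int]   # e.g. 4, or None if no version was given


# Assumed pattern: two capital letters, underscore, digits, optional ".version"
_ACCESSION_RE = re.compile(r"^([A-Z]{2}_\d+)(?:\.(\d+))?$")


def parse_accession(text: str) -> VersionedAccession:
    """Split 'NM_012309.4' into ('NM_012309', 4)."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised accession: {text!r}")
    accession, version = match.groups()
    return VersionedAccession(accession, int(version) if version else None)


Exon = Tuple[int, int]  # (start, end)


def changed_exons(old: List[Exon], new: List[Exon]) -> List[int]:
    """Return 1-based ranks of exons whose (start, end) differ.

    An exon present in only one of the two versions counts as changed.
    """
    return [
        rank
        for rank, (a, b) in enumerate(zip_longest(old, new), start=1)
        if a != b
    ]


# Hypothetical coordinates for two versions of the same accession:
# only the start of the last exon moves by 3 bases.
exons_v1 = [(100, 200), (300, 400), (500, 650)]
exons_v2 = [(100, 200), (300, 400), (503, 650)]

print(parse_accession("NM_012309.4"))     # accession='NM_012309', version=4
print(changed_exons(exons_v1, exons_v2))  # [3]
```

The accession stays the same across both exon lists, so the link between versions is preserved, while the version number tells you which set of coordinates you are looking at.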

ADD REPLY • link 5 months ago by manaswwm ▴ 550

0

Thank you for your answer.

If this is true, why doesn't BioMart take RefSeq transcript versions into account?

For example, BioMart doesn't recognise NM_004333.6, but it does recognise NM_004333. It then lets me extract the exon start and end positions, as in the example below.

| Gene stable ID | Gene stable ID version | Transcript stable ID | Transcript stable ID version | Exon region start (bp) | Exon region end (bp) | Exon rank in transcript | Version (transcript) | Constitutive exon | cDNA coding start | cDNA coding end | Genomic coding start | Genomic coding end |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000157764 | ENSG00000157764.14 | ENST00000646891 | ENST00000646891.2 | 140850111 | 140850212 | 2 | 2 | 0 | 365 | 466 | 140850111 | 140850212 |
| ENSG00000157764 | ENSG00000157764.14 | ENST00000646891 | ENST00000646891.2 | 140834609 | 140834872 | 3 | 2 | 0 | 467 | 730 | 140834609 | 140834872 |
| ENSG00000157764 | ENSG00000157764.14 | ENST00000646891 | ENST00000646891.2 | 140808892 | 140808995 | 4 | 2 | 0 | 731 | 834 | 140808892 | 140808995 |
| ENSG00000157764 | ENSG00000157764.14 | ENST00000646891 | ENST00000646891.2 | 140807960 | 140808062 | 5 | 2 | 0 | 835 | 937 | 140807960 | 140808062 |

The number of errors in this data is going to be proportional to the number of times genomic coordinates change between versions of a transcript ID, isn't it? If a new version can modify genomic coordinates, then the version information is important.

Is this correct?

ADD REPLY • link 5 months ago by ManuelDB ▴ 110

1

BioMart is a tool for extracting data about Ensembl transcripts, not RefSeq transcripts. It includes information about which RefSeq transcript an Ensembl transcript is most like, and therefore allows you to use the RefSeq ID to look up transcripts. However, what you get back is an Ensembl transcript, and there is never any guarantee that it is identical to the RefSeq transcript; it is just whichever is the closest match.
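If you do look things up by the version-less RefSeq ID, one way to avoid silently losing the version is to carry it along with the result. The sketch below is only an illustration: `lookup_by_base` and `xref_table` are hypothetical names, the table is a plain dictionary standing in for whatever cross-reference you actually query, and the single entry is the NM_004333 / ENST00000646891 pairing from the output above.

```python
from typing import Dict, List, Optional, Tuple


def split_version(refseq_id: str) -> Tuple[str, Optional[str]]:
    """Split 'NM_004333.6' into ('NM_004333', '6'); version is None if absent."""
    base, dot, version = refseq_id.partition(".")
    return base, (version if dot else None)


def lookup_by_base(refseq_id: str, xref_table: Dict[str, List[str]]) -> dict:
    """Look up Ensembl transcripts by base accession, keeping the requested version.

    The lookup ignores the version, so the result is always flagged as
    unverified: the caller still has to check that the coordinates match
    the RefSeq version they care about.
    """
    base, version = split_version(refseq_id)
    return {
        "queried_base": base,
        "requested_version": version,
        "ensembl_transcripts": xref_table.get(base, []),
        "version_verified": False,
    }


# Stand-in cross-reference table (hypothetical structure).
xref_table = {"NM_004333": ["ENST00000646891"]}

result = lookup_by_base("NM_004333.6", xref_table)
print(result["ensembl_transcripts"])  # ['ENST00000646891']
print(result["requested_version"])    # 6
```

Keeping `requested_version` in the result means a later step can compare the returned exon coordinates against the specific RefSeq version rather than assuming they agree.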

ADD REPLY • link 5 months ago by i.sudbery 20k

1

Well, if changing the start/end coordinates of exons meant issuing a new transcript_id (and I can understand the arguments for that), then there would never be any transcript versions, because a transcript annotation is just the start and end coordinates of its exons and nothing else.

As to why the annotators decide to use versions rather than issuing a new transcript_id, imagine that there has been 20 years' worth of research on the transcript NM12345, and then someone discovers that the 5' end of the final exon is 3 nt out. It might even be that the 3' end of the penultimate exon is also 3 nt out, so that there happens to be no change in the coding sequence. Should all papers now cite a new transcript ID (NM12346), meaning that unless you were very involved in the subject, you would not even connect the 20 years of literature on NM12345 to NM12346?

ADD REPLY • link 5 months ago by i.sudbery 20k