I am uncertain if I understand the following statement correctly:
"The RefSeq transcript ID refers to the sequence of the transcript, represented by the NM_xxxxx.y accession number. The version is indicated by the number after the dot. Different versions of RefSeq transcripts have distinct sequences; for instance, additional sequences may be added to the UTRs or even the CDS. As a result, the transcript coordinates can change from one version to the next. This is why it is important to report the version number of the transcript, e.g., report NM_012309.4 instead of NM_012309."
Does this mean that the start and end positions of the exons may change between versions of the same transcript ID? If so, why? In my view, these changes should warrant a new and independent ID.
Yes, exon coordinates can change between versions. Why this happens is probably best answered by people who do genome annotation extensively, but my understanding (as someone who has worked with transcript identifiers) is that assigning a new version to the transcript, rather than a different ID altogether, is easier to handle. Even if an exon boundary shifts by 2-3 bases, keeping a link between the old and new transcripts makes things easier to track across versions.
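To make the accession/version distinction concrete, here is a minimal Python sketch that splits an identifier such as NM_012309.4 into its stable accession and its version number. The function name `split_refseq_id` and the regular expression are my own illustration, not part of any RefSeq tool, and the pattern only covers NM_ accessions.

```python
import re

# Illustrative pattern for RefSeq mRNA accessions such as NM_012309.4.
# The ".version" suffix is optional; other RefSeq prefixes are not handled.
REFSEQ_RE = re.compile(r"^(NM_\d+)(?:\.(\d+))?$")


def split_refseq_id(identifier):
    """Return (accession, version); version is None when it is not given."""
    match = REFSEQ_RE.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an NM_ accession: {identifier!r}")
    accession, version = match.groups()
    return accession, (int(version) if version is not None else None)


print(split_refseq_id("NM_012309.4"))
print(split_refseq_id("NM_012309"))
```

The accession part stays the same across versions, which is what lets old and new records be linked; the version part is what tells you which sequence was actually used.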
Thank you for your answer.
If this is true, why doesn't BioMart take RefSeq transcript versions into account?
For example, BioMart doesn't recognise NM_004333.6, but it does recognise NM_004333. It then lets me extract the start and end positions of the exons.
The number of errors in this data is going to be proportional to the number of times the genomic coordinates change between versions of a transcript ID, isn't it? If a new version can change the genomic coordinates, then the version information is important.
Is this correct?
BioMart is a tool for extracting data about Ensembl transcripts, not RefSeq transcripts. It includes information about which RefSeq transcript an Ensembl transcript is most like, and therefore allows you to use the RefSeq ID to look up transcripts. However, what you get back is an Ensembl transcript, and there is never any guarantee that it is identical to the RefSeq transcript; it is just the closest match.
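If you want to check how close the match is for a given transcript, one option is to compare the exon coordinates you got from each source directly. Below is a rough Python sketch; the function `exon_differences` is hypothetical, the coordinates are invented purely for illustration (they are not real coordinates for any transcript), and it assumes both lists are sorted the same way and use the same coordinate system.

```python
def exon_differences(exons_a, exons_b):
    """Compare two exon lists of (start, end) tuples position by position.

    Returns a list of (exon_number, exon_in_a, exon_in_b) for every exon
    that differs; a missing exon is reported as None.
    """
    diffs = []
    for i in range(max(len(exons_a), len(exons_b))):
        a = exons_a[i] if i < len(exons_a) else None
        b = exons_b[i] if i < len(exons_b) else None
        if a != b:
            diffs.append((i + 1, a, b))
    return diffs


# Made-up coordinates: the last exon starts 3 nt later in the second source.
refseq_exons = [(1000, 1200), (1500, 1650), (2000, 2400)]
ensembl_exons = [(1000, 1200), (1500, 1650), (2003, 2400)]

print(exon_differences(refseq_exons, ensembl_exons))
```

An empty list would mean the two exon structures agree; anything else tells you which exons to look at before trusting the coordinates.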
Well, if changing the start/end coordinates of exons meant issuing a new transcript_id (and I can understand the arguments for that), then there would never be any transcript versions, because a transcript annotation is just the start and end coordinates of its exons and nothing else.
As to why the annotators decide to use versions rather than issuing a new transcript_id: imagine that there has been 20 years' worth of research on the transcript NM12345, and then someone discovers that the 5' end of the final exon is 3 nt out. It might even be that the 3' end of the penultimate exon is also 3 nt out, and that this happens to lead to no change in the coding sequence. Should all papers now cite a new transcript ID (NM12346), meaning that unless you were very involved in the subject, you would never connect the 20 years of literature on NM12345 to NM12346?