Hello everyone, I have coordinates for a set of retained introns and I would like to obtain the Ensembl transcript IDs that contain these retained introns. I would also like to extract the "type" of each transcript (such as known protein coding or known nonsense-mediated decay).
I can do this manually by going to the UCSC browser, checking which Ensembl transcripts contain these retained introns, and then looking up the "type" of each transcript by its Ensembl ID on the Ensembl website. But I am wondering if there is a quicker way to do this. It would be very useful when I have a long list of retained introns.
I would appreciate any help. Thanks so much.
The retained intron category is itself the transcript type for transcripts that retain intronic regions. So you will not get other transcript types (such as protein coding or NMD) for transcripts classified as retained intron.
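For the bulk lookup itself (which annotated transcripts have an exon spanning a given interval, and what their biotype is), here is a minimal sketch in Python that works on GTF lines. It is only an illustration under some assumptions: the function names and the toy GTF records (chrT, TX_A, etc.) are made up, the biotype is read from the "transcript_biotype" attribute (Ensembl GTFs) or "transcript_type" (GENCODE GTFs), and "contains the retained intron" is taken to mean that a single exon fully covers the interval. The intron coordinates must be on the same assembly and use the same chromosome naming ("2" vs "chr2") as the annotation file.

```python
import re
from collections import namedtuple

Exon = namedtuple("Exon", "chrom start end transcript_id biotype")

def parse_gtf_exons(lines):
    """Yield Exon records from GTF lines (GTF is 1-based, end-inclusive)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # Attribute column looks like: key "value"; key "value"; ...
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        # Ensembl GTFs use transcript_biotype, GENCODE GTFs use transcript_type
        biotype = attrs.get("transcript_biotype") or attrs.get("transcript_type", "NA")
        yield Exon(fields[0], int(fields[3]), int(fields[4]),
                   attrs.get("transcript_id", "NA"), biotype)

def transcripts_containing(exons, chrom, start, end):
    """Return sorted (transcript_id, biotype) pairs for transcripts that
    have one exon fully spanning chrom:start-end."""
    hits = {(e.transcript_id, e.biotype)
            for e in exons
            if e.chrom == chrom and e.start <= start and e.end >= end}
    return sorted(hits)

# Toy annotation (hypothetical records, not real Ensembl data):
# TX_A splices out 201-299, TX_B and TX_C retain it.
TOY_GTF = [
    'chrT\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "TX_A"; transcript_biotype "protein_coding";',
    'chrT\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "TX_A"; transcript_biotype "protein_coding";',
    'chrT\ttoy\texon\t100\t400\t.\t+\t.\tgene_id "G1"; transcript_id "TX_B"; transcript_biotype "retained_intron";',
    'chrT\ttoy\texon\t100\t400\t.\t+\t.\tgene_id "G1"; transcript_id "TX_C"; transcript_biotype "nonsense_mediated_decay";',
]

exons = list(parse_gtf_exons(TOY_GTF))
for tx_id, biotype in transcripts_containing(exons, "chrT", 201, 299):
    print(tx_id, biotype)
```

For a real run you would pass the lines of an Ensembl or GENCODE GTF file instead of TOY_GTF and loop over your list of intron coordinates. For very long lists, an interval-overlap tool would be faster than this linear scan.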
Thanks so much for your comment. I have looked into several Ensembl-annotated transcripts that contain retained introns (they showed up in our splicing analysis) and most of them are classified as "known retained intron" by Ensembl. But a couple of them are instead grouped as "known protein coding", "known nonsense-mediated decay" or "known processed transcript".
Do you know if the transcripts that are classified as "known retained intron" are predicted not to undergo NMD and also not to have the potential to be translated? Thanks for your help.
Can you give me some examples please?
Sure. Here is one example from each subtype:
Example 1: Gene, ORMDL1; ENST00000458355; Chr2; coordinates of exon with retained intron (hg19), 190647147-190647849; subtype, protein coding
Example 2: Gene, SLC17A9; ENST00000488738; Chr20; coordinates of exon with retained intron (hg19), 61593975-61594721; subtype, known processed transcript
Example 3: Gene, PNISR; ENST00000478777; Chr6; coordinates of exon with retained intron (hg19), 99851704-99852578; subtype, known NMD
I took the Ensembl transcript that contains the exon with retained intron. It would be great to have your input. Thanks.
The genes will be protein coding (gene biotype), but they will have different transcripts, each with its own biotype, including non-coding transcript biotypes. Usually (if not always) the retained intron category is manually annotated by HAVANA based on their guidelines, and a few exceptions to the rule can cause the "discrepancies" that your splicing analysis has turned up.
For example 1, ENST00000458355 is not a retained intron transcript, although it may look like one at first glance (see more examples like this on page 34 of the HAVANA guidelines). The entire retained intron appears to be open and in-frame with its flanking coding exons, so it was annotated as coding. Moreover, as the new exon is at the 5' end of the transcript, the annotators have added the flag "alternative 5' UTR", which can be seen in the VEGA browser. From the Ensembl browser, you can jump directly to the VEGA counterpart. You may want to contact HAVANA, as an additional remark (flag) seems to be missing, i.e. "retained intron first" (page 39 of the guidelines).
For example 2, I'd have thought that ENST00000488738 is a retained intron transcript, so you had better check this directly with the HAVANA team (try the Gencode help email) so that they can explain why it has been annotated as a processed transcript.
Finally, example 3: the retained intron creates a premature stop codon which is more than 50 nt away from a downstream splice junction, which is why the transcript is annotated as NMD rather than retained intron. Check page 40 of the HAVANA guidelines for manual annotation.
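To make the 50 nt criterion concrete, here is a rough sketch of how such a check could be coded. This is only my simplified reading of the rule, not the HAVANA procedure: the function name is hypothetical, positions are 1-based on the spliced transcript, the distance is measured from the last base of the stop codon to the last exon-exon junction, and the exon lengths in the example are invented. The guidelines remain the reference for the exact definition.

```python
def is_nmd_candidate(exon_lengths, stop_end, min_distance=50):
    """Return True if the stop codon ends more than `min_distance` nt
    upstream of the last exon-exon junction.

    exon_lengths: exon lengths in 5'->3' transcript order.
    stop_end: 1-based position (on the spliced transcript) of the last
              base of the stop codon.
    """
    if len(exon_lengths) < 2:
        return False  # single-exon transcript: no splice junction at all
    # The last junction sits right after the final base of the penultimate exon
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_end > min_distance

# Invented transcript with three exons; last junction after base 350
exon_lengths = [200, 150, 300]
print(is_nmd_candidate(exon_lengths, stop_end=250))  # stop 100 nt upstream of the junction
print(is_nmd_candidate(exon_lengths, stop_end=320))  # stop 30 nt upstream of the junction
print(is_nmd_candidate(exon_lengths, stop_end=500))  # stop in the last exon
```

Under these assumptions only the first call reports an NMD candidate; the other two stops are either too close to the last junction or downstream of it.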
Thanks for all your suggestions and links. They are very helpful. I gave you three examples, but I have other transcripts in those three categories as well, so I am going to check them again. I am very interested in retained introns, so all the information you provided would help a lot.