I would like to find the lengths of the UTR3, UTR5 and ORF regions for each transcript of protein-coding genes. My plan was to get the ORF length from the coordinates of the start and stop codons, and the UTR lengths from the UTR coordinates. The problem is that some transcripts have 0 (or 2) start/stop codons and 0 (or multiple) UTRs, while I expected 1 start codon, 1 stop codon and 2 UTRs.
I could discard those transcripts by counting how many start/stop codons or UTRs each one has, but I wanted to know if there is a better way to do it.
Looking at the tags for some transcripts, I noticed that transcripts with multiple UTRs had the 'alternative_3_UTR'/'alternative_5_UTR' tag, transcripts with no stop or start codon had the 'cds_end_NF'/'cds_start_NF' tags, and transcripts with fewer than 2 UTRs had the 'mRNA_start_NF'/'mRNA_end_NF' tags.
After filtering on these tags, the number of transcripts with an unexpected number of start/stop codons or UTRs was greatly reduced but was not 0.
Am I making wrong assumptions about the tags? Am I missing something?
I am new to the topic and quite confused by this. I understand why a transcript can have no start/stop codon, but I don't get why some have 2. Looking at the coordinates, it seems that the 3 bases of a codon are split across different regions: one record has length 1 and the other length 2, so I guess that together they make up one codon, but noticing that didn't really help me understand how it works. Any suggestions on material where I could learn more are welcome.
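For illustration, here is a minimal Python sketch of the per-transcript approach described above, written to sum record lengths instead of counting records. It assumes a GENCODE-style GTF in which both UTRs are labelled "UTR" and CDS records exclude the stop codon; the function names, the tag filter list and the toy records are hypothetical. When a codon or a UTR spans an exon boundary, the GTF lists it as several records, so the summed length is the meaningful quantity (a split start codon still sums to 3).

```python
import re
from collections import defaultdict

# Tags used here to drop incomplete transcripts (an assumed filter list).
INCOMPLETE_TAGS = {"cds_start_NF", "cds_end_NF", "mRNA_start_NF", "mRNA_end_NF"}

def parse_attributes(field):
    """Split a GTF attribute column into (first value per key, set of tags)."""
    attrs, tags = {}, set()
    for key, value in re.findall(r'(\S+) "([^"]*)"', field):
        if key == "tag":
            tags.add(value)
        else:
            attrs.setdefault(key, value)
    return attrs, tags

def region_lengths(gtf_lines):
    """Return summed region lengths per transcript, skipping incomplete ones."""
    segments = defaultdict(list)  # transcript_id -> [(feature, start, end)]
    strands = {}
    incomplete = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        attrs, tags = parse_attributes(cols[8])
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        if tags & INCOMPLETE_TAGS:
            incomplete.add(tid)
        strands[tid] = cols[6]
        segments[tid].append((cols[2], int(cols[3]), int(cols[4])))

    result = {}
    for tid, segs in segments.items():
        cds = [(s, e) for f, s, e in segs if f == "CDS"]
        if tid in incomplete or not cds:
            continue
        cds_start = min(s for s, _ in cds)
        totals = {"utr5": 0, "cds": 0, "utr3": 0, "start_codon": 0, "stop_codon": 0}
        for feature, start, end in segs:
            length = end - start + 1  # GTF coordinates are 1-based, inclusive
            if feature == "CDS":
                totals["cds"] += length
            elif feature in ("start_codon", "stop_codon"):
                totals[feature] += length
            elif feature == "UTR":
                # Lower coordinates than the CDS means 5' on "+", 3' on "-".
                upstream = end < cds_start
                five_prime = upstream if strands[tid] == "+" else not upstream
                totals["utr5" if five_prime else "utr3"] += length
        result[tid] = totals
    return result

def row(feature, start, end, strand, tid, tag=None):
    """Build one toy GTF line (made-up data, not from a real annotation)."""
    attrs = f'gene_id "G1"; transcript_id "{tid}";'
    if tag:
        attrs += f' tag "{tag}";'
    return "\t".join(["chr1", "toy", feature, str(start), str(end), ".", strand, ".", attrs])

toy = [
    row("UTR", 50, 99, "+", "T1"),
    row("start_codon", 100, 100, "+", "T1"),  # first base of the codon
    row("start_codon", 201, 202, "+", "T1"),  # remaining two bases, next exon
    row("CDS", 100, 100, "+", "T1"),
    row("CDS", 201, 289, "+", "T1"),
    row("stop_codon", 290, 292, "+", "T1"),
    row("UTR", 293, 320, "+", "T1"),          # 3' UTR, first exon segment
    row("UTR", 401, 450, "+", "T1"),          # 3' UTR, second exon segment
    row("CDS", 500, 559, "+", "T2", tag="cds_start_NF"),
]
lengths = region_lengths(toy)
print(lengths)
```

In this toy input, T1 has two start_codon records and three UTR records, yet it still ends up with one start codon of 3 bases, one 5' UTR and one 3' UTR once the pieces are summed. If the ORF length should include the stop codon, the stop_codon total would need to be added to the CDS total under the assumption stated above.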
Hi,
You could use the UCSC Table Browser: select the assembly, choose the track you want from the group "Genes and Gene Predictions", and select BED as the output format. After clicking "get output", you will be redirected to another page where you can choose to create one BED record per 5' UTR Exons or 3' UTR Exons.
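Since that output has one BED record per UTR exon, a UTR that spans several exons shows up as several records for the same transcript, and the UTR length is the sum over them. A minimal Python sketch follows; it assumes the name column starts with the transcript ID followed by "_utr5" or "_utr3" (worth checking against your own output), and the function name and toy records are hypothetical.

```python
from collections import defaultdict

def utr_lengths_from_bed(bed_lines):
    """Sum UTR exon lengths per transcript from BED records."""
    totals = defaultdict(int)
    for line in bed_lines:
        # Skip blank lines and any header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        cols = line.split()
        start, end, name = int(cols[1]), int(cols[2]), cols[3]
        # Assumed name layout: <transcript_id>_utr3_... or <transcript_id>_utr5_...
        transcript_id = name.split("_utr")[0]
        totals[transcript_id] += end - start  # BED is 0-based, half-open
    return dict(totals)

# Made-up records: TX1 has a 3' UTR split over two exons, TX2 has one piece.
toy_bed = [
    "chr1\t100\t150\tTX1_utr3_0_0_chr1_101_f\t0\t+",
    "chr1\t300\t340\tTX1_utr3_1_0_chr1_301_f\t0\t+",
    "chr1\t900\t925\tTX2_utr3_0_0_chr1_901_f\t0\t+",
]
utr3 = utr_lengths_from_bed(toy_bed)
print(utr3)
```

Run it once on the 5' UTR export and once on the 3' UTR export to get the two lengths per transcript.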
I don't see my reply showing up, so I guess I didn't send it yesterday. What I said was that the result I get from the UCSC Table Browser seems to be the same as what I got from the GTF file: there are still multiple UTRs for the same transcript ID, even after filtering out alternative UTRs. I don't know how to handle those. Shouldn't a transcript have only one 3' UTR and one 5' UTR?