7.4 years ago
Marvin
Hello, I'm looking at a record from a GTF file:
18 protein_coding CDS 2554668 2554691 . - 2 gene_id "ENSG00000101574"; transcript_id "ENST00000576251"; exon_number "1"; gene_name "METTL4"; gene_biotype "protein_coding"; transcript_name "METTL4-010"; protein_id "ENSP00000460774";
If this is exon_number 1, how can it have a frame of 2? (I expected 0.)
According to the documentation, a frame of 2 means that the third base of this feature is the first base of a codon. So what about the first two bases of this sequence, given that this is exon 1? Where is the missing base? Do you know what I mean?
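To make the frame column concrete, here is a minimal Python sketch (not part of any library; the function names `parse_gtf_record` and `codon_layout` are made up for illustration). It assumes the usual GTF convention that the frame counts bases from the 5' end of the feature, which on the minus strand is the `end` coordinate, not the `start` coordinate. The record is the one pasted above, with the attribute column shortened.

```python
# Record copied from the question. Real GTF files are tab-separated;
# splitting on any whitespace is only a convenience for this pasted line.
RECORD = ('18 protein_coding CDS 2554668 2554691 . - 2 '
          'gene_id "ENSG00000101574"; transcript_id "ENST00000576251"; '
          'exon_number "1";')


def parse_gtf_record(line):
    """Pull the coordinate, strand and frame columns out of one GTF line."""
    fields = line.split(None, 8)  # 8 fixed columns, then the attribute string
    return {
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "frame": int(fields[7]),
    }


def codon_layout(start, end, strand, frame):
    """Describe where codons fall inside a CDS feature.

    Assumption: frame is the number of bases to skip at the 5' end of the
    feature before the first complete codon begins.
    """
    length = end - start + 1
    five_prime = start if strand == "+" else end
    step = 1 if strand == "+" else -1
    coding = length - frame  # bases left after skipping `frame` bases
    return {
        "length": length,
        "five_prime_end": five_prime,
        "first_full_codon_starts_at": five_prime + step * frame,
        "complete_codons": coding // 3,
        "trailing_bases": coding % 3,
    }


rec = parse_gtf_record(RECORD)
layout = codon_layout(rec["start"], rec["end"], rec["strand"], rec["frame"])
print(layout)
```

Under these assumptions the two skipped bases are 2554691 and 2554690, at the right-hand end of the interval, because the feature is on the minus strand. The first complete codon starts at 2554689 and reads toward smaller coordinates.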
It just clicked, and I now understand what you meant 2 weeks ago :D
I do not know how to explain it to others, but I highly recommend this: download the GTF file from the Ensembl FTP server and check out the following transcript:
You will notice it has 4 exons. Pick exon_number "1" and enter its coordinates into UCSC genome browser hg19 like this:
Notice how I extended the interval on both sides by 1 nucleotide. In UCSC you will now find the corresponding transcript among others. You will see that the "intron arrows" of this exon point to the LEFT (extending the interval by 1 base makes this visible). That means (as Devon said) that the gene is on the minus strand. And now you can clearly see why it is correct that the left-most position in this CDS does NOT have frame 0. The last exon (exon 4) has frame 0.
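If you would rather check frames by arithmetic than by eye, a small Python sketch follows. It assumes the standard frame (phase) rule: going through the CDS segments in transcript order (so from right to left on the minus strand), the frame of the next segment follows from the length and frame of the current one. The helper name `next_frame` is made up for this example.

```python
def next_frame(length, frame):
    """Frame expected for the CDS segment that follows this one.

    Assumption: segments are visited in transcript order, i.e. by
    decreasing coordinate for a minus-strand transcript.
    """
    # (length - frame) % 3 bases are left over at the 3' end of this segment;
    # the next segment must contribute the rest of that split codon.
    return (3 - ((length - frame) % 3)) % 3


# The CDS record from the question: 2554668..2554691, frame 2.
length = 2554691 - 2554668 + 1   # 24 bases
print(next_frame(length, 2))     # 2
```

For the record above, 22 bases remain after the two skipped ones: 7 complete codons plus 1 leftover base. The next CDS segment of the transcript therefore has to supply 2 bases to finish that codon, which corresponds to frame 2.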
I got it now, thanks for your reply Devon :)
Or you could just look at column 7 of the GTF, which gives the strand (+ or -).
I think you misunderstood the purpose of my post. The idea was not to go to UCSC to see which strand the gene is on. Instead, the idea was to walk through an example that makes you _understand_ (and see with your own eyes) why exon 1 doesn't necessarily have frame 0.