Could anyone please clarify whether or not in computational studies one can treat the transcription start site (TSS) as being equivalent to the first exon?
Some seemingly contradictory quotes (and sources) that I've found on this issue:
Promoter sequences are usually the sequence immediately upstream the transcription start site (TSS) or first exon.
SOURCE: http://www.protocol-online.org/forums/blog/4/entry-10-from-how-to-find-promoter-sequences/
The TSS is the first nucleotide of the UTR (at least I think so, I don't think there's any gene which immediately begins with ATG), so yes, UTRs can also be 'relaxed' and differ in length.
No, the first codon of the first exon is the start codon "ATG" which also codes for methionine. This is called the translation start site. The transcription start site is where the RNA polymerase binds to in the 5' UTR upstream of the start codon. IMHO Maybe someone else can elaborate more. I dont want to give you the incorrect info.
SOURCE: http://seqanswers.com/forums/archive/index.php/t-12773.html
Could you please elaborate on Steve Lianoglou's comment below?
That comment was attached to my answer which I should not have posted whilst half-asleep :) and have since deleted. I was thinking exclusively about protein-coding genes, in which the first exon is the translational start of the protein. However, exon can also be defined as "what's left in the mature RNA after splicing" and may have nothing to do with protein coding. In any case, TSS is not equivalent.
Sorry to be a mosquito, but I'll still argue that even in protein coding genes, the first exon is not defined by the translation start. An exon is (and should only ever be) defined by "what's left in the mature RNA after splicing", and (for instance) the "spliced bits" in the 5'UTR of the human ALAS1 gene are still called "exons."
Exons are defined by the [struck through: splicing machinery] RNA processing machinery, not the translational machinery. I've struck through "splicing machinery" because we have cases like XBP1, which is post-transcriptionally processed by ERN1 that splits one exon into two -- and I don't think anyone would call ERN1 part of any splicing machinery.
All that having been said -- do you have any references where people are actually going by the definition you are proposing?
I'm not proposing a definition, I'm using sloppy language which you are doing a great job of making less sloppy. I agree, exons are not "defined" by translational machinery. I guess I was trying to keep things simple in the context of the original question, to which the answer is "no, TSS is not first exon".
P.S. in my day, i.e. about 20 years ago, we remembered which were introns and which were exons by "exons are expressed". This newfangled definition of "exons are what's left after splicing" does not sit well in my old brain at all :)
Please give an example in which the TSS is not the start of the first exon. I have shown below that for all the major mouse annotations TSS == start of first exon. I would say biochemically the reason for this, at least for protein coding mRNAs, is the 5' cap which gets attached at transcription initiation to the 5' end and is necessary for export, translation (splicing?) etc...
Circular RNAs might be a counterexample.
How I wish I could turn back time and start again with my answers; it appears I'm just confusing myself and everyone else :)
OK: if we are including UTRs in the definition of exon and if we're assuming that transcript starts in the UCSC database really are transcript starts (I have always wondered how many are experimentally-determined) then yes - the TSS is equivalent to the first position in the first exon.
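Under those assumptions the annotated TSS falls straight out of the exon coordinates, as long as strand is respected. A minimal sketch in Python; the function names and the example coordinates are made up for illustration, and coordinates are assumed to be 1-based inclusive as in GTF:

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed 1-based inclusive as in GTF


def first_exon(exons: List[Exon], strand: str) -> Exon:
    """Return the 5'-most exon of a transcript, taking strand into account."""
    if strand == "+":
        # Plus strand: the first exon has the smallest start coordinate.
        return min(exons, key=lambda e: e[0])
    if strand == "-":
        # Minus strand: the first exon has the largest end coordinate.
        return max(exons, key=lambda e: e[1])
    raise ValueError(f"unknown strand: {strand!r}")


def annotated_tss(exons: List[Exon], strand: str) -> int:
    """Annotated TSS, taken as the 5' end of the first exon (UTR included)."""
    start, end = first_exon(exons, strand)
    return start if strand == "+" else end


# Hypothetical transcript with three exons.
exons = [(1000, 1200), (1500, 1650), (2000, 2400)]
plus_tss = annotated_tss(exons, "+")   # 5' end of exon (1000, 1200)
minus_tss = annotated_tss(exons, "-")  # 5' end of exon (2000, 2400)
```

Note that on the minus strand the TSS is the largest coordinate of the transcript, which is easy to get wrong when reading exon tables by eye.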
If you're an old fart like me who was taught that exon = expressed region - which is now incorrect - then TSS is nothing to do with exons at all.
I hope this helps.
The sequence ontology also defines the exon as:
A region of the transcript sequence within a gene which is not removed from the primary RNA transcript by RNA splicing.
But of course, one thing I learned in biology is that there are always exceptions, as Steve Lianoglou points out.
How does one computationally find the TSS of a gene? What if there are multiple genes involved and you must resort to computational measures (not experimental measures like RACE)?
So far as I know, whilst there are computational methods for TSS prediction (which you can find by web/literature search), only experimental methods such as RACE provide this information.
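If annotated (rather than experimentally verified) start sites are acceptable, one shortcut for many genes at once is to read the transcript 5' ends from an annotation file. A rough sketch, assuming a GTF that contains "transcript" records with a gene_id attribute; the records and the function name tss_by_gene below are invented for illustration, and the result is only as good as the annotation:

```python
import re
from collections import defaultdict

# Hypothetical GTF records (tab-separated, 1-based inclusive coordinates).
GTF_TEXT = "\n".join([
    'chr1\tdemo\ttranscript\t1000\t2400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\tdemo\ttranscript\t1100\t2400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.2";',
    'chr1\tdemo\texon\t1000\t1200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr2\tdemo\ttranscript\t5000\t7000\t.\t-\t.\tgene_id "geneB"; transcript_id "geneB.1";',
])


def tss_by_gene(gtf_text):
    """Map each gene_id to the sorted list of its distinct annotated TSS positions."""
    tss = defaultdict(set)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are used here
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match is None or strand not in ("+", "-"):
            continue  # skip records without a gene_id or a defined strand
        # The 5' end is the start on the plus strand and the end on the minus strand.
        tss[match.group(1)].add(start if strand == "+" else end)
    return {gene: sorted(positions) for gene, positions in tss.items()}


annotated = tss_by_gene(GTF_TEXT)
```

A gene with several transcripts can come back with several TSS positions, so any downstream analysis has to decide which one (or all) to use.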