I have seen many, many papers mentioning "GENCODE TSS".
However, upon looking at the GENCODE GTF file downloaded from the GENCODE website (e.g. gencode.vXX.annotation.gtf.gz), I didn't see any obvious "TSS" entry.
So, how does one go about defining a "GENCODE TSS"? What does this term even mean?
My theory: within the GENCODE GTF file, I noticed that each (protein-coding) gene has multiple "transcript" entries. Am I right in saying that the start coordinate (for + strand genes) or the end coordinate (for - strand genes) of each transcript would be the TSSs of that gene?
So for example, if gene A (+ strand) has 3 transcripts, am I right in saying that the start coordinates of these transcripts represent the 3 TSSs of gene A?
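To make the theory concrete, here is a minimal Python sketch of it: take every "transcript" record in the GTF and use its start as the TSS on the + strand and its end on the - strand. The records and IDs below (GENE_A, TX_A1, ...) are made up for illustration and are not real GENCODE entries. The helper names are hypothetical too. This only shows one common way to read a TSS out of the annotation, not an official GENCODE definition.

```python
import re

# Hypothetical GTF records (tab-separated, 1-based inclusive coordinates).
# The IDs are invented for illustration; they are not real GENCODE entries.
SAMPLE_GTF = "\n".join(
    "\t".join(fields)
    for fields in [
        ["chr1", "HAVANA", "gene", "1000", "5000", ".", "+", ".", 'gene_id "GENE_A";'],
        ["chr1", "HAVANA", "transcript", "1000", "5000", ".", "+", ".",
         'gene_id "GENE_A"; transcript_id "TX_A1";'],
        ["chr1", "HAVANA", "exon", "1000", "1200", ".", "+", ".",
         'gene_id "GENE_A"; transcript_id "TX_A1";'],
        ["chr1", "HAVANA", "transcript", "1200", "5000", ".", "+", ".",
         'gene_id "GENE_A"; transcript_id "TX_A2";'],
        ["chr1", "HAVANA", "transcript", "1000", "4500", ".", "+", ".",
         'gene_id "GENE_A"; transcript_id "TX_A3";'],
        ["chr1", "HAVANA", "transcript", "8000", "9500", ".", "-", ".",
         'gene_id "GENE_B"; transcript_id "TX_B1";'],
    ]
)


def parse_attributes(field):
    """Parse quoted key/value pairs from the 9th GTF column into a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)"', field))


def transcript_tss(gtf_lines):
    """Yield (gene_id, transcript_id, chrom, strand, tss) per transcript record."""
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # only "transcript" features are used here
        chrom, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = parse_attributes(cols[8])
        # + strand: the 5' end is the start; - strand: the 5' end is the end.
        tss = start if strand == "+" else end
        yield attrs["gene_id"], attrs["transcript_id"], chrom, strand, tss


records = list(transcript_tss(SAMPLE_GTF.splitlines()))
for rec in records:
    print(*rec, sep="\t")
```

For a real file, the same function should work on lines read with `gzip.open(path, "rt")`, with the path to the downloaded annotation filled in.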
However, how do you handle the case where, for an alternative transcript of gene A, the first annotated exon is not the first transcribed exon (due to splicing)?
In this case, wouldn't it be wrong to define the start of that exon as the TSS? (The real TSS should be at the spliced-out exon instead.)
What do you guys think of this?
Uhhh? Aren't you mixing up the initiation of transcription and the initiation of translation here?
I don't think I'm confusing the two. What I meant is:
I originally said that the “start” coordinate of every entry termed “transcript” under each gene would be the TSSs for that gene… (so a gene would have 3 TSSs if that gene has 3 alternative transcripts)
But is this really the case every time? Like I said above, what if a particular alternative transcript of a gene exists as a result of first-exon skipping? As I understand it, GENCODE will not annotate the skipped exon as part of this transcript, so it would be wrong to define the start site of this transcript as an independent TSS of that gene
(e.g. transcript B could still be transcribed from the same promoter as transcript A and thus have the same TSS, but since the first exon is skipped in transcript B, in the GENCODE GTF file it will look as if the TSS of transcript B is downstream of that of transcript A, and I’ll define 2 TSSs for this gene, although biologically they may have the same TSS after all)
Does this make sense?
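A small Python sketch of the counting issue described here, using made-up transcript records (GENE_A, TX_A, TX_B, TX_C are hypothetical). It groups transcripts by their annotated 5' end, so transcripts that start at the same position collapse into one TSS, while a transcript annotated as starting further downstream shows up as a second TSS. Coordinates alone cannot tell whether that second position is a separate promoter or not; the sketch only shows what the annotation gives you.

```python
from collections import defaultdict

# Hypothetical transcript records: (gene_id, transcript_id, strand, start, end).
# TX_B is annotated as starting downstream of TX_A, as in the scenario above.
TRANSCRIPTS = [
    ("GENE_A", "TX_A", "+", 1000, 5000),
    ("GENE_A", "TX_B", "+", 2000, 5000),
    ("GENE_A", "TX_C", "+", 1000, 4200),
]


def distinct_tss(transcripts):
    """Group transcript IDs by annotated TSS position, per gene."""
    per_gene = defaultdict(lambda: defaultdict(list))
    for gene_id, tx_id, strand, start, end in transcripts:
        # Annotated 5' end: start on the + strand, end on the - strand.
        tss = start if strand == "+" else end
        per_gene[gene_id][tss].append(tx_id)
    # Convert to plain dicts with sorted positions and transcript IDs.
    return {
        gene: {pos: sorted(ids) for pos, ids in sorted(by_pos.items())}
        for gene, by_pos in per_gene.items()
    }


tss_by_gene = distinct_tss(TRANSCRIPTS)
for gene, by_pos in tss_by_gene.items():
    print(gene, len(by_pos), "distinct annotated TSS positions:", by_pos)
```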
Check out the ChIPseeker package (R/Bioconductor).
It has a function that gets you the promoter regions around each TSS:
getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
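For intuition about what the `upstream` and `downstream` arguments mean, here is a rough Python sketch of a strand-aware window around a TSS. This is not ChIPseeker's implementation, and the exact boundary conventions of `getPromoters` may differ. The function name `promoter_window` is hypothetical.

```python
def promoter_window(tss, strand, upstream=3000, downstream=3000):
    """Return (start, end) of a window around a TSS, 1-based inclusive.

    "Upstream" follows the direction of transcription, so on the - strand
    it lies at higher genomic coordinates than the TSS.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    # Clamp to the first base of the chromosome.
    return max(1, start), end


# Asymmetric values make the strand handling visible.
plus_window = promoter_window(10000, "+", upstream=3000, downstream=1000)
minus_window = promoter_window(10000, "-", upstream=3000, downstream=1000)
print(plus_window, minus_window)
```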
As you can see, it reaches out to the UCSC genome annotation. There are public datasets that have been deposited and used to find genomic elements such as TSSs. I guess the same applies to ENCODE.