Geneid Genes (geneid.txt.gz) is an older transcript predictor algorithm that is based on the genome sequence alone and only relevant when you are working on a particular locus where you think that the manually curated gene models (Ensembl and RefSeq) have errors.
UCSC RefSeq (refGene.txt.gz) is NCBI RNA reference sequences aligned against the human genome using the Blast-Like Alignment Tool of the UCSC Genome Browser. The track shows known human protein-coding and non-protein-coding genes.
You can use the Table Browser to extract the transcription start sites (TSSs) of protein-coding genes. For example, to query UCSC RefSeq (refGene) on hg38, navigate to the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) and make the following selections:
Under Select dataset:
clade: Mammal
genome: Human
assembly: Dec. 2013 (GRCh38/hg38)
group: Genes and Gene Predictions
track: NCBI RefSeq
table: UCSC RefSeq (refGene)
Set the region: to “genome”
Click create next to “filter:”
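On the “Filter on Fields from hg38.refGene” page, find the cdsEnd row, change “ignored” to “!=”, and enter “cdsStart” in the text box next to it. This keeps only entries with a coding region (cdsEnd != cdsStart), i.e. protein-coding transcripts. Then click submit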
Set the output format to “Selected fields from primary and related tables”. This will allow you to select fields of interest. Click get output
On the following page, scroll down to the Linked Tables section and select "hgFixed refLink" then click allow selection from checked tables
You can then select the following fields:
name: Name of gene
chrom: Reference sequence chromosome or scaffold
strand: + or - for strand
txStart: Transcription start position
protAcc: Protein accession
Click get output
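This should display all the genes with their transcription start sites and protein accession numbers. Note that txStart is the TSS only for genes on the + strand; for genes on the - strand the TSS is txEnd, so select txEnd as well if you need both strands. Starts in these tables are 0-based and ends are 1-based.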
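As a rough sketch of how the Table Browser output could be turned into strand-aware TSS positions: UCSC tables store txStart as 0-based and txEnd as 1-based, and txStart is always the smaller coordinate, so the TSS is txStart + 1 on the + strand and txEnd on the - strand (both in 1-based coordinates). The function names below are made up for illustration, and the code assumes the output is tab-separated with a header line and includes txEnd in addition to the fields listed above. The header clean-up (dropping a leading "#" and any "db.table." prefix) is also an assumption about how the columns are labelled.

```python
import csv
import io


def refgene_tss(row):
    """Return the 1-based TSS for one refGene row (a dict of strings)."""
    if row["strand"] == "+":
        return int(row["txStart"]) + 1  # 0-based start -> 1-based position
    return int(row["txEnd"])  # the half-open end equals the 1-based last base


def read_tss(handle):
    """Read tab-separated Table Browser output; return (name, chrom, strand, tss) tuples."""
    rows = csv.reader(handle, delimiter="\t")
    # Assumed header clean-up: "#hg38.refGene.name" or "#name" -> "name"
    header = [col.lstrip("#").split(".")[-1] for col in next(rows)]
    result = []
    for values in rows:
        row = dict(zip(header, values))
        result.append((row["name"], row["chrom"], row["strand"], refgene_tss(row)))
    return result


if __name__ == "__main__":
    # Toy input with invented accessions and coordinates
    demo = (
        "#name\tchrom\tstrand\ttxStart\ttxEnd\n"
        "NM_A\tchr1\t+\t100\t500\n"
        "NM_B\tchr1\t-\t100\t500\n"
    )
    for record in read_tss(io.StringIO(demo)):
        print(record)
```

If the real output labels its columns differently, only the header clean-up line should need changing.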
If you have any follow up questions, our public help desk can always be reached at genome@soe.ucsc.edu. You may also send questions to genome-www@soe.ucsc.edu if they contain sensitive data. For any Genome Browser questions on Biostars, the UCSC tag is the best way to ensure visibility by the team.
You should get the GTF file from GENCODE. That has information on protein coding transcripts and by extension, genes.
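For anyone who wants to script this, here is a minimal sketch of pulling one TSS per transcript out of a GTF. It assumes the standard 9-column GTF layout with 1-based inclusive coordinates, and it only looks at lines whose feature type is "transcript". The function name is hypothetical, and the attribute parsing is a simple regex over quoted key/value pairs, not a full GTF parser.

```python
import re

# Matches quoted attributes such as: gene_id "ENSG..."; transcript_id "ENST...";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def transcript_tss(gtf_lines):
    """Yield (gene_id, transcript_id, chrom, strand, tss) for each transcript line."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # keep only transcript features
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        attrs = dict(ATTR_RE.findall(fields[8]))
        # The TSS is the 5' end: start on the + strand, end on the - strand
        tss = start if strand == "+" else end
        yield attrs["gene_id"], attrs["transcript_id"], chrom, strand, tss
```

For a gzipped download, the lines could be supplied with `gzip.open(path, "rt")`, where `path` is wherever you saved the file.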
Thank you. It really helped.
I am quite confused by the GENCODE GTF file.
Within the GENCODE GTF file, I noticed that each (protein-coding) gene has multiple "transcript" entries. Am I right in saying that the start/end coordinate (for the + and - strand, respectively) of each transcript of a gene would be a TSS of that gene?
So for example, if gene A (+ strand) has 3 transcripts, am I right in saying that the START coordinates of these transcripts represent the 3 TSSs of gene A?
"Genes" can have multiple TSSs (at least theoretically). Each transcript has a transcription start site. Once you get to the GTF level (which is a lot more technical than basic molecular biology), you've got to think with the transcript as the unit, not the gene.
Should've read this before typing my reply - you nailed it.
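To make the "transcript as the unit" idea concrete, here is a small sketch that collapses transcript records into the distinct TSS positions of each gene. The input is assumed to be tuples of (gene_id, transcript_id, strand, start, end) in 1-based GTF coordinates. The function name and the toy gene/transcript IDs are invented for illustration.

```python
from collections import defaultdict


def tss_per_gene(transcripts):
    """Map each gene to {tss_position: [transcript_ids]} using the 5' end of each transcript."""
    by_gene = defaultdict(lambda: defaultdict(list))
    for gene_id, tx_id, strand, start, end in transcripts:
        tss = start if strand == "+" else end  # 5' end depends on strand
        by_gene[gene_id][tss].append(tx_id)
    # Convert to plain dicts with TSS positions in ascending order
    return {gene: dict(sorted(sites.items())) for gene, sites in by_gene.items()}


if __name__ == "__main__":
    # Gene A on the + strand with 3 transcripts, two of which share a start
    records = [
        ("geneA", "tx1", "+", 1000, 5000),
        ("geneA", "tx2", "+", 1000, 4000),
        ("geneA", "tx3", "+", 1800, 5000),
    ]
    print(tss_per_gene(records))
```

Transcripts that share a 5' coordinate end up under the same TSS, so a gene with 3 transcripts may have fewer than 3 distinct annotated TSSs.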
I see.
But consider this scenario: what if a particular transcript of a gene exists as a result of its first exon being SPLICED OUT? As I understand it, GENCODE will not annotate the skipped exon as part of this transcript, so it would be wrong to define the start site of this transcript as an independent TSS of that gene.
(E.g. transcript B can still be biologically transcribed from the same promoter as transcript A and thus have the same TSS, but since the first exon of transcript B is SPLICED OUT, in the GENCODE GTF file it will look as if transcript B starts downstream of transcript A, and I'll define 2 TSSs for this gene, although biologically they may have the same TSS after all.)
Do you think treating the start coordinate of each transcript of a gene as an independent TSS is still a legitimate assumption? (Or am I just being too paranoid here?)
This is uncomfortable territory for me - I'm a computer guy mostly. I think it's safe to say that each transcript has a TSS, and the consensus can be stated as the gene's TSS. Another option might be the TSS of the canonical/MANE-Select transcript. But you are probably justified in thinking that the transcript-level TSS is more accurate, and accounts for more edge cases, than any single TSS for a gene.
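If you go the canonical/MANE-Select route, one way to sketch it is to pick, per gene, the transcript carrying a preferred tag and report its TSS. This assumes a GENCODE GTF in which transcript lines carry attributes such as `tag "MANE_Select";` or `tag "Ensembl_canonical";` (present in recent human releases; older files may lack them). The function name and the tag priority order are my own choices for illustration.

```python
import re


def representative_tss(gtf_lines, preferred_tags=("MANE_Select", "Ensembl_canonical")):
    """Return {gene_id: (transcript_id, tss)} using the highest-priority tagged transcript."""
    best = {}  # gene_id -> (rank, transcript_id, tss); lower rank is better
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = fields[8]
        tags = set(re.findall(r'tag "([^"]*)"', attrs))
        ranks = [i for i, tag in enumerate(preferred_tags) if tag in tags]
        if not ranks:
            continue  # transcript carries none of the preferred tags
        gene_id = re.search(r'gene_id "([^"]*)"', attrs).group(1)
        tx_id = re.search(r'transcript_id "([^"]*)"', attrs).group(1)
        # 1-based GTF coordinates: 5' end is start on +, end on -
        tss = int(fields[3]) if fields[6] == "+" else int(fields[4])
        rank = min(ranks)
        if gene_id not in best or rank < best[gene_id][0]:
            best[gene_id] = (rank, tx_id, tss)
    return {gene: (tx, tss) for gene, (rank, tx, tss) in best.items()}
```

Genes with no tagged transcript are simply absent from the result, so you would still need a fallback (for example, all transcript-level TSSs) for those.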
I see.
Thank you for your input :) !!