Question

TSS of protein coding genes

1

4.3 years ago

arsala521 ▴ 60

Hi everyone,

I want to get the transcription start sites (TSS) of all protein-coding genes in the genome. There are a couple of things I want to ask about.

I found two relevant files for gene coordinates on the UCSC download server: refGene.txt.gz (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz), and geneid.txt.gz (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/geneid.txt.gz). They appear to have the same fields. Can someone recommend which one should be used?

Also, can someone please suggest a way to extract only the protein-coding genes from these files?

Thanks in advance

TSS protein coding genes ucsc • 3.9k views

ADD COMMENT • link updated 20 months ago by ConvolutedGenome ▴ 50 • written 4.3 years ago by arsala521 ▴ 60

1

You should get the GTF file from GENCODE. That has information on protein coding transcripts and by extension, genes.
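As a rough illustration of this suggestion, here is a minimal Python sketch that keeps only protein-coding transcript records from GTF lines. It assumes the standard nine-column GTF layout and that the attribute column carries `transcript_type "protein_coding"` as in GENCODE files (Ensembl GTFs use `transcript_biotype` instead, so check your file). The function name, file name, and example records are made up for illustration.

```python
import re

# key "value" pairs in the GTF attribute column
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def protein_coding_transcripts(lines):
    """Yield (chrom, start, end, strand, transcript_id, gene_name)
    for every 'transcript' record whose transcript_type is protein_coding."""
    for line in lines:
        if line.startswith("#"):          # skip header comments
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if attrs.get("transcript_type") != "protein_coding":
            continue
        yield (cols[0], int(cols[3]), int(cols[4]), cols[6],
               attrs["transcript_id"], attrs.get("gene_name"))

# Made-up records, not real annotation
example = [
    "##description: toy example\n",
    'chr1\tHAVANA\ttranscript\t1000\t5000\t.\t+\t.\t'
    'gene_id "G_A"; transcript_id "TX_A1"; gene_type "protein_coding"; '
    'gene_name "GENE_A"; transcript_type "protein_coding";\n',
    'chr1\tHAVANA\ttranscript\t1200\t5000\t.\t+\t.\t'
    'gene_id "G_A"; transcript_id "TX_A2"; gene_type "protein_coding"; '
    'gene_name "GENE_A"; transcript_type "retained_intron";\n',
]

for record in protein_coding_transcripts(example):
    print(record)
```

For a real file you would pass an open handle instead of the list, for example `gzip.open("gencode.annotation.gtf.gz", "rt")` (hypothetical file name).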

ADD REPLY • link 4.3 years ago by Ram 44k

0

Thank you. It really helped.

ADD REPLY • link 4.3 years ago by arsala521 ▴ 60

0

I am quite confused by the GENCODE GTF file.

Within the GENCODE GTF file, I noticed that each (protein-coding) gene has multiple "transcript" entries. Am I right in saying that the start coordinate (for the + strand) or the end coordinate (for the - strand) of each transcript would be a TSS of that gene?

So for example, if gene A (+ strand) has 3 transcripts, am I right in saying that the START coordinates of these transcripts represent the 3 TSSs of gene A?

ADD REPLY • link 20 months ago by ConvolutedGenome ▴ 50

1

"Genes" could have multiple TSSs (at least theoretically). Each transcript has a transcription start site. Once you get to GTF level (which is a lot more technical than basic molecular biology), you've got to think with the transcript as the unit, not the gene.

"So for example, if gene A (+ strand) has 3 transcripts, am I right in saying that the START coordinates of these transcripts represent the 3 TSSs of gene A?"

Should've read this before typing my reply - you nailed it.
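To make the strand rule concrete, here is a small Python sketch. It assumes GTF-style coordinates (1-based, with start <= end regardless of strand), so the TSS is the start on the + strand and the end on the - strand. The gene and transcript names are hypothetical.

```python
def tss_of(start, end, strand):
    """Return the TSS of a GTF transcript record:
    'start' for + strand features, 'end' for - strand features."""
    if strand == "+":
        return start
    if strand == "-":
        return end
    raise ValueError(f"unexpected strand: {strand!r}")

# (gene, transcript, start, end, strand) - made-up values
transcripts = [
    ("GENE_A", "TX_A1", 1000, 5000, "+"),
    ("GENE_A", "TX_A2", 1200, 5000, "+"),
    ("GENE_A", "TX_A3", 1000, 4200, "+"),
    ("GENE_B", "TX_B1", 7000, 9000, "-"),
]

# One TSS per transcript, grouped by gene
tss_by_gene = {}
for gene, tx, start, end, strand in transcripts:
    tss_by_gene.setdefault(gene, {})[tx] = tss_of(start, end, strand)

print(tss_by_gene)
```

Note that two transcripts of the same gene can share a TSS (TX_A1 and TX_A3 here), so the number of distinct TSSs per gene can be smaller than the number of transcripts.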

ADD REPLY • link 20 months ago by Ram 44k

0

I see.

But consider this scenario: what if a particular transcript of a gene exists because the first exon was SPLICED OUT? As I understand it, GENCODE will not annotate the skipped exon as part of this transcript, so it would be wrong to define the start of this transcript as an independent TSS of that gene.

(E.g., transcript B could still be transcribed from the same promoter as transcript A and thus have the same TSS, but since the first exon of transcript B is SPLICED OUT, in the GENCODE GTF file it will look as if transcript B starts downstream of transcript A. I would then define 2 TSSs for this gene, although biologically they may share the same TSS after all.)

Do you think treating the start coordinate of each transcript of a gene as an independent TSS is still a legitimate assumption? (Or am I just being too paranoid here?)

ADD REPLY • link 20 months ago by ConvolutedGenome ▴ 50

1

This is uncomfortable territory for me - I'm a computer guy mostly. I think it's safe to say that each transcript has a TSS and the consensus can be stated as the gene's TSS. Another option might be the TSS of the canonical/MANE Select transcript. But you are probably justified in thinking that the transcript-level TSS is more accurate and accounts for more edge cases than any single TSS for a gene.
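If you go with the MANE Select option, one possible sketch in Python is below. It assumes a recent GENCODE GTF in which the chosen transcript carries a `tag "MANE_Select";` attribute; older releases may not have this tag, so verify it in your file first. The function name and example records are made up.

```python
import re

# key "value" pairs in the GTF attribute column
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def mane_select_tss(lines):
    """Map gene_id -> (transcript_id, TSS) for transcripts tagged MANE_Select."""
    result = {}
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue
        pairs = ATTR_RE.findall(cols[8])
        # 'tag' can appear several times, so collect all of its values
        tags = {value for key, value in pairs if key == "tag"}
        if "MANE_Select" not in tags:
            continue
        attrs = dict(pairs)
        # strand-aware TSS: start on +, end on -
        tss = int(cols[3]) if cols[6] == "+" else int(cols[4])
        result[attrs["gene_id"]] = (attrs["transcript_id"], tss)
    return result

# Made-up records, not real annotation
example = [
    'chr2\tHAVANA\ttranscript\t7000\t8500\t.\t-\t.\t'
    'gene_id "G_B"; transcript_id "TX_B1"; transcript_type "protein_coding"; tag "basic";\n',
    'chr2\tHAVANA\ttranscript\t7000\t9000\t.\t-\t.\t'
    'gene_id "G_B"; transcript_id "TX_B2"; transcript_type "protein_coding"; '
    'tag "basic"; tag "MANE_Select";\n',
]

print(mane_select_tss(example))
```

This gives one TSS per gene at the cost of ignoring alternative promoters, which is the trade-off discussed above.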

ADD REPLY • link 20 months ago by Ram 44k

0

I see.

Thank you for your input :) !!

ADD REPLY • link 20 months ago by ConvolutedGenome ▴ 50

score 4 · Accepted Answer · 2021-10-25

Hello,

Geneid Genes (geneid.txt.gz) contains predictions from an older gene prediction algorithm that is based on the genome sequence alone. It is only relevant when you are working on a particular locus where you think that the manually curated gene models (Ensembl and RefSeq) have errors.

UCSC RefSeq (refGene.txt.gz) contains NCBI RNA reference sequences aligned against the human genome using BLAT, the BLAST-Like Alignment Tool of the UCSC Genome Browser. The track shows known human protein-coding and non-protein-coding genes.

See our FAQ page for more information: http://genome.ucsc.edu/FAQ/FAQgenes.html#genename

You can use the Table Browser to extract the transcription start sites (TSS) of protein-coding genes. For example, to query UCSC RefSeq (refGene) on hg38, navigate to the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) and make the following selections:

Under Select dataset:

clade: Mammal

genome: Human

assembly: Dec. 2013 (GRCh38/hg38)

group: Genes and Gene Predictions

track: NCBI RefSeq

table: UCSC RefSeq (refGene)

Set “region:” to “genome”.

Click “create” next to “filter:”.

On the “Filter on Fields from hg38.refGene” page, find the row that reads “cdsEnd is”, change “ignored” to “!=”, type “cdsStart” in the box next to it, then click “submit”. This keeps only transcripts with a coding region, because non-coding transcripts have cdsStart equal to cdsEnd.

Set the output format to “Selected fields from primary and related tables”. This will allow you to select fields of interest. Click “get output”.

On the following page, scroll down to the Linked Tables section and select “hgFixed refLink”, then click “allow selection from checked tables”.

You can then select the following fields:

name: Name of gene

chrom: Reference sequence chromosome or scaffold

strand: + or - for strand

txStart: Transcription start position

protAcc: protein accession

Click “get output”.

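This should display all the protein-coding genes with their transcription start positions and protein accession numbers. Note that txStart is the TSS only for transcripts on the + strand; for transcripts on the - strand the TSS is txEnd, so also select txEnd if you need both strands.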
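If you would rather work from the downloaded refGene.txt.gz file than the Table Browser, the same filter can be sketched in Python. This assumes the genePred-style column order of the refGene table (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, ..., with name2 as the 13th column), 0-based start coordinates, and that non-coding transcripts have cdsStart equal to cdsEnd. The function names and example rows are made up for illustration.

```python
import csv
import gzip

def coding_tss(rows):
    """Yield (gene_symbol, accession, chrom, strand, tss) for coding transcripts.
    The TSS is reported 1-based: txStart + 1 on '+', txEnd on '-'."""
    for row in rows:
        name, chrom, strand = row[1], row[2], row[3]
        tx_start, tx_end, cds_start, cds_end = map(int, row[4:8])
        if cds_start == cds_end:      # no coding region: skip
            continue
        tss = tx_start + 1 if strand == "+" else tx_end
        yield row[12], name, chrom, strand, tss

def read_refgene(path):
    """Read a gzipped, tab-separated refGene table and yield coding TSS records."""
    with gzip.open(path, "rt") as handle:
        yield from coding_tss(csv.reader(handle, delimiter="\t"))

# Made-up rows in refGene column order, not real annotation
example_rows = [
    ["585", "NM_TEST1", "chr1", "+", "999", "5000", "1200", "4800",
     "2", "999,3000,", "2000,5000,", "0", "GENE_A", "cmpl", "cmpl", "0,1,"],
    ["585", "NR_TEST2", "chr1", "-", "7000", "9000", "9000", "9000",
     "1", "7000,", "9000,", "0", "GENE_NC", "none", "none", "-1,"],
    ["586", "NM_TEST3", "chr2", "-", "7000", "9000", "7100", "8900",
     "1", "7000,", "9000,", "0", "GENE_B", "cmpl", "cmpl", "0,"],
]

for record in coding_tss(example_rows):
    print(record)
```

For the real file, something like `read_refgene("refGene.txt.gz")` would iterate over all coding transcripts; several transcripts of one gene may share a TSS, so you may want to deduplicate afterwards.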

If you have any follow up questions, our public help desk can always be reached at genome@soe.ucsc.edu. You may also send questions to genome-www@soe.ucsc.edu if they contain sensitive data. For any Genome Browser questions on Biostars, the UCSC tag is the best way to ensure visibility by the team.