TSS of canonical transcripts
4.2 years ago
arsala521 ▴ 60

Hi everyone,

I need the TSS (transcription start site) of every protein-coding gene in the human genome. I only want to focus on canonical transcripts and want one TSS per gene. Can someone please tell me which file, from which source, provides this information?

I tried the refGene.txt file from UCSC and the GENCODE GTF file for basic gene annotation, but both provide multiple TSSs for a single gene. I looked into the refseq_select dataset (which consists of a representative transcript for every gene), but I think it hasn't been quality checked and released.

Any suggestions would be really helpful.

TSS canonical transcripts • 2.8k views
4.2 years ago
Amitm ★ 2.3k

Hi, have you seen the 'MANE' project? Details here. The 'MANE Select' set provides one transcript per gene. There are multiple ways to access the data; details are towards the bottom of the above page. I was hoping to get that info from Ensembl BioMart, but when I checked it seems that has not been implemented yet.


Thank you. It really helped, but I think it doesn't cover all protein-coding genes. Analyzing its GTF file, I found it has 16,230 genes in total.

4.2 years ago
ATpoint 85k

As you are interested in protein-coding genes, you could use the APPRIS database. It provides one so-called PRINCIPAL isoform per gene, which is supposed to be the biologically most meaningful one; check the documentation and paper for details. There are still some genes that have multiple principal isoforms or where assignments are ambiguous, but I guess this is a good starting point. APPRIS provides this information for download from its website, and APPRIS scores can also be obtained, e.g., from Ensembl via the R package biomaRt.


Thank you. APPRIS does provide the required information, but I am not sure how to extract it. The APPRIS file for the GENCODE 34 dataset looks like this:

SCYL3   ENSG00000000457 ENST00000367772 CCDS1287.1  PRINCIPAL:4
SCYL3   ENSG00000000457 ENST00000367771 CCDS1286.1  ALTERNATIVE:2
SCYL3   ENSG00000000457 ENST00000367770 CCDS1287.1  PRINCIPAL:4
C1orf112    ENSG00000000460 ENST00000359326 CCDS1285.1  PRINCIPAL:1
C1orf112    ENSG00000000460 ENST00000286031 CCDS1285.1  PRINCIPAL:1

and I need the start positions of these transcripts. On the GENCODE website, I can only find the file for GENCODE 35. It provides the transcript start site along with transcript IDs, but the transcript IDs do not match the ones in the APPRIS file. Do you know any way I can get transcript start sites using this file? Thanks for the help.
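
For reference, the rows shown above can be reduced to the PRINCIPAL transcripts per gene with a few lines of Python. This is only a sketch: it assumes whitespace-separated columns in the order shown (gene name, gene ID, transcript ID, CCDS ID, APPRIS label, with the label always last), and the function name `principal_transcripts` is made up for illustration.

```python
from collections import defaultdict


def principal_transcripts(lines):
    """Map each gene ID to the transcript IDs whose APPRIS label is PRINCIPAL."""
    by_gene = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        # Assumed layout: gene ID in column 2, transcript ID in column 3,
        # APPRIS label in the last column.
        gene_id, transcript_id, label = fields[1], fields[2], fields[-1]
        if label.startswith("PRINCIPAL"):
            by_gene[gene_id].append(transcript_id)
    return dict(by_gene)


# The five rows quoted in the post above.
appris_rows = [
    "SCYL3   ENSG00000000457 ENST00000367772 CCDS1287.1  PRINCIPAL:4",
    "SCYL3   ENSG00000000457 ENST00000367771 CCDS1286.1  ALTERNATIVE:2",
    "SCYL3   ENSG00000000457 ENST00000367770 CCDS1287.1  PRINCIPAL:4",
    "C1orf112    ENSG00000000460 ENST00000359326 CCDS1285.1  PRINCIPAL:1",
    "C1orf112    ENSG00000000460 ENST00000286031 CCDS1285.1  PRINCIPAL:1",
]

principal = principal_transcripts(appris_rows)
```

Both genes still end up with two PRINCIPAL transcripts each, so getting down to one TSS per gene needs an extra tie-breaking rule that this sketch does not choose.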


The transcript IDs are ENST... identifiers, which are identical to the ones GENCODE uses. GENCODE IDs might carry version numbers, e.g. ENSMUST00011.1, so you might need to remove the part after the dot. From there it is custom filtering. No, I do not have ready-to-use code. You need to first filter the APPRIS file to get the transcript IDs you want, and then extract the TSS from the GENCODE file using that information. It is cumbersome, yes.
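
The two steps described above can be sketched roughly as follows in Python. This is a minimal, untuned sketch rather than tested pipeline code. It assumes a standard nine-column, tab-separated GTF with 1-based coordinates and a `transcript_id "..."` attribute, and it takes the TSS to be the start coordinate on the `+` strand and the end coordinate on the `-` strand. The helper names and the coordinates in the toy GTF lines are invented for illustration and are not real annotation.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def strip_version(transcript_id):
    """Drop the version suffix, e.g. 'ENST00000367772.8' -> 'ENST00000367772'."""
    return transcript_id.split(".", 1)[0]


def tss_for_transcripts(gtf_lines, wanted_ids):
    """Return {transcript ID: (chrom, TSS, strand)} for the wanted transcripts."""
    tss = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # only transcript-level records carry the full span
        match = TRANSCRIPT_ID.search(cols[8])
        if match is None:
            continue
        tid = strip_version(match.group(1))
        if tid in wanted_ids:
            start, end, strand = int(cols[3]), int(cols[4]), cols[6]
            # TSS is the 5' end: start on '+', end on '-'.
            tss[tid] = (cols[0], start if strand == "+" else end, strand)
    return tss


# Toy GTF lines with INVENTED coordinates, for illustration only.
toy_gtf = [
    "##description: toy example\n",
    'chr1\tHAVANA\ttranscript\t1000\t5000\t.\t+\t.\tgene_id "ENSG00000000457.14"; transcript_id "ENST00000367772.8";\n',
    'chr1\tHAVANA\texon\t1000\t1200\t.\t+\t.\tgene_id "ENSG00000000457.14"; transcript_id "ENST00000367772.8";\n',
    'chr1\tHAVANA\ttranscript\t2000\t9000\t.\t-\t.\tgene_id "ENSG00000000460.17"; transcript_id "ENST00000359326.9";\n',
]

wanted = {"ENST00000367772", "ENST00000359326"}
result = tss_for_transcripts(toy_gtf, wanted)
```

For a real run, `gtf_lines` could be an open file handle on the decompressed GENCODE GTF, and `wanted` the set of PRINCIPAL transcript IDs taken from the APPRIS file.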


Thank you. I can work that way. The APPRIS file is for the GENCODE 34 dataset and the GTF file from GENCODE is for release 35. Would it be okay to match transcript IDs between these two files?


I guess that is fine. I would be surprised if a lot of things changed between two versions.
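
If you want to check that rather than guess, a quick comparison of the two ID sets shows how many APPRIS transcripts are actually found in the other release. A small Python sketch, where `match_report` is a made-up helper name and version suffixes are stripped on both sides before comparing:

```python
def match_report(appris_ids, gtf_ids):
    """Return (number of APPRIS IDs found in the GTF, sorted list of missing IDs)."""
    wanted = {i.split(".", 1)[0] for i in appris_ids}
    available = {i.split(".", 1)[0] for i in gtf_ids}
    missing = sorted(wanted - available)
    return len(wanted & available), missing


# Toy input: one ID is present in the GTF (with a version suffix), one is not.
found, missing = match_report(
    ["ENST00000367772", "ENST00000367771"],
    ["ENST00000367772.8"],
)
```

A short `missing` list would support matching across the two releases; a long one would suggest getting the GTF for the same release as the APPRIS file instead.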
