What are the best databases for looking up the transcription start sites of specific genes in the human genome?
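# Columns of this UCSC table: 2=transcript, 3=chrom, 4=strand, 5=txStart, 6=txEnd, 7=cdsStart, 8=cdsEnd.
# Note: $7/$8 are the CDS bounds, so as written this reports the coding-sequence start of each
# coding transcript (1 bp on "+", 3 bp on "-"). For the transcription start site, use $5/$6 instead.
wget -q -O - "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/wgEncodeGencodeBasicV19.txt.gz" |
  gunzip -c |
  awk '(int($7) < int($8)) {
    if ($4 == "+") { printf("%s\t%d\t%d\t%s\t%s\n", $3, $7, int($7)+1, $2, $4); }
    else           { printf("%s\t%d\t%d\t%s\t%s\n", $3, int($8)-3, $8, $2, $4); }
  }'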
chr1 69090 69091 ENST00000335137.3 +
chr1 139306 139309 ENST00000423372.3 -
chr1 367658 367659 ENST00000426406.1 +
chr1 622031 622034 ENST00000332831.2 -
chr1 739134 739137 ENST00000599533.1 -
chr1 818042 818043 ENST00000594233.1 +
chr1 861321 861322 ENST00000342066.3 +
chr1 866442 866445 ENST00000598827.1 -
chr1 894617 894620 ENST00000327044.6 -
chr1 896073 896074 ENST00000338591.3 +
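In UCSC genePred tables with a leading bin column, fields 5 and 6 are txStart and txEnd, while fields 7 and 8 are cdsStart and cdsEnd. As a rough sketch (not the original poster's command), the TSS could be taken from the transcript bounds like this. The function name and the example row are hypothetical, and the input is assumed to be tab-separated with 0-based, half-open coordinates:

```python
def tss_from_genepred_line(line):
    """Return a 1-bp BED-style TSS record (chrom, start, end, name, strand)
    from one tab-separated genePred row that has a leading 'bin' column."""
    fields = line.rstrip("\n").split("\t")
    name, chrom, strand = fields[1], fields[2], fields[3]
    tx_start, tx_end = int(fields[4]), int(fields[5])  # 0-based start, exclusive end
    if strand == "+":
        start = tx_start       # first transcribed base on the forward strand
    else:
        start = tx_end - 1     # last base of the interval is the TSS on the reverse strand
    return (chrom, start, start + 1, name, strand)


# Hypothetical row, for illustration only (not taken from the real table).
example = "\t".join(["585", "tx_minus", "chrT", "-", "100", "500", "150", "450"])
print(tss_from_genepred_line(example))
```

To run it over the downloaded file, the function can be applied to each line of the table opened with gzip.open(..., "rt").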
Basically any GTF file will do, whether from RefSeq, Ensembl, or GENCODE. The TSS is the start coordinate of the entries with type transcript. Be aware that for genes on the bottom (minus) strand it is the end coordinate instead. Some GTFs even include a TSS entry that you can use directly.
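The strand rule can be sketched in a few lines of Python. This is only an illustration: it assumes a standard 9-column, tab-separated GTF with 1-based inclusive coordinates and a transcript_id attribute, and the function name is made up for the example:

```python
import re

# Pulls the transcript_id value out of the GTF attribute column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def tss_from_gtf_lines(lines):
    """Yield (chrom, tss, strand, transcript_id) for every 'transcript' feature.

    Coordinates stay 1-based, as in the GTF itself.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript-level records carry the TSS
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        match = TRANSCRIPT_ID.search(fields[8])
        transcript_id = match.group(1) if match else None
        # Minus-strand transcripts start at the END coordinate of the record.
        tss = end if strand == "-" else start
        yield chrom, tss, strand, transcript_id
```

Passing an open file handle, for example open("annotation.gtf") with a placeholder file name, yields one TSS per transcript record.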
Here is a simple pythonic way to use biomart:
import pybiomart as pbm
dataset = pbm.Dataset(name='hsapiens_gene_ensembl', host="http://sep2019.archive.ensembl.org/")
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
Below is what the annot results look like:
Chromosome/scaffold name  Transcription start site (TSS)  Strand  Gene name  Transcript type
MT                        577                             1       MT-TF      Mt_tRNA
MT                        648                             1       MT-RNR1    Mt_rRNA
MT                        1602                            1       MT-TV      Mt_tRNA
MT                        1671                            1       MT-RNR2    Mt_rRNA
MT                        3230                            1       MT-TL1     Mt_tRNA
...                       ...                             ...     ...        ...
chr1                      228416627                       -1      TRIM17     protein_coding
chr1                      228416652                       -1      TRIM17     protein_coding
...                       ...                             ...     ...        ...
You can find the TSS of every transcript of a given gene by querying BioMart.
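Once the per-transcript rows are in hand, collecting every TSS for one gene is a simple grouping step. A minimal sketch, assuming rows shaped like the BioMart output in this thread (chromosome, TSS, strand, gene name, transcript type); the function name is hypothetical and the sample rows are copied from that output:

```python
from collections import defaultdict


def tss_by_gene(rows):
    """Map each gene name to the sorted list of its distinct TSS positions."""
    grouped = defaultdict(set)
    for chrom, tss, strand, gene, biotype in rows:
        grouped[gene].add(int(tss))  # a set drops TSSs shared by several transcripts
    return {gene: sorted(positions) for gene, positions in grouped.items()}


rows = [
    ("MT", 577, 1, "MT-TF", "Mt_tRNA"),
    ("chr1", 228416627, -1, "TRIM17", "protein_coding"),
    ("chr1", 228416652, -1, "TRIM17", "protein_coding"),
]
print(tss_by_gene(rows)["TRIM17"])  # [228416627, 228416652]
```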
Seems that DBTSS doesn't work!
You can use Bioconductor, as shown in this post using GenomicRanges: https://support.bioconductor.org/p/46508/