Hi,
I want to extract the promoters of all protein-coding genes in the genome. My colleagues suggested using the GENCODE GTF annotation file.
However, looking at the file content, I didn't see a "promoter" feature defined anywhere.
For example, here's the GENCODE entry for the "TERT" gene: (GENCODE v43)
My naive approach would be to take, say, 1000 bp upstream of the "start" coordinate of the "gene" feature (first row) and assume that every distinct transcript of the gene has the same promoter. Is this a sensible thing to do, or am I completely wrong here?
Thank you so much!
You won't find a GTF of "official" human promoter regions. Most genome-wide promoter annotations are inferred from ChIP-seq studies looking at histone modifications and TF binding. There are some databases floating around that people have put together based on ChIP-seq data and/or other data.
To better understand how to advise you, can you elaborate on what you want to do with the promoter regions?
Thanks for your insights.
Are you familiar with Hi-C? What I want to do is find the genomic regions/fragments that interact with the promoter of every protein-coding gene.
In essence, the result would be similar to what I would get from Promoter Capture Hi-C (a variant of Hi-C that only identifies interactions involving gene promoters). Now I wonder: if promoters are inconsistently and incompletely annotated, how do people do Promoter Capture Hi-C? I thought that would require some "official" promoter for every gene.
Does your Hi-C really have the resolution for promoter-level analysis, which would be almost 1 kb? For starters, you could simply use the 1 kb upstream of every annotated TSS (on the respective strand).
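To make the strand handling concrete, here is a minimal Python sketch of the "1 kb upstream of the TSS" idea. It is only an illustration under some assumptions: the three GTF records are made up (not real GENCODE entries), the TSS is taken as the "start" of a "gene" row on the + strand and the "end" on the - strand, protein-coding genes are picked via the `gene_type "protein_coding"` attribute as GENCODE writes it, and the function name `promoter_windows` and its parameters are hypothetical. Output is BED-style (0-based, half-open).

```python
import csv
import io
import re

# Made-up GTF records for illustration only (hypothetical coordinates and names).
EXAMPLE_GTF = (
    'chr1\tHAVANA\tgene\t5000\t9000\t.\t+\t.\t'
    'gene_id "G1"; gene_type "protein_coding"; gene_name "GENE_PLUS";\n'
    'chr1\tHAVANA\tgene\t20000\t26000\t.\t-\t.\t'
    'gene_id "G2"; gene_type "protein_coding"; gene_name "GENE_MINUS";\n'
    'chr1\tHAVANA\tgene\t30000\t31000\t.\t+\t.\t'
    'gene_id "G3"; gene_type "lncRNA"; gene_name "GENE_NC";\n'
)


def promoter_windows(gtf_handle, upstream=1000, feature="gene"):
    """Yield BED-style (chrom, start, end, name, strand) upstream windows."""
    reader = csv.reader(gtf_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, header comments, and malformed records.
        if not row or row[0].startswith("#") or len(row) < 9:
            continue
        chrom, _, feat, start, end, _, strand, _, attrs = row[:9]
        if feat != feature or 'gene_type "protein_coding"' not in attrs:
            continue
        match = re.search(r'gene_name "([^"]+)"', attrs)
        name = match.group(1) if match else "."
        start, end = int(start), int(end)  # GTF is 1-based, inclusive
        if strand == "+":
            tss = start
            # Upstream bases are tss-upstream .. tss-1 (1-based).
            win_start, win_end = tss - upstream - 1, tss - 1
        else:
            tss = end
            # On the minus strand, "upstream" means higher coordinates.
            win_start, win_end = tss, tss + upstream
        # Clamp at the chromosome start; chromosome ends are not checked here.
        yield chrom, max(win_start, 0), win_end, name, strand


windows = list(promoter_windows(io.StringIO(EXAMPLE_GTF)))
for window in windows:
    print(*window, sep="\t")
```

Passing `feature="transcript"` instead would give one window per annotated transcript start rather than one per gene, which matters when the transcripts of a gene begin at different positions.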
The thing is, GENCODE doesn't annotate TSSs either.
Sooo by TSS, do you mean just the "start" ("end" for the negative strand) coordinate of the row labelled "gene" in the GTF file (the first row in the pic above)?
And yes, the Hi-C library has a resolution of exactly 1 kb.
Just out of interest, how many billion reads are necessary for 1 kb resolution?
Unfortunately, I'm not the one who made the Hi-C library (it was one of the postdocs in my lab), so I'm not familiar with that.
All I'm getting is clean interaction data.