Hello,
I have a question regarding the Ensembl Regulatory Build from this link. The regulation data there is in GTF/GFF format. Below is an example of the data I downloaded:
chromosome project Name feature start end score strand frame attr
10 Regulatory_Build promoter 73594 74193 . . . "Name=Promoter;ID=ENSR00000349338;activity=inactive;bound_start=73594;bound_end=74793;Note=Consists of following features: H3K4me3,H3K4me3,H3K4me2,DNase1,CTCF(MA0139.1),CTCF(MA0139.1),CTCF(MA0139.1),Rad21"
10 Regulatory_Build promoter 76194 76793 . . . "Name=Promoter;ID=ENSR00000349339;activity=inactive;bound_start=75994;bound_end=76993;Note=Consists of following features: CTCF(MA0139.1)"
My question is: how can I find out which gene that promoter region belongs to? I checked this location manually in IGV, but I don't know how to verify it. The result seems weird because there is a gene located in that promoter region. Does anyone have a suggestion? Thank you.
Thank you for your reply. I am a bit confused about the location of the promoter. From what I have read, the promoter should be upstream of the gene and not in the gene itself. So, as I imagine it, if a gene's transcription start is at 100,000 and its transcription end is at 101,000, the promoter should lie before 100,000 (with a 2000 bp range, the promoter is between 98,000 and 99,999). The gene GTF annotation does not overlap this interval. What is your opinion?
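The upstream-window arithmetic in that example can be written out as a minimal Python sketch. The function name promoter_window and the 1-based inclusive coordinate convention are assumptions made for illustration; the strand handling reflects that, for a gene on the minus strand, "upstream" means higher coordinates than the transcription start.

```python
def promoter_window(tss, strand, upstream=2000):
    """Return (start, end) of the window directly upstream of the TSS.

    Coordinates are assumed to be 1-based and inclusive. The TSS itself
    is not part of the window.
    """
    if strand == "+":
        # Upstream of a plus-strand gene means lower coordinates.
        return max(1, tss - upstream), tss - 1
    if strand == "-":
        # Upstream of a minus-strand gene means higher coordinates.
        return tss + 1, tss + upstream
    raise ValueError("strand must be '+' or '-'")


# Plus-strand gene with transcription start at 100,000
print(promoter_window(100_000, "+"))  # (98000, 99999)

# Same gene span on the minus strand: the TSS is the higher coordinate
print(promoter_window(101_000, "-"))  # (101001, 103000)
```

The first call reproduces the 98,000-99,999 interval from the example; the second shows why strand has to be taken into account before deciding which side of a gene is "upstream".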
In R, one would load an appropriate TxDb object and use the promoters() command to get the promoter intervals, thereafter using findOverlaps. For bedtools, one would first use BioMart to get the promoter intervals and then use bedtools intersect. One can think of a large number of other ways to do this.
Thank you. I used a different approach and would like your opinion on it. First, I downloaded the transcription start and end sites from BioMart. Then I calculated a "hypothetical" promoter region by taking 5 kb upstream of the start site and 5 kb downstream of the end site. Finally, I used bedtools to intersect these regions with the Ensembl regulatory regions, which gives me the gene and transcript name for each regulatory region. What do you think about that? Another question: I found several regulatory regions from Ensembl that are positioned inside a gene (overlapping an intron and/or an exon). Do you think this has some biological interpretation, such as inhibition of the transcription process, or is it just an artifact? I think the Ensembl regulatory data comes from ChIP-seq. Thank you for your comment.
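The flank-and-intersect procedure described above can be sketched in plain Python, without bedtools, to make the logic explicit. This is only an illustration under stated assumptions: the gene GENE_A and its coordinates are made up, the helper names (flanks, overlaps, assign) are hypothetical, and coordinates are taken to be 1-based and inclusive. The two regulatory features use the IDs and positions from the example data in the question.

```python
def flanks(gene, size=5000):
    """Yield (label, start, end) for the window before the gene start
    and the window after the gene end (1-based, inclusive)."""
    yield "before_start", max(1, gene["start"] - size), gene["start"] - 1
    yield "after_end", gene["end"] + 1, gene["end"] + size


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def assign(features, genes, size=5000):
    """Return (feature_id, gene_name, flank_label) for every feature
    that overlaps a flank of a gene on the same chromosome."""
    hits = []
    for f in features:
        for g in genes:
            if f["chrom"] != g["chrom"]:
                continue
            for label, start, end in flanks(g, size):
                if overlaps(f["start"], f["end"], start, end):
                    hits.append((f["id"], g["name"], label))
    return hits


# Regulatory features taken from the example rows in the question
features = [
    {"id": "ENSR00000349338", "chrom": "10", "start": 73594, "end": 74193},
    {"id": "ENSR00000349339", "chrom": "10", "start": 76194, "end": 76793},
]

# Hypothetical gene, invented purely for this illustration
genes = [{"name": "GENE_A", "chrom": "10", "start": 75000, "end": 80000}]

hits = assign(features, genes)
print(hits)  # [('ENSR00000349338', 'GENE_A', 'before_start')]
```

With these made-up gene coordinates, the first feature falls in the 5 kb window before the gene start, while the second lies inside the gene body and therefore matches neither flank. Note that the labels are positional only: for a gene on the minus strand, the "after_end" window is the one upstream of the transcription start.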
Your method should work fine as well. There are many ways to go about this, all giving the same results :)
Regarding the random binding events in genes, some of these might be functional, others not. There's a lot of random binding that leads to nothing (biology is noisy after all).
Thank you very much for your suggestion.