Question

Help understanding enhancer and promoter annotation

22 months ago

alejandrarodrigu21 • 0

Hello everyone,

I am creating this post because I feel a bit overwhelmed and confused regarding the concept of "annotation". I will provide some context so you can understand where my confusion comes from.

First of all, I understand that in bioinformatics, when you have a .bed file that contains information on chromosome, start and end positions, the process through which you would try and check if those regions are related to a gene, is called annotation.

Following that definition, I have been working on a paper where I started with some fastq files (bisulfite sequencing data), performed a methylation analysis using RnBeads, and selected the regions of interest based on a specific set of targets that we wanted to examine more thoroughly. I have taken these regions and converted them to the right format to be annotated (see the above definition of annotation) with the package "annotatr".

To provide some more technical details, what I have explicitly done is define which annotations I'd like to use by defining the variable "annots", then built annotations for hg19 (as this is the build we are meant to be using) and used the function "annotate_regions" to obtain a GRanges object with all the information from my original bed files and the corresponding annotated genes. I have provided a template piece of code below to show exactly the tools I am using for this annotation process:

library(annotatr)

# Annotation sets to build for hg19
annots <- c('hg19_cpgs', 'hg19_lncrna_gencode', 'hg19_genes_3UTRs', 'hg19_genes_5UTRs', 'hg19_enhancers_fantom')
annotations <- build_annotations(genome = 'hg19', annotations = annots)
# Read the regions of interest and intersect them with the annotations
data.file <- 'myfile.bed'
myregions <- read_regions(con = data.file, genome = 'hg19', format = 'bed')
intersectedregions <- annotate_regions(regions = myregions, annotations = annotations, ignore.strand = TRUE, quiet = FALSE)
dataframe <- data.frame(intersectedregions)

As I am trying to annotate enhancers and promoters, I have also used the function build_annotations to create my own custom annotation for promoters as the base access was not working:

annots_prom = build_gene_annots(genome='hg19', annotations = 'hg19_genes_promoters')
annots_promGR <- annots_prom$hg19_genes_promoters

After this, I used annotate_regions() as displayed before to obtain the promoter annotations.

Now, the issue I am encountering is that it does not seem like I can "define" the promoter regions or the enhancer regions. I have tried to look online, but I have not found any package in R that would allow me to do so. However, my supervisor mentioned that the promoter and enhancer annotation should use a window of -10/+10 kb.

After giving all this context: my confusion arises from the fact that I keep reading a lot about annotation and a "right" way of doing it. To my understanding, this shouldn't be too complex as a method per se, as the accuracy will depend on how good the database is at relating chromosome regions to specific genes, and not on how I am performing the overlap search. As I understand it, that is basically all it is: an overlap search between my .bed file and the corresponding database that has information for promoters, enhancers, lncRNAs, etc.

I do have a doubt though: I have only found this one package that allows me to annotate enhancers, and it uses FANTOM5. I am aware that there is a FANTOM6 already, but I cannot for the love of me figure out how I can use it instead. However, when I explore the result, it looks like it is identifying regions as enhancers but presents no associated genes whatsoever: if you explore the GRanges object, there are gene_id and symbol columns that are just full of "NA" values. I do not understand why there are no associated gene names for any of these so-called enhancer regions, and I am getting VERY confused about this.

Am I missing something here? Does annotation as a technique in bioinformatics include other subtleties that maybe I am not understanding?

Also, are there any tutorials on how to perform basic annotation using any package in R? I have searched but found nothing of interest, and I am not sure that my method is 100% correct, aside from the fact that I followed the 'annotatr' documentation and tweaked a couple of things here and there.

Finally, does ANYONE know how I can "define" these windows so that my regions are annotated within +/-10kb?

If you have read to the end - thank you for your time, and I would appreciate any ideas, comments, links or advice that you may have to tackle this issue.

enhancers GenomicRanges annotatr annotation • 2.5k views

10 months ago by alejandrarodrigu21 • 0

First of all, I understand that in bioinformatics, when you have a .bed file that contains information on chromosome, start and end positions, the process through which you would try and check if those regions are related to a gene, is called annotation.

Yes and no. Annotation covers a broader exercise: it essentially means adding known information appropriately to an unknown dataset based on some sort of overlap. More often than not, the overlap is in genomic positions. Most annotation operations involve one or more "ROD" files such as BED, VCF or plain TSV.

22 months ago by Ram 45k

Right, that makes sense. But within my specific case, where I have a .bed file and am interested in finding specific information using a database (here, finding out which of my regions are associated with promoters, which with enhancers, etc.), the definition I gave would apply, right? What about the rest of the question, though?

22 months ago by alejandrarodrigu21 • 0

score 2 · Answer 1 · 2023-07-13

First, annotation can have different contexts, so there may be some confusion there. Critically, what matters is what you are annotating. In general, when annotations are mentioned, they refer to genome annotations, which could be genes, enhancers, promoters, etc. However, if you want to annotate your bed file, then you may be adding information on the nearest gene (an annotation of the annotations, if you will).

However, I think you are correct in thinking that annotations are very simple ideas. It essentially all relies on people defining regions of the genome as genes or enhancers, or defining what type of gene it is or which enhancer it interacts with. These definitions can come from experimental evidence or computational prediction.

Your task at hand should be very simple (easy in the sense you mentioned, where all you are doing is asking, "do these genomic regions overlap, if yes, label with corresponding gene/enhancer"), but in my experience when you don't know exactly where to look the task can be relatively difficult.

To accomplish your task, I would be most comfortable using bedtools on the command line. I am not very familiar with using R for this, but have you looked into GenomicRanges functions like nearest or findOverlaps?
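As a rough illustration of those two GenomicRanges functions, here is a minimal sketch on toy data. The coordinates and the gene names (geneA, geneB, geneC) are invented for the example and are not real hg19 annotations; in practice the `regions` object would come from the BED file and `features` from whatever annotation source is used.

```r
library(GenomicRanges)

# Toy query regions (stand-ins for the rows of a BED file).
regions <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 5000, 300), end = c(200, 5100, 400))
)

# Toy annotation features; names and coordinates are made up.
features <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(150, 9000, 1000), end = c(400, 9500, 2000)),
  symbol   = c("geneA", "geneB", "geneC")
)

# Overlap search: which query regions intersect which features?
hits <- findOverlaps(regions, features, ignore.strand = TRUE)
overlap_table <- data.frame(
  region = queryHits(hits),
  symbol = features$symbol[subjectHits(hits)]
)

# Nearest feature for every region, whether or not they overlap.
# nearest() returns NA for a region with no feature on its chromosome.
nearest_idx <- nearest(regions, features, ignore.strand = TRUE)
regions$nearest_symbol <- features$symbol[nearest_idx]

print(overlap_table)
print(regions)
```

The overlap table only lists regions that physically intersect a feature, while `nearest` gives every region a label where possible, so the two answer slightly different questions.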

Another, very roundabout, idea could be to just expand your annotations by 10kb both ways, then run your commands. This would allow any annotation within 10kb to overlap with your region of interest and be detected by annotate_regions. However, I am not sure how multiple overlaps are handled or how you desire them to be handled.
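A minimal sketch of that expansion idea with GenomicRanges, again on invented coordinates and gene names. It assumes a symmetric 10 kb window is what is wanted; it does not show how annotatr itself would treat the widened object, only the plain overlap logic.

```r
library(GenomicRanges)

# Toy regions of interest and toy annotation features (made-up values).
regions <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(50000, 200000), end = c(50500, 200500))
)
features <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(42000, 230000), end = c(43000, 231000)),
  symbol   = c("geneA", "geneB")
)

window <- 10000

# Adding a number to a GRanges widens every range by that amount on both sides.
features_expanded <- features + window

# A region now "hits" a feature if it lies within 10 kb of it.
hits <- findOverlaps(regions, features_expanded, ignore.strand = TRUE)
regions$n_within_10kb <- countOverlaps(regions, features_expanded,
                                       ignore.strand = TRUE)

# One row per (region, feature) pair, so multiple hits stay visible.
pairs <- data.frame(
  region = queryHits(hits),
  symbol = features_expanded$symbol[subjectHits(hits)]
)
print(pairs)
```

Keeping one row per pair leaves the decision about multiple overlaps to a later step (for example, keeping only the closest feature). Features near the start of a chromosome can end up with start coordinates below 1 after widening, so they may need clipping.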

I do not understand why there are no associated gene names for any of these so-called enhancer regions, and I am getting VERY confused about this.

We do not always know which gene the enhancers are associated with. Many times, you may annotate them with the nearest gene, or vice-versa, or have some other experimental evidence of enhancer-promoter association. Also, sometimes enhancers are annotated with the gene name they overlap, but many enhancers have no gene overlap.
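If labelling enhancers with the nearest gene is acceptable, one possible sketch uses `distanceToNearest` from GenomicRanges and only keeps the label when the gene is within a chosen cutoff. The enhancer and gene coordinates, the gene names, and the 10 kb cutoff are all assumptions for illustration; the nearest gene is a heuristic, not evidence of a real enhancer-gene link.

```r
library(GenomicRanges)

# Toy enhancers without gene labels, and toy genes (made-up values).
enhancers <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1000, 80000, 500000), end = c(1500, 80400, 500300))
)
genes <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(5000, 60000), end = c(9000, 70000)),
  strand   = c("+", "-"),
  symbol   = c("geneA", "geneB")
)

max_dist <- 10000  # assumed cutoff; adjust to the window you need

# For each enhancer, find the closest gene and the distance to it.
d <- distanceToNearest(enhancers, genes, ignore.strand = TRUE)

enhancers$nearest_gene <- NA_character_
enhancers$distance     <- NA_integer_
enhancers$nearest_gene[queryHits(d)] <- genes$symbol[subjectHits(d)]
enhancers$distance[queryHits(d)]     <- mcols(d)$distance

# Drop the label when the nearest gene is farther away than the cutoff.
too_far <- !is.na(enhancers$distance) & enhancers$distance > max_dist
enhancers$nearest_gene[too_far] <- NA_character_

print(enhancers)
```

Enhancers left with NA here are simply those with no gene inside the cutoff, which mirrors why an enhancer annotation can legitimately have empty gene columns.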