Hello everyone,
I am creating this post because I feel a bit overwhelmed and confused about the concept of "annotation". I will provide some context so you can understand where my confusion comes from.
First of all, I understand that in bioinformatics, when you have a .bed file containing chromosome, start, and end positions, the process of checking whether those regions are related to a gene is called annotation.
Following that definition, I have been working on a paper where I started with some fastq files (bisulfite sequencing data), performed a methylation analysis using RnBeads, and selected regions of interest based on a specific set of targets that we wanted to examine more thoroughly. I then converted these regions into the right format to be annotated (see the definition above) with the package "annotatr".
To provide some more technical details: I defined which annotations I'd like to use in the variable "annots", built those annotations for hg19 (as this is the build we are meant to be using), and used the function "annotate_regions" to obtain a GRanges object with all the information from my original bed files and the corresponding annotated genes. I have provided a template piece of code below to show exactly the tools I am using for this annotation process:
library(annotatr)
# Annotation sets to build for hg19
annots <- c('hg19_cpgs', 'hg19_lncrna_gencode', 'hg19_genes_3UTRs', 'hg19_genes_5UTRs', 'hg19_enhancers_fantom')
annotations <- build_annotations(genome = 'hg19', annotations = annots)
# Read the regions of interest from the BED file
data.file <- 'myfile.bed'
myregions <- read_regions(con = data.file, genome = 'hg19', format = 'bed')
# Intersect the regions with the annotations, ignoring strand
intersectedregions <- annotate_regions(regions = myregions, annotations = annotations, ignore.strand = TRUE, quiet = FALSE)
dataframe <- data.frame(intersectedregions)
As I am trying to annotate enhancers and promoters, I have also used the function build_annotations to create my own custom annotation for promoters as the base access was not working:
annots_prom = build_gene_annots(genome='hg19', annotations = 'hg19_genes_promoters')
annots_promGR <- annots_prom$hg19_genes_promoters
After this, I used annotate_regions() as displayed before to obtain the promoter annotations.
Now, the issue I am encountering is that it does not seem like I can "define" the promoter or enhancer regions myself. I have looked online but have not found any R package that would allow me to do so. However, my supervisor mentioned that the promoter and enhancer annotation should be -10/+10 kb.
After giving all this context, my confusion arises from the fact that I keep reading a lot about annotation and a "right" way of doing it. To my understanding, this shouldn't be too complex as a method per se: the accuracy will depend on how good the database is at relating chromosome regions to specific genes, and not on how I am performing the overlap search. As I understand it, that is all it is, basically: an overlap search between my .bed file and the corresponding database that has information for promoters, enhancers, lncRNAs, etc.
I do have a doubt though: I have only found this one package that allows me to annotate enhancers, and it is using fantom5. I am aware that there is a fantom6 already, but I cannot for the love of me figure out how to use it instead. Also, when I explore the result, it looks like it is identifying regions as enhancers but presents no associated genes whatsoever. If you explore the GRanges object, there is a gene_id and a symbol column that are just full of "NA" values. I do not understand why there are no gene names associated with any of these so-called enhancer regions, and I am getting VERY confused about this.
Am I missing something here? Does annotation as a technique in bioinformatics include other subtleties that maybe I am not understanding?
Also, are there any tutorials on how to perform basic annotation using any package in R? I have searched but found nothing of interest, and I am not sure that my method is 100% correct, aside from the fact that I followed the 'annotatr' documentation and tweaked a couple of things here and there.
Finally, does ANYONE know how I can "define" these windows so that my regions are annotated to +/-10kb?
If you have read to the end - thank you for your time, and I would appreciate any ideas, comments, links or advice that you may have to tackle this issue.
Yes and No. Annotation covers a broader exercise - it essentially means adding known information appropriately to an unknown dataset based on some sort of overlap. More often than not, the overlap is in genomic positions. Most annotation operations involve one or more "ROD" files such as BED, VCF or plain TSV.
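To make the "adding known information based on overlap" idea concrete, here is a minimal sketch using GenomicRanges. All coordinates, the region_id values, and the feature_type labels are made up for illustration; in practice the "known" side would come from an annotation database rather than being typed in by hand.

```r
library(GenomicRanges)

# Toy "unknown" dataset: three regions standing in for rows of a BED file
# (hypothetical coordinates and IDs).
regions <- GRanges(
  seqnames  = c("chr1", "chr1", "chr2"),
  ranges    = IRanges(start = c(100, 5000, 300), end = c(200, 5100, 400)),
  region_id = c("r1", "r2", "r3")
)

# Toy "known" information: features carrying a label (hypothetical).
features <- GRanges(
  seqnames     = c("chr1", "chr2"),
  ranges       = IRanges(start = c(150, 350), end = c(1000, 900)),
  feature_type = c("promoter", "enhancer")
)

# The annotation step itself: find which regions overlap which features.
hits <- findOverlaps(regions, features, ignore.strand = TRUE)

# Attach the known label to each overlapping region.
annotated <- data.frame(
  region_id    = regions$region_id[queryHits(hits)],
  feature_type = features$feature_type[subjectHits(hits)],
  stringsAsFactors = FALSE
)
print(annotated)
```

Regions that overlap nothing (r2 here) simply do not appear in the result, which is why the quality of the "known" side matters more than the overlap call itself.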
Right, that makes sense - but then within my specific case where I have a .bed file and am interested in finding specific information using a database, in this case find out which of my regions are associated with promoters, which with enhancers, etc. - the definition I gave would apply, right? What about the rest of the question, though?
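On the +/-10kb part of the question, one possible approach (a sketch, not necessarily what annotatr does internally) is to build the windows yourself with promoters() from GenomicRanges, which extends a strand-aware window around each transcription start site, and then run the overlap against those windows. The gene coordinates, symbols, and region IDs below are hypothetical toy values; real gene models would normally come from a TxDb or similar source.

```r
library(GenomicRanges)

# Toy gene models (hypothetical); strand decides which end is the TSS.
genes <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges   = IRanges(start = c(20000, 30000), end = c(25000, 40000)),
  strand   = c("+", "-"),
  symbol   = c("GENE_A", "GENE_B")
)

# Window of 10 kb upstream and 10 kb downstream of each TSS.
tss_windows <- promoters(genes, upstream = 10000, downstream = 10000)

# Toy regions of interest (hypothetical), standing in for the BED file.
regions <- GRanges(
  seqnames  = c("chr1", "chr1", "chr2"),
  ranges    = IRanges(start = c(12000, 50000, 45000),
                      end   = c(12500, 50500, 45500)),
  region_id = c("r1", "r2", "r3")
)

# Overlap the regions with the custom windows and report the gene symbol.
hits <- findOverlaps(regions, tss_windows, ignore.strand = TRUE)
window_hits <- data.frame(
  region_id = regions$region_id[queryHits(hits)],
  symbol    = tss_windows$symbol[subjectHits(hits)],
  stringsAsFactors = FALSE
)
print(window_hits)
```

The window size is then just the upstream and downstream arguments, so changing the definition of "promoter" is a one-line change. Whether the same window makes sense for enhancers is a separate question, since enhancer annotations are usually defined as regions in their own right rather than relative to a TSS.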