Question

RRBS ANNOTATION

3.6 years ago

GiV17 ▴ 50

Hi all, I have a question: I have RRBS data that I have mapped to hg38 (GENCODE version 37). Now I need to annotate my data, but how?

Where can I download the correct hg38 (GENCODE 37) .bed file?
Which software is best for performing the annotation? Can you help me? Thanks, all. G

NGS ANNOTATION RRBS • 1.9k views

ADD COMMENT • link 3.6 years ago by GiV17 ▴ 50

In R/Bioconductor, you can 1) access organism annotation packages very easily and 2) use them to annotate your genomic sites. A very easy package to use is ChIPseeker. It was developed for ChIP-seq, but it can annotate anything (in your case, intervals of 1 bp which are the CpGs). If you save your data and read it as peaks, a single function (annotatePeak) will annotate your data. Check the vignette and examples, and be aware of how the annotation is done for your case (priority in selecting overlapping annotations, or if CpGs are mapping to the closest gene, no matter the distance, etc.). I've also used annotatr which is also very convenient and comes with pre-built annotations.
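To make the "intervals of 1 bp" idea concrete, here is a minimal sketch in plain Python of turning CpG positions into BED-style intervals that a peak annotator could read. The input layout (tab-separated chromosome and 1-based position) and the function name `cpg_to_bed` are assumptions for illustration; check what your methylation caller actually writes.

```python
# Illustrative only: turn CpG positions into 1-bp BED-style intervals.
# Assumes input lines of "chrom<TAB>position" with 1-based positions;
# adjust to whatever your methylation caller actually writes.

def cpg_to_bed(line):
    """Convert a 'chrom<TAB>pos' line into a 'chrom<TAB>start<TAB>end' BED line."""
    chrom, pos = line.split()[:2]
    pos = int(pos)
    # BED uses 0-based starts and exclusive ends, so a single base at
    # 1-based position p becomes the interval [p - 1, p).
    return f"{chrom}\t{pos - 1}\t{pos}"

# Made-up example positions.
cpgs = ["chr1\t10469", "chr1\t10471"]
bed_lines = [cpg_to_bed(c) for c in cpgs]
print("\n".join(bed_lines))
```

If your positions are already 0-based, the subtraction should be dropped; getting this wrong shifts every CpG by one base.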

ADD REPLY • link 3.6 years ago by Papyrus ★ 3.0k

Papyrus, thank you so much for your help and your comprehensive explanation. However, I have a question. What do you mean by: "... and be aware of how the annotation is done for your case (priority in selecting overlapping annotations, or if CpGs are mapping to the closest gene, no matter the distance, etc.)"? Can you explain it in more detail? Thank you

ADD REPLY • link 3.6 years ago by GiV17 ▴ 50

Yes, this relates to a "problem" which does not have a unique solution, and is usually solved depending on the specific project. When we define genomic features (i.e. places in the genome which we state have some specific function, such as a promoter), we may have overlapping features.

For example, a certain region of a gene which codes for multiple transcripts may be an exon for one of the transcripts but an intron for another.

Thus, when you annotate your regions (e.g. your CpG locations), either you "pick" one of the annotations or specify all of them (e.g. in a comma-separated list). The problem is that if you choose the second option, you will have problems when plotting or performing subsequent enrichment analyses, etc.

If you check the ChIPseeker guide, you'll see the priority that the package uses (in most analyses I would use the default, mainly because otherwise you would have to find a reasonable justification). Often the "priority" is simply an arbitrary choice of the first annotation in the list.
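As a toy illustration of the two options (pick one annotation by priority, or list all of them), here is a minimal Python sketch. It is not ChIPseeker's implementation: the `Feature` type, the `PRIORITY` order, and the coordinates are all made up for the example.

```python
# Illustrative only: a toy priority rule, not ChIPseeker's implementation.
# Feature coordinates and the priority order below are made-up assumptions.
from typing import NamedTuple

class Feature(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    kind: str    # e.g. "promoter", "exon", "intron"

# Hypothetical priority: earlier in the list wins when features overlap.
PRIORITY = ["promoter", "exon", "intron"]

def overlapping_kinds(chrom, pos, features):
    """Return every known feature kind covering the position, in priority order."""
    hits = {f.kind for f in features
            if f.chrom == chrom and f.start <= pos < f.end}
    return [k for k in PRIORITY if k in hits]

def annotate(chrom, pos, features, keep_all=False):
    """Pick the top-priority kind, or join all kinds with commas."""
    kinds = overlapping_kinds(chrom, pos, features)
    if not kinds:
        return "intergenic"
    return ",".join(kinds) if keep_all else kinds[0]

features = [
    Feature("chr1", 1000, 2000, "intron"),  # intron of one transcript
    Feature("chr1", 1500, 1600, "exon"),    # exon of another transcript
]

print(annotate("chr1", 1550, features))                 # exon
print(annotate("chr1", 1550, features, keep_all=True))  # exon,intron
print(annotate("chr1", 5000, features))                 # intergenic
```

The single-label output is easy to count and plot; the comma-separated one keeps all the information but needs splitting or deduplication before any enrichment analysis.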

The other issue (distance to annotations, e.g. distance to genes) is that when you annotate, you may have a rule stating "if my location is within X bp of this feature, I will annotate it to that feature". So this is also something to at least be aware of. For example (this is only an example, I don't remember ChIPseeker's specifics), ChIPseeker may annotate locations to the "closest" gene even if that gene is 100,000 bp away, but you may think that this is not "biologically reasonable" for your specific project.
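The distance rule can be sketched the same way. This is a hypothetical nearest-TSS assignment with an optional cutoff, written in plain Python; the gene names, coordinates, and the `max_distance` parameter are assumptions for illustration and do not describe how ChIPseeker or annotatr behave.

```python
# Illustrative only: nearest-TSS assignment with an optional distance cutoff.
# Gene names and coordinates are made up; this is not ChIPseeker's code.

def nearest_gene(pos, tss_by_gene, max_distance=None):
    """Return (gene, distance) for the closest TSS on the same chromosome,
    or (None, distance) if the closest one is farther than max_distance."""
    gene, tss = min(tss_by_gene.items(), key=lambda item: abs(item[1] - pos))
    distance = abs(tss - pos)
    if max_distance is not None and distance > max_distance:
        return None, distance
    return gene, distance

# Hypothetical TSS positions on one chromosome.
tss_by_gene = {"GENE_A": 10_000, "GENE_B": 250_000}

print(nearest_gene(150_000, tss_by_gene))                       # ('GENE_B', 100000)
print(nearest_gene(150_000, tss_by_gene, max_distance=10_000))  # (None, 100000)
```

Without a cutoff, the CpG is assigned to a gene 100,000 bp away; with one, it is left unassigned. Whichever rule you use, state it in your methods.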

ADD REPLY • link 3.6 years ago by Papyrus ★ 3.0k

Ah, OK, I understand. Thanks for your great explanation. However, I have lists of CpG positions (.txt files) for several samples (3 replicates for each of 2 conditions) and the results of the differential analysis between them. What would be my best or most correct choice? I'm considering annotatr, what do you think? And what would be the best code to use in my case? Thank you

ADD REPLY • link 3.6 years ago by GiV17 ▴ 50

Well, that really depends on the specifics of your project, the analyses that will be performed downstream, etc., and there is no one true solution. In general, the safest route is to stick to well-known packages and their defaults, and to state clearly in your methods which ones you used. Check the user guides and the code and file examples provided there.