Hi, all.
I am analyzing Illumina EPIC array DNA methylation data in R, using a linear model to find significant CpGs. After processing the data, I obtained significant CpGs (FDR < 0.05) annotated with UCSC_RefGene_Group (using the IlluminaHumanMethylationEPICanno.ilm10b4.hg19 package). I would like to know how the annotated genomic regions are distributed across my significant probes, of which I now have thousands. To my knowledge, each probe should be annotated with one of the following genomic regions from the package above: 1stExon, 3'UTR, 5'UTR, Body, ExonBnd (exon boundary), TSS200, or TSS1500 (e.g. cg0144285 - TSS200). If that were the case, I could see the distribution by drawing a bar graph of the number of significant probes in each genomic region.
However, my data look different: a single probe can have one or more genomic regions listed in UCSC_RefGene_Group. Below is an example image of the data structure. In my EPIC array data annotated with IlluminaHumanMethylationEPICanno.ilm10b4.hg19, more than one genomic region is listed for a single probe.
In this case, I counted every genomic region listed across all probes and divided each count by the total number of probes to get the frequency of each genomic region. However, because a probe with several annotations is counted more than once, this may not reflect the actual distribution of genomic regions per probe. Could anyone help me solve this problem?
Thank you in advance.
I'm not sure I can answer your question, but if you want to try a different tool for annotating methylation probes according to the UCSC Genome Browser, there's methylize: https://life-epigenetics-methylize.readthedocs-hosted.com/en/latest/docs/diff_meth_regions.html
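On the counting problem itself, here is a rough sketch of two ways to avoid counting a multi-annotated probe more than once. It is only an illustration, not the annotation package's own method. It assumes that the entries in UCSC_RefGene_Group are separated by ";" and that an empty field means the probe has no gene-region annotation. The probe IDs, the data, and the function names are made up for the example.

```python
from collections import Counter

# Hypothetical example data: probe ID -> UCSC_RefGene_Group string.
# Assumption: multiple annotations are separated by ";" and an empty
# string means the probe has no gene-region annotation.
annotations = {
    "cg_example_1": "TSS200",
    "cg_example_2": "Body;Body;TSS1500",
    "cg_example_3": "5'UTR;1stExon",
    "cg_example_4": "",
}

def unique_regions(group_field):
    """Return the distinct regions listed in one UCSC_RefGene_Group field."""
    regions = {r for r in group_field.split(";") if r}
    return regions or {"Unannotated"}

def per_probe_counts(annotations):
    """Count each region at most once per probe."""
    counts = Counter()
    for field in annotations.values():
        counts.update(unique_regions(field))
    return counts

def fractional_counts(annotations):
    """Give each probe a total weight of 1, split evenly across its distinct regions."""
    weights = Counter()
    for field in annotations.values():
        regions = unique_regions(field)
        for region in regions:
            weights[region] += 1 / len(regions)
    return weights

n_probes = len(annotations)

# Convention 1: fraction of probes annotated to each region (can sum to more than 1).
probe_fraction = {r: c / n_probes for r, c in per_probe_counts(annotations).items()}

# Convention 2: each probe contributes exactly 1 in total (fractions sum to 1).
weighted_fraction = {r: w / n_probes for r, w in fractional_counts(annotations).items()}

print(probe_fraction)
print(weighted_fraction)
```

The first convention answers "what fraction of my significant probes touch region X", so a probe in both Body and TSS1500 counts toward both bars. The second keeps the bars summing to 1, which is closer to a distribution "per probe". Which one is appropriate depends on the question being asked. The same logic should carry over to R using strsplit() and unique() on the UCSC_RefGene_Group column.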