Question

EPIC array data - How to find the total number of genes in the array?

3.6 years ago

SSP • 0

Hi, I am working with some DNA methylation data and have a few questions about gene annotation.

1) After preprocessing and quality control, our final data set consists of 760 500 probes rather than the full 850 000. How do I find the total number of genes represented on the EPIC array after this filtering?

2) If I have a list of genes of interest, how do I find out whether they are covered by the EPIC array (i.e. have at least one probe annotated to them)?

3) If I have a list of genes, how do I find the total number of probes annotated to them?

The reason I am asking is that I want to perform a Fisher's exact test or a chi-square test. If I want to test whether the number of differentially methylated CpGs annotated to genes associated with, for example, cancer is higher than expected by chance, is it correct to use the number of CpGs/probes or the number of genes?

Let me add that this is really not my field and I have only had an introductory course in R so far. Very grateful for any good advice and tips!

Square array EPIC methylation exact DNA Chi Fisher • 2.2k views


score 0 · Answer 1 · 2021-11-08

3.6 years ago

prasundutta87 ▴ 720

I will reply on the genes part. EPIC arrays come with an Illumina-specific manifest file that lists, in a separate column, the genes associated with each CpG locus. You can find the manifest here: https://emea.support.illumina.com/array/array_kits/infinium-methylationepic-beadchip-kit/downloads.html in the "Infinium MethylationEPIC Product Files" section. Check out the "UCSC_RefGene_Name" column. Hope this helps.
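As a rough illustration of reading that column, here is a minimal Python sketch using only the standard library (the same logic carries over to R). It rests on several assumptions that should be checked against the actual file: that the CSV starts with a few preamble lines before a column header row beginning with "IlmnID", that gene names within "UCSC_RefGene_Name" are separated by semicolons, and that a trailing section such as "[Controls]" may follow the probe rows. The function name `read_manifest_genes`, the sample text, and the gene symbols are all made up for the example.

```python
import csv
import io


def read_manifest_genes(handle):
    """Map each probe ID to the set of gene symbols in UCSC_RefGene_Name."""
    # Skip preamble lines until the column header row (assumed to start with "IlmnID").
    for line in handle:
        if line.startswith("IlmnID"):
            header = next(csv.reader([line]))
            break
    else:
        raise ValueError("no header row starting with 'IlmnID' found")

    probe_genes = {}
    for row in csv.DictReader(handle, fieldnames=header):
        probe_id = row["IlmnID"]
        if probe_id.startswith("["):
            # Assumed start of a trailing non-probe section, e.g. "[Controls]".
            break
        names = row.get("UCSC_RefGene_Name") or ""
        # A gene can be repeated (one entry per transcript); a set removes repeats.
        probe_genes[probe_id] = {g for g in names.split(";") if g}
    return probe_genes


# Tiny made-up stand-in for the manifest; the real file has many more columns.
sample = """Illumina, Inc.
[Heading]
Descriptor File Name,hypothetical_manifest.csv
[Assay]
IlmnID,Name,CHR,UCSC_RefGene_Name
cg00000001,cg00000001,1,GENEA;GENEA
cg00000002,cg00000002,2,GENEB;GENEC
cg00000003,cg00000003,3,
[Controls]
12345,STAINING,Red
"""

probe_genes = read_manifest_genes(io.StringIO(sample))
for probe, genes in sorted(probe_genes.items()):
    print(probe, sorted(genes))
```

With the downloaded manifest, the in-memory sample would be replaced by something like `open(path, newline="")`, where `path` points to wherever the CSV was saved.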


Thank you so much for taking the time to answer. We have the annotation for all significant DMPs; however, how do we find the total number of genes?

Reply • 3.6 years ago by SSP • 0

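I think the total number of genes can be obtained easily by counting the unique genes present in the "UCSC_RefGene_Name" column of the manifest file. To get the number after filtering, first restrict the manifest to the probes you retained. The dplyr package in R will be helpful for this. You will have to split the gene names first, because a CpG locus annotated to several genes or transcripts lists them all in one field (separated by semicolons in the manifest).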
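The counting itself could look roughly like the Python sketch below (standard library only; the same steps translate to dplyr). It assumes the gene names have already been split into one set per probe, and every probe ID, gene symbol, and variable name here is invented for illustration. It covers the three counts asked about: unique genes among the retained probes, which genes of interest are covered, and how many probes are annotated to them.

```python
from collections import defaultdict

# Hypothetical probe -> gene annotation, already split into sets of symbols.
probe_genes = {
    "cg00000001": {"GENEA"},
    "cg00000002": {"GENEB", "GENEC"},
    "cg00000003": set(),        # probe with no gene annotation
    "cg00000004": {"GENEA"},
    "cg00000005": {"GENED"},    # removed during QC in this toy example
}

# Hypothetical set of probes that survived preprocessing and QC.
retained_probes = {"cg00000001", "cg00000002", "cg00000003", "cg00000004"}

# Invert the mapping to gene -> probes, using only the retained probes.
gene_probes = defaultdict(set)
for probe in retained_probes:
    for gene in probe_genes.get(probe, ()):
        gene_probes[gene].add(probe)

# 1) Total number of genes with at least one retained probe.
total_genes = len(gene_probes)

# 2) Which genes of interest are covered after filtering?
genes_of_interest = ["GENEA", "GENED", "GENEX"]
covered = [g for g in genes_of_interest if g in gene_probes]
missing = [g for g in genes_of_interest if g not in gene_probes]

# 3) Number of distinct retained probes annotated to the covered genes.
probes_in_list = set().union(*(gene_probes[g] for g in covered))

print("genes on filtered array:", total_genes)
print("covered:", covered, "missing:", missing)
print("probes annotated to genes of interest:", len(probes_in_list))
```

Counting distinct probes in step 3 avoids counting a probe twice when it is annotated to more than one gene in the list.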

Reply • 3.6 years ago by prasundutta87 ▴ 720