Hello Biostars community,
I am currently working on a methylation analysis using data obtained from an Illumina EPIC array. The samples underwent pre-processing and quality control by a company, which provided us with two txt files containing the M-values and beta values, respectively. However, I encountered an issue when trying to annotate the data with gene information using the ChIPseeker package.
The structure of the data in the txt file is as follows (showing only the first 2 lines out of 17,000):
island patient1 patient2 patient3 control1 control2 control3
cg14817997 3.09546 2.11941 2.90160 1.91566 2.45522 2.36514
To group the data correctly, I used the following R code:
df <- data.frame(test = 1, d = c(rep('patient', 3), rep('control', 3)))
df$d <- factor(df$d, levels = c('patient', 'control'))
Then, to identify differentially methylated positions (DMPs), I used the dmpFinder function from the minfi package with the following code:
mat <- as.matrix(dataset)
dmp <- dmpFinder(mat, pheno = df$d, type = "categorical")
The resulting dmp dataframe contains columns with the cg codes, p-values, q-values, intercept, and f-values.
Next, I attempted to use the ChIPseeker package to annotate the DMPs with gene information. As a first step, I added the chromosome and position of each probe with the following code:
# Saving the annotation of the array used (EPIC) in 'annEPIC'
annEPIC <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
# Finding corresponding rows in the annotation dataframe
indices <- match(dmp$Name, annEPIC$Name)
# Adding new columns to 'dmp' using these indices
dmp$chr <- annEPIC$chr[indices]
dmp$pos <- annEPIC$pos[indices]
The problem I encountered is that out of my 797,289 rows, 599 of them do not have any assigned chromosome (chr) or position information (pos). I am curious to know if anyone understands the reason behind this issue. Could it be possible that these particular cg sites do not exist in the EPIC array, leading to the absence of chromosome and position data? Or are there other reasons that could explain this data loss?
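A minimal sketch of how the unannotated probes could be listed, relying on the fact that `match()` returns `NA` for any ID that is absent from the lookup table. The `annEPIC` and `dmp` objects below are toy stand-ins with made-up IDs and coordinates, not the real data, and the `Name` column in `dmp` is assumed to hold the cg identifiers:

```r
# Toy stand-ins for the real objects; IDs and coordinates are made up.
annEPIC <- data.frame(
  Name = c("cg00000001", "cg00000002", "cg00000003"),
  chr  = c("chr1", "chr2", "chrX"),
  pos  = c(1000L, 2000L, 3000L),
  stringsAsFactors = FALSE
)
dmp <- data.frame(
  Name = c("cg00000002", "cg99999999", "cg00000001"),
  pval = c(0.01, 0.02, 0.03),
  stringsAsFactors = FALSE
)

# match() returns NA for every probe ID that is absent from the annotation
indices <- match(dmp$Name, annEPIC$Name)
dmp$chr <- annEPIC$chr[indices]
dmp$pos <- annEPIC$pos[indices]

# Probe IDs that have no entry in the annotation
missing_ids <- dmp$Name[is.na(indices)]
n_missing   <- length(missing_ids)
print(missing_ids)
```

Inspecting `missing_ids` directly makes it easier to check those identifiers against other array manifests.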
I would greatly appreciate any insights or suggestions to help me resolve this matter. Thank you in advance for your assistance!
Best regards, Irene
Could you share some examples of those probe IDs?
Here are some of the probe IDs that I cannot find: cg25103802 cg21649660 cg04023162 cg15414698 cg24497877 cg01023902 cg19296556 cg23303415 cg24630373 cg04147158 cg13271783 cg16080958 cg23032316
Thank you for your time.
There seems to be a mix of probes that exist only on EPICv2 or only on the 450k array. Maybe it is a custom EPIC array? I see on the Illumina site that it is possible to add 3,000-10,000 probes as a custom selection on EPICv2. Otherwise, if you download the EPICv2 and 450k annotations, you may be able to retrieve the annotation for some of those probes (it seems to work on the examples you provided, but there is no guarantee that all of them will be found).
I tried to look up the probes I couldn't find in the 450k annotation, but I can't find them there either. How did you find them? Using the same code as before, but with the 450k annotation instead of the EPIC one, I still can't find those probes.
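A rough sketch of that lookup, checking each probe against several annotation tables in turn and recording where it was first found. The three tables below are toy stand-ins with made-up IDs, and `lookup_probe_source` is a hypothetical helper, not a minfi function. In practice each table would presumably come from `getAnnotation()` on the corresponding annotation package, and it is worth checking which column holds the plain cg identifier, since EPICv2 probe IDs can carry a suffix:

```r
# Toy annotation tables (made-up IDs); real ones would come from the
# EPIC, 450k and EPICv2 annotation packages (assumption).
ann_list <- list(
  EPIC   = data.frame(Name = c("cg00000001", "cg00000002"),
                      chr = c("chr1", "chr2"), pos = c(100L, 200L),
                      stringsAsFactors = FALSE),
  `450k` = data.frame(Name = c("cg00000002", "cg00000003"),
                      chr = c("chr2", "chr3"), pos = c(200L, 300L),
                      stringsAsFactors = FALSE),
  EPICv2 = data.frame(Name = c("cg00000004"),
                      chr = c("chr4"), pos = c(400L),
                      stringsAsFactors = FALSE)
)

# Hypothetical helper: for each ID, report the first table that contains it.
lookup_probe_source <- function(ids, ann_list) {
  src <- rep(NA_character_, length(ids))
  chr <- rep(NA_character_, length(ids))
  pos <- rep(NA_integer_, length(ids))
  for (nm in names(ann_list)) {
    ann   <- ann_list[[nm]]
    todo  <- which(is.na(src))          # IDs not yet found in an earlier table
    idx   <- match(ids[todo], ann$Name) # NA where the ID is absent
    found <- !is.na(idx)
    src[todo[found]] <- nm
    chr[todo[found]] <- ann$chr[idx[found]]
    pos[todo[found]] <- ann$pos[idx[found]]
  }
  data.frame(Name = ids, source = src, chr = chr, pos = pos,
             stringsAsFactors = FALSE)
}

probes <- c("cg00000001", "cg00000003", "cg00000004", "cg99999999")
res <- lookup_probe_source(probes, ann_list)
print(res)
```

Probes whose `source` is still `NA` after all tables have been tried are the ones that would need a custom manifest from the array provider.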
Thank you for your help.
Using the code you provided, I managed to find some of the probes you listed: