Question

Methylation Analysis - Missing Chromosome and Position Information for some CpG sites


16 months ago

Irene • 0

Hello Biostars community,

I am currently working on a methylation analysis using data obtained from an Illumina EPIC array. The DNA samples underwent pre-processing and quality control by a company, which provided us with two .txt files containing the M-values and beta values, respectively. However, I encountered an issue when trying to annotate the data with gene information using the ChIPseeker package.

The structure of the data in the txt file is as follows (showing only the first 2 lines out of 17,000):

island patient1 patient2 patient3 control1 control2 control3
cg14817997 3.09546 2.11941 2.90160 1.91566 2.45522 2.36514

To group the data correctly, I used the following R code:

df <- data.frame(test = 1, d = c(rep('patient', 3), rep('control', 3)))
df$d <- factor(df$d, levels = c('patient', 'control'))

Then, to identify differentially methylated positions (DMPs), I used the dmpFinder function from the minfi package with the following code:

library(minfi)  # provides dmpFinder and getAnnotation
mat <- as.matrix(dataset)
dmp <- dmpFinder(mat, pheno = df$d, type = "categorical")

The resulting dmp dataframe contains columns with the cg codes, p-values, q-values, intercept, and f-values.
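One thing that may be worth checking before the dmpFinder call is that the matrix is numeric and that the group labels line up with its columns. The sketch below is only an illustration: it assumes the first column of the file (headed "island") holds the probe IDs, and it rebuilds the two lines shown above as inline text instead of reading the real file. The name `group` is hypothetical.

```r
# Toy input mirroring the two lines shown above; the real data would be read from the .txt file.
txt <- "island patient1 patient2 patient3 control1 control2 control3
cg14817997 3.09546 2.11941 2.90160 1.91566 2.45522 2.36514"

# row.names = 1 moves the probe IDs out of the data columns,
# so as.matrix() yields a numeric matrix rather than a character one.
dataset <- read.table(text = txt, header = TRUE, row.names = 1)
mat <- as.matrix(dataset)

# Group labels in the same order as the columns of `mat` (hypothetical name).
group <- factor(c(rep("patient", 3), rep("control", 3)),
                levels = c("patient", "control"))

# Stop early if the matrix is not numeric or the labels do not match the columns.
stopifnot(is.numeric(mat), ncol(mat) == length(group))
```

If the probe IDs stay in as an ordinary column, as.matrix() returns a character matrix, which is an easy thing to overlook.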

Next, I attempted to use the ChIPseeker package to annotate the DMPs with gene information, using this code:

# Loading the EPIC annotation package
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)

# Saving the annotation of the array used (EPIC) in 'annEPIC'
annEPIC <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)

# Finding corresponding rows in the annotation dataframe
indices <- match(dmp$Name, annEPIC$Name)

# Adding new columns to 'dmp' using these indices
dmp$chr <- annEPIC$chr[indices]
dmp$pos <- annEPIC$pos[indices]

The problem I encountered is that out of my 797,289 rows, 599 of them do not have any assigned chromosome (chr) or position information (pos). I am curious to know if anyone understands the reason behind this issue. Could it be possible that these particular cg sites do not exist in the EPIC array, leading to the absence of chromosome and position data? Or are there other reasons that could explain this data loss?
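To see exactly which probes are affected, the NA entries returned by match() can be used to pull out the unmatched IDs. This is a minimal sketch on made-up data: `dmp_toy` and `ann_toy` are hypothetical stand-ins for `dmp` and `annEPIC`, and the probe IDs and coordinates are invented.

```r
# Hypothetical stand-ins for `dmp` and the annotation table (values are made up).
dmp_toy <- data.frame(Name = c("cg00000001", "cg00000002", "cg00000003"),
                      pval = c(0.01, 0.20, 0.03),
                      stringsAsFactors = FALSE)
ann_toy <- data.frame(Name = c("cg00000003", "cg00000001"),
                      chr  = c("chr7", "chr1"),
                      pos  = c(5000L, 1200L),
                      stringsAsFactors = FALSE)

indices <- match(dmp_toy$Name, ann_toy$Name)

# match() returns NA for IDs that are absent from the annotation.
unmatched <- dmp_toy$Name[is.na(indices)]
length(unmatched)  # how many probes lack an annotation row
head(unmatched)    # a few IDs to look up elsewhere
```

On the real data the same two lines applied to `dmp$Name` and `indices` should list the 599 probes in question.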

I would greatly appreciate any insights or suggestions to help me resolve this matter. Thank you in advance for your assistance!

Best regards, Irene

methylation • 1.3k views

updated 15 months ago by Basti ★ 2.0k • written 16 months ago by Irene • 0

Could you share some examples of those probe IDs?

16 months ago by Basti ★ 2.0k

Here are some of the probe IDs that I cannot find:

cg25103802 cg21649660 cg04023162 cg15414698 cg24497877 cg01023902 cg19296556 cg23303415 cg24630373 cg04147158 cg13271783 cg16080958 cg23032316

Thank you for your time.

16 months ago by Irene • 0

There seems to be a mix of probes found exclusively on EPICv2 and on 450k. Maybe it is a custom EPIC array? I see on the Illumina site that it is possible to add 3,000 to 10,000 probes as a custom selection on EPICv2. Otherwise, if you download the EPICv2 and 450k annotations, you may be able to retrieve the annotation for some of those probes (it seems to work on the examples you provided, but there is no guarantee that all of them will be found).
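A rough sketch of that fallback idea, looking each probe up in several annotations in turn and recording where it was found. Everything here is hypothetical: the tables in `ann_list` are tiny made-up stand-ins for the real EPIC, EPICv2 and 450k annotations, and the suffix stripping assumes that EPICv2 IDs may carry a suffix such as "_TC21", which should be checked against the annotation actually downloaded.

```r
# Made-up annotation tables; real ones would come from the downloaded annotations.
ann_list <- list(
  EPIC   = data.frame(Name = "cg00000001", chr = "chr1", pos = 1200L,
                      stringsAsFactors = FALSE),
  EPICv2 = data.frame(Name = "cg00000002_TC21", chr = "chr2", pos = 3400L,
                      stringsAsFactors = FALSE),
  `450k` = data.frame(Name = "cg00000003", chr = "chr7", pos = 5000L,
                      stringsAsFactors = FALSE)
)
probes <- c("cg00000001", "cg00000002", "cg00000003", "cg00000004")

res <- data.frame(Name = probes, chr = NA_character_, pos = NA_integer_,
                  source = NA_character_, stringsAsFactors = FALSE)

for (src in names(ann_list)) {
  ann <- ann_list[[src]]
  # Assumption: drop any "_..." suffix so IDs are comparable across arrays.
  ann_id <- sub("_.*$", "", ann$Name)
  todo <- is.na(res$source)              # probes not yet annotated
  idx  <- match(res$Name[todo], ann_id)  # NA where this annotation lacks the probe
  hit  <- !is.na(idx)
  rows <- which(todo)[hit]
  res$chr[rows]    <- ann$chr[idx[hit]]
  res$pos[rows]    <- ann$pos[idx[hit]]
  res$source[rows] <- src
}
res
```

Probes that still have NA in `source` at the end were not found in any of the tables, which would be consistent with a custom probe selection.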

16 months ago by Basti ★ 2.0k

I tried looking up the probes I couldn't find in the 450k annotation, but I can't find them there either. How did you find them? Using the same code as before, but with the 450k annotation instead of EPIC, I still can't find those probes.

# Saving the annotation of the 450k array in 'ann450k'
ann450k <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)

# Finding corresponding rows in the annotation dataframe
indices <- match(dmp$Name, ann450k$Name)

# Adding new columns to 'dmp' using these indices
dmp$chr <- ann450k$chr[indices]
dmp$pos <- ann450k$pos[indices]

Thank you for your help.

15 months ago by Irene • 0

Using the code you provided, I managed to find some of the probes you listed:

probes <- c("cg25103802", "cg21649660", "cg04023162", "cg15414698", "cg24497877", "cg01023902", "cg19296556", "cg23303415", "cg24630373", "cg04147158", "cg13271783", "cg16080958", "cg23032316")
match(probes, ann450k$Name)

[1]     NA  50723 320619 312950 174215     NA 464206     NA 279653 418979 336337
[12] 295526 380144
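To read that output more easily, the indices can be labelled with the probe IDs. The sketch below simply re-enters the numbers printed above, so it runs without the annotation package; `found` and `not_found` are hypothetical names.

```r
probes <- c("cg25103802", "cg21649660", "cg04023162", "cg15414698",
            "cg24497877", "cg01023902", "cg19296556", "cg23303415",
            "cg24630373", "cg04147158", "cg13271783", "cg16080958",
            "cg23032316")

# Row indices copied from the match() output shown above.
idx <- c(NA, 50723, 320619, 312950, 174215, NA, 464206, NA,
         279653, 418979, 336337, 295526, 380144)
names(idx) <- probes

found     <- names(idx)[!is.na(idx)]  # probes present in the 450k annotation
not_found <- names(idx)[is.na(idx)]   # probes returning NA
not_found
# "cg25103802" "cg01023902" "cg23303415"
```

So ten of the thirteen examples are in the 450k annotation, and the remaining three are presumably the ones to look for in another annotation such as EPICv2.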

15 months ago by Basti ★ 2.0k

score 0 · Answer 1 · 2023-07-30


16 months ago

Zhenyu Zhang ★ 1.2k

I don't know the reason, but here is a good resource for Illumina methylation array annotations. You could check whether those probes are annotated there: https://zwdzwd.github.io/InfiniumAnnotation
