In a similar situation, I would suggest limiting the number of probes and genes considering some metrics like being differentially methylated or being differentially expressed. Otherwise, you will end up in a situation doing a test thousands/millions of times! In your case, if you include all probes and all genes then you would need at least 16,000,000,000 times (800 K probes x 20 K genes) to do the correlation analysis.
I have prepared a tutorial for doing different types of analysis on methylation data and they are available here and here. Check them out. You may find the section "combined analysis" and I guess that is kind of what you are looking for.
Anyway to answer your question and give you some insight on code;
[UPDATED on 2022-05-16]
# toy version of the input data
DF1 = data.frame(pat1meth = runif(10, min=0, max=1),
pat2meth = runif(10, min = 0, max = 1),
pat3meth = runif(10, min = 0, max = 1),
row.names = paste0('probe', seq(1:10)))
DF2 = data.frame(pat1exp = runif(10, min=5, max=15),
pat2exp = runif(10, min = 5, max = 25),
pat3exp = runif(10, min = 5, max = 30),
row.names = paste0('gene', seq(1:10)))
# creating an empty dataframe to store for loops results
corRes = data.frame(probe=character(0),gene=character(0), pval=numeric(0), cor=numeric(0))
for(p in 1:nrow(DF1)){ # DF1 contains methylation data and probes are in rows
pr = as.numeric(DF1[p,])
probe = rownames(DF1)[p]
print(probe)
for(g in 1:nrow(DF2)){ # DF2 contains expression data and genes are in rows
gene = rownames(DF2)[g]
print(gene)
ge = as.numeric(DF2[g,])
# corelation calculation
res <- cor.test(pr,ge, method = "spearman")
# saving pvalue
pval = round(res$p.value, 4)
# corr coeff
cor = round(res$estimate, 4)
# putting data into the result dataframe
idx = nrow(corRes)+1
corRes[idx,] <- c(probe,gene, pval, cor)
}
}
Result
corRes
probe gene pval cor
1 probe1 gene1 1 -0.5
2 probe1 gene2 0.3333 -1
3 probe1 gene3 1 -0.5
4 probe1 gene4 1 -0.5
5 probe1 gene5 1 -0.5
6 probe1 gene6 1 -0.5
7 probe1 gene7 1 0.5
8 probe1 gene8 1 -0.5
9 probe1 gene9 1 0.5
10 probe1 gene10 1 0.5
11 probe2 gene1 1 0.5
12 probe2 gene2 1 -0.5
13 probe2 gene3 1 0.5
14 probe2 gene4 1 0.5
15 probe2 gene5 1 0.5
16 probe2 gene6 1 0.5
Hi thank you very much for your quick response! I have tried out this code but unfortunately keep getting the error:
Error in cor.test.default(tmp$meth, tmp$exp, method = "spearman") : 'y' must be a numeric vector
The tmp dataframe looks like this:
res <- cor.test(tmp$meth, tmp$exp, method = "spearman")
Therefore doesn't work because tmp$exp isn't one column but many columns. The tmp dataframe has pulled the gene expression levels for every sample for this particularly gene and put them into a separate column instead of putting each value into a seperate row.
What should I do in this instance?
Oh I see, to make the code work, I modified that a bit. So there is no need to make
tmp
data frame anymore and also I addedas.numeric()
wherepr
andge
are defined. Not having this numeric conversion was the reason that you got the error"Error in cor.test.default(tmp$meth, tmp$exp, method = "spearman") : 'y' must be a numeric vector"
This time I used simulated data framesDF1
andDF2
to replicate your scenario. See the updated reply.If you find my reply helpful please upvote and if that solves your problem, please accept it.