Correlating DNA methylation beta values with Gene expression levels (RNA-seq)
1
1
Entering edit mode
2.6 years ago
sswang25 ▴ 20

I am currently trying to write code in R to analyse the relationship between DNA methylation values at probe sites and gene expression levels. I have DNA methylation data generated from the Illumina Epic array for 50 patients. I also have these patients normalized RNA-seq gene expression levels in VST form. I have 2 dataframes:

DF 1 contains the beta values for all the methylation probes for all 50 patients. Patients are columns and probes are rows

DF 2 contains the VST data for all 50 patients - patients are columns and genes are rows

I would like to test the correlation (just using spearmans rank) between each individual probe methylation value and each individual gene's expression level for all the patients.

I hope to produce a matrix of R values for each probe/gene combination which I can then use to plot either a volcano plot or MA plot.

I am very new to R and therefore I don't know how to code something like this. The reason I am doing every single correlation is because I am hoping that this catch all approach might provide previously unknown relationships.

correlation methylation DNA RNA-seq R • 1.8k views
ADD COMMENT
1
Entering edit mode
2.6 years ago

In a similar situation, I would suggest limiting the number of probes and genes considering some metrics like being differentially methylated or being differentially expressed. Otherwise, you will end up in a situation doing a test thousands/millions of times! In your case, if you include all probes and all genes then you would need at least 16,000,000,000 times (800 K probes x 20 K genes) to do the correlation analysis.
I have prepared a tutorial for doing different types of analysis on methylation data and they are available here and here. Check them out. You may find the section "combined analysis" and I guess that is kind of what you are looking for.

Anyway to answer your question and give you some insight on code; [UPDATED on 2022-05-16]

# toy version of the input data
DF1 = data.frame(pat1meth = runif(10, min=0, max=1),
                 pat2meth = runif(10, min = 0, max = 1),
                 pat3meth = runif(10, min = 0, max = 1),
                 row.names = paste0('probe', seq(1:10)))

DF2 = data.frame(pat1exp = runif(10, min=5, max=15),
                 pat2exp = runif(10, min = 5, max = 25),
                 pat3exp = runif(10, min = 5, max = 30),
                 row.names = paste0('gene', seq(1:10)))


# creating an empty dataframe to store for loops results
corRes = data.frame(probe=character(0),gene=character(0), pval=numeric(0), cor=numeric(0))

for(p in 1:nrow(DF1)){ # DF1 contains methylation data and probes are in rows
        pr = as.numeric(DF1[p,])
        probe = rownames(DF1)[p]
        print(probe)
        for(g in 1:nrow(DF2)){ # DF2 contains expression data and genes are in rows
                gene = rownames(DF2)[g]
                print(gene)
                ge = as.numeric(DF2[g,])
                # corelation calculation
                res <- cor.test(pr,ge, method = "spearman")
                # saving pvalue
                pval = round(res$p.value, 4)
                # corr coeff
                cor = round(res$estimate, 4)
                # putting data into the result dataframe
                idx = nrow(corRes)+1
                corRes[idx,] <- c(probe,gene, pval, cor)
        }
}

Result

corRes
      probe   gene   pval  cor
1    probe1  gene1      1 -0.5
2    probe1  gene2 0.3333   -1
3    probe1  gene3      1 -0.5
4    probe1  gene4      1 -0.5
5    probe1  gene5      1 -0.5
6    probe1  gene6      1 -0.5
7    probe1  gene7      1  0.5
8    probe1  gene8      1 -0.5
9    probe1  gene9      1  0.5
10   probe1 gene10      1  0.5
11   probe2  gene1      1  0.5
12   probe2  gene2      1 -0.5
13   probe2  gene3      1  0.5
14   probe2  gene4      1  0.5
15   probe2  gene5      1  0.5
16   probe2  gene6      1  0.5
ADD COMMENT
0
Entering edit mode

Hi thank you very much for your quick response! I have tried out this code but unfortunately keep getting the error:

Error in cor.test.default(tmp$meth, tmp$exp, method = "spearman") : 'y' must be a numeric vector

The tmp dataframe looks like this:

res <- cor.test(tmp$meth, tmp$exp, method = "spearman")

Therefore doesn't work because tmp$exp isn't one column but many columns. The tmp dataframe has pulled the gene expression levels for every sample for this particularly gene and put them into a separate column instead of putting each value into a seperate row.

What should I do in this instance?

ADD REPLY
1
Entering edit mode

Oh I see, to make the code work, I modified that a bit. So there is no need to make tmp data frame anymore and also I added as.numeric() where pr and ge are defined. Not having this numeric conversion was the reason that you got the error "Error in cor.test.default(tmp$meth, tmp$exp, method = "spearman") : 'y' must be a numeric vector" This time I used simulated data frames DF1 and DF2 to replicate your scenario. See the updated reply.

If you find my reply helpful please upvote and if that solves your problem, please accept it.

ADD REPLY

Login before adding your answer.

Traffic: 2506 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6