Hi,
I have a data frame with gene names, regression estimates and logFC values. The regression estimates come from 5mC methylation data: a positive estimate indicates hypermethylation and a negative estimate indicates hypomethylation in the disease group. These estimates are averaged at gene level; initially I had them for each CpG site. The logFC values were computed by limma: a positive value means the gene is up-regulated in disease, and a negative value means it is down-regulated. This is what my data frame looks like:
> data[1:3,]
Gene Reg_Beta logFC
1 A1BG 0.012759505 -0.01594659
2 A1CF 0.003407954 0.01044036
3 A2M 0.004816774 0.37067536
Can anybody tell me whether I can obtain a correlation between Reg_Beta (average beta value for the methylation status of a gene) and logFC (expression value of that gene) at gene level? In the end I would like to identify the genes whose methylation is strongly anti-correlated with their expression.
I am new to methylation analysis, so any constructive suggestion or comment will be highly appreciated! Thanks.
For a correlation you'll need more than one data point per group. For each gene you have Group A: Reg_Beta (one value) and Group B: logFC (one value). For a proper correlation you need a set of points for both A and B.
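To illustrate the point, here is a minimal sketch in base R using only the three rows shown in the question. It assumes the column names Gene, Reg_Beta and logFC from that output; the choice of Spearman is my own assumption, not something the poster specified. A single pair per gene gives no defined correlation, whereas pooling all genes gives one overall coefficient that describes the gene set as a whole, not any individual gene.

```r
# The three rows shown in the question (far too few for a real analysis)
data <- data.frame(
  Gene     = c("A1BG", "A1CF", "A2M"),
  Reg_Beta = c(0.012759505, 0.003407954, 0.004816774),
  logFC    = c(-0.01594659, 0.01044036, 0.37067536)
)

# One gene = one (Reg_Beta, logFC) pair, so its correlation is undefined
one_gene <- data[data$Gene == "A1BG", ]
r_single <- suppressWarnings(cor(one_gene$Reg_Beta, one_gene$logFC))
is.na(r_single)

# Across genes there are several pairs, so one global coefficient exists.
# It says nothing about which individual genes are anti-correlated.
r_global <- cor(data$Reg_Beta, data$logFC, method = "spearman")
r_global
```

So with this table you can ask whether methylation change and expression change go in opposite directions across genes overall, but not rank genes by their own correlation.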
Thank you for your comment. Suppose I instead use the original beta values (averaged per gene) for the disease group and for the healthy group, and add the log2-normalized expression values for the diseased and healthy samples (also at gene level). How would I then get what I am looking for? For example, col 1 would be the gene name, cols 2:35 the beta values of diseased samples, cols 36:70 the beta values of healthy samples, cols 71:91 the expression values of diseased samples and cols 92:102 the expression values of healthy samples. Can you guide me on how to design the comparisons in this case so that the results make sense and I get what I am looking for?
For a correlation you'll need the same number of values in Group A as in Group B, and they need to be paired. This pairing needs to be meaningful (not random), for example methylation and expression measured on the same sample.
I am not sure what your Reg_Beta values are or how they link to your expression values. Are these paired?
If you make a plot with A on the x-axis and B on the y-axis, then each pair of values becomes one point. The correlation then measures how well all these points fit a line.
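As a rough sketch of what a per-gene correlation could look like when the data really are paired: the example below uses simulated values only, and the matrix names beta and expr, the sample IDs and the choice of Spearman with BH adjustment are all assumptions for illustration. The key requirement is that both matrices have the same samples in the same column order, so that each sample contributes one (methylation, expression) point per gene.

```r
set.seed(1)  # simulated data only, for illustration

n_samples <- 10
genes   <- c("A1BG", "A1CF", "A2M")
samples <- paste0("S", seq_len(n_samples))  # hypothetical sample IDs

# Hypothetical matrices: rows = genes, columns = the SAME samples, SAME order
beta <- matrix(runif(length(genes) * n_samples), nrow = length(genes),
               dimnames = list(genes, samples))
expr <- matrix(rnorm(length(genes) * n_samples), nrow = length(genes),
               dimnames = list(genes, samples))
# Make one gene anti-correlated on purpose so there is something to find
expr["A2M", ] <- -5 * beta["A2M", ] + rnorm(n_samples, sd = 0.05)

# Pairing check: column i of beta must be the same sample as column i of expr
stopifnot(identical(colnames(beta), colnames(expr)))

# One correlation per gene, computed across the paired samples
per_gene <- t(sapply(genes, function(g) {
  ct <- cor.test(beta[g, ], expr[g, ], method = "spearman")
  c(rho = unname(ct$estimate), p.value = ct$p.value)
}))

result <- data.frame(Gene = rownames(per_gene), per_gene, row.names = NULL)
result$padj <- p.adjust(result$p.value, method = "BH")
result[order(result$rho), ]  # most anti-correlated genes first
```

Plotting beta["A2M", ] against expr["A2M", ] gives exactly the picture described above: one point per sample. If the methylation and expression samples are different individuals (as the unequal column counts in the question suggest), this pairing does not exist and a per-gene correlation of this kind is not possible.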
Hope this helps. If not, please read up on correlation, e.g. on the wiki.