Question

How to study correlation from a log2CPM count matrix in R?


3.2 years ago

ali ▴ 20

Hi everyone, I'm doing some RNA-Seq analysis on data taken from GEO. I'm new to this field and I'm following the steps shown in their PDF, but I don't really know how to do this correlation step. I have genes and samples (tumor [BRCA] and non-tumor). Given the count matrix with all samples and the related sample/condition matrix, how can I do what the PDF says (the pictures below explain the steps)?

logcpm <- cpm(counts, prior.count = 1, log = TRUE)  # log2 counts per million (edgeR)

These are parts of my matrices (left: log2CPM matrix; right: sample/condition matrix):

[Image: excerpt of the log2CPM matrix (left) and the sample/condition matrix (right)]

The pictures below show the part of their PDF that describes the plot. I should reproduce the same plot, but I don't know how to use the cor function with a count matrix.

[Image: this is what they did (method description from the PDF)]

[Image: this is the plot (reference plot from the PDF)]

GEO R • 1.5k views

ADD COMMENT • link 3.2 years ago by ali ▴ 20


It's just cor(countmatrix), nothing more.
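
To make clear what that call returns, here is a minimal sketch on simulated data. The matrix `logcpm` and the sample names are made up for illustration and are not the poster's data. cor() correlates columns, so a genes-by-samples matrix gives a samples-by-samples correlation matrix.

```r
set.seed(1)
# Hypothetical stand-in for a log2CPM matrix: 100 genes (rows) x 6 samples (columns)
logcpm <- matrix(rnorm(600, mean = 5, sd = 2), nrow = 100,
                 dimnames = list(paste0("gene", 1:100),
                                 c(paste0("Tumor", 1:3), paste0("Healthy", 1:3))))

# cor() works column-wise: the result is a 6 x 6 sample-by-sample matrix
sample_cor <- cor(logcpm, method = "pearson")
dim(sample_cor)

# To correlate genes with each other instead, transpose first: cor(t(logcpm))
```

With 286 samples the result would be a 286 x 286 matrix, which is usually viewed as a heatmap rather than a scatter plot.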

ADD REPLY • link 3.2 years ago by ATpoint 88k


Yes, but how can I plot it? For example, how do I plot Healthy versus Tumor? I do have the corresponding sample/condition matrix, but I don't know how to make the plot from the correlation matrix and the condition.

ADD REPLY • link 3.2 years ago by ali ▴ 20


Use plot(x, y) and then add text with the correlation calculated by cor, or use the corrplot package. Please be precise about what you want to do if you need code suggestions.
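
A minimal sketch of the plot(x, y) plus cor suggestion, using base R only. The vectors `x` and `y` are simulated placeholders for any two expression vectors of equal length, and the choice of legend() to print the coefficient is just one option.

```r
set.seed(42)
# Hypothetical data: x and y stand in for two expression vectors of equal length
x <- rnorm(200, mean = 5, sd = 2)
y <- x + rnorm(200, sd = 0.5)

r <- cor(x, y)  # Pearson correlation by default

plot(x, y, pch = 16, cex = 0.6, xlab = "x", ylab = "y")
abline(0, 1, lty = 2)  # identity line for reference
# Print the coefficient in the top-left corner of the plot
legend("topleft", legend = paste0("r = ", round(r, 3)), bty = "n")
```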

ADD REPLY • link 3.2 years ago by ATpoint 88k


OK, first of all, thanks for the reply and for your time. I have raw counts for tumor (BRCA) and healthy control samples. There are 286 samples, and approximately 5000 genes remain after filtering. I had to compute differential gene expression, which I did by following the edgeR guide step by step:

library(edgeR)
dge <- DGEList(counts = counts, genes = rownames(counts))
dge <- calcNormFactors(dge, method = "TMM")  # Normalize (that's what the experiment owner did)
design <- model.matrix(~ 0 + Tissue)  # Tissue is a factor: BRCA tumor / healthy control
# Dispersion
dge <- estimateGLMCommonDisp(dge, design = design, verbose = TRUE)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
# Differential expression
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, contrast = single.contrast)  # single.contrast is the contrast Tumor - HC

Now, after all of this, I have a table with log2 fold change, logCPM, FDR and p-value, and everything is fine up to here: I did the calculations, obtained the up- and down-regulated genes, and ran DAVID GO to find pathways, which came out the same as in the experiment. But they also made this correlation matrix that I want to reproduce, and I really don't know how to do it. I don't know whether I should use the DGEList or go back to the raw counts and do something else independently.

So my final question is: how can I obtain the mean log2CPM of each condition for each gene? In the plot, I assume that the dots are genes and that the log2CPM values are the per-gene means for each condition (Tumor and Healthy). The thing is that I have 286 samples, not 2, so I don't understand what I have to look at. My idea is that maybe I have to compute log2CPM from the count matrix and group the samples by condition using the rowMeans() function, and then, once I have the means, draw a scatter plot? I'm kind of lost.
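
The idea in the last paragraph can be sketched as follows. This is only one reading of the reference plot, since the PDF's exact method is not shown here. The data are simulated with base R; `logcpm`, `Tissue` and the group sizes are hypothetical stand-ins for the real log2CPM matrix and condition factor.

```r
set.seed(7)
# Hypothetical stand-ins: in practice logcpm would come from cpm(..., log = TRUE)
# and Tissue from the sample/condition table, in the same order as the columns
n_genes <- 500
Tissue <- factor(rep(c("Tumor", "Healthy"), times = c(6, 4)))
base <- rnorm(n_genes, mean = 5, sd = 2)
logcpm <- sapply(seq_along(Tissue), function(i) base + rnorm(n_genes, sd = 0.5))
rownames(logcpm) <- paste0("gene", seq_len(n_genes))

# Mean log2CPM per gene within each condition (one value per gene)
mean_tumor   <- rowMeans(logcpm[, Tissue == "Tumor",   drop = FALSE])
mean_healthy <- rowMeans(logcpm[, Tissue == "Healthy", drop = FALSE])

# Each dot is a gene; the correlation summarises agreement between conditions
r <- cor(mean_healthy, mean_tumor)
plot(mean_healthy, mean_tumor, pch = 16, cex = 0.5,
     xlab = "Healthy mean log2CPM", ylab = "Tumor mean log2CPM",
     main = paste0("Pearson r = ", round(r, 3)))
```

The key assumption is that the columns of the log2CPM matrix are in the same order as the entries of the condition factor; if they are not, the sample table has to be reordered to match before subsetting.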

ADD REPLY • link 3.2 years ago by ali ▴ 20