Question

[R] Same Analysis but different results ?

1

Entering edit mode

3.0 years ago

ali ▴ 20

Hi everyone. Im doing an analysis (DGE) on a count matrix (samples as columns and genes as rows). I have 7 classes ( 6 tumor sample classes and 1 healthy control). I need to do single tumor vs healthy control for each tumor. In order to go faster and avoid to do everything for each class I have done the fit of all the count matrix (DGEList) and then selected the specific contrast (6 contrasts in total).

   #y is the DGEList with the count matrix with all classes so the design has 7 columns
    fit.multi.class <- glmFit(y, design)

    # build the contrast 
    contrast <- makeContrasts(
      BVsH=Breast - HC, #Breast - Healthy Control
      CVsH=CRC - HC, #CRC - Healthy Control
      GVsH=GBM - HC, #GBM - Healthy Control
      HVsH=Hepatobiliary - HC, #Liver - Healthy Control
      LVsH=Lung - HC, #Lung - Healthy Control
      PVsH=Pancreas - HC, #Pancreas - Healthy Control
      levels=colnames(design))

    # Breast vs HC
    lrt.brca.hc <- glmLRT(fit.multi.class, contrast = contrast[,"BVsH"])

    # CRC vs HC
    lrt.crc.hc <- glmLRT(fit.multi.class, contrast = contrast[,"CVsH"])

    # GBM vs HC
    lrt.gbm.hc <- glmLRT(fit.multi.class, contrast = contras[,"GVsH"])

    # HBC vs HC
    lrt.hbc.hc <- glmLRT(fit.multi.class, contrast = contrast[,"HVsH"])

    # NCSLC vs HC
    lrt.lung.hc <- glmLRT(fit.multi.class, contrast = contrast[,"LVsH"])

    # Pancreas vs HC
    lrt.paad.hc <- glmLRT(fit.multi.class, contrast = contrast[,"PVsH"])

Everything works well except a big problem. In order to test if this method was actually really giving me the single contrasts I asked , I decided to test it on GBM Vs HC. So As shown below I obtained a sub-matrix with only GBM and HC samples, and then applied the contrast and the results are different! How is possible , I thought the operation I made above would give me the same PValue/FDR , I know that above I did a "multi-class" fit but I thought that the contrast was exactly for giving me PValue for each single contrast. I dont know now which should I use for my analysis.

#y is the count matrix with only GBM and HC samples
fit <- glmFit(y, design)

# build the contrast 
contrast <- makeContrasts(GBM - HC,levels=colnames(design))

# perform likelihood ratio test for differential expression
lrt <- glmLRT(fit,contrast = contrast)

lrt in the end comes out different from lrt.gbm.hc computed before in terms of PValues/FDR.

R edgeR GEO • 1.1k views

ADD COMMENT • link updated 3.0 years ago by Giovanni M Dall'Olio 28k • written 3.0 years ago by ali ▴ 20

score 3 · Accepted Answer · 2022-06-28

Let me see if I understand the problem correctly:

you start with a bit counts matrix containing data for several diseases
first, you fit glmLRT on the whole matrix, then extract the contrast for GBM - HC
in parallel, you subset this matrix to get only data for GBM and HC, then fit the model GBM - HC
you observed different results for the two models (how much different?)

In principle yes, this is perfectly normal and expected, and you should preferentially use your first analysis with the big counts matrix, although the second analysis is not wrong either.

I am not an edgeR expert, but I see that glmLRT fits an ANOVA on the coefficients to see if any of them are different from zero. In general, with ANOVA, more data allows modeling the variance more accurately, even if it is data from subgroups that you are not interested in extracting in your contrasts. In layman's words, fitting the model on a bigger dataset will improve the way variance is modeled and make your predictions more accurate. Your model will have more information on the expected variance of each gene across all samples, and this will allow a better estimate on whether the difference observed between GBM and HC is significant or within the expected variance for the gene.

There is one thing to check on your code, though. Could you paste the code you used you create the sub-matrix with only GBM and HC samples? Just to check if it has been done properly, sometimes R's syntax for subsetting is tricky.