Hi everyone. Im doing an analysis (DGE) on a count matrix (samples as columns and genes as rows). I have 7 classes ( 6 tumor sample classes and 1 healthy control). I need to do single tumor vs healthy control for each tumor. In order to go faster and avoid to do everything for each class I have done the fit of all the count matrix (DGEList) and then selected the specific contrast (6 contrasts in total).
#y is the DGEList with the count matrix with all classes so the design has 7 columns
fit.multi.class <- glmFit(y, design)
# build the contrast
contrast <- makeContrasts(
BVsH=Breast - HC, #Breast - Healthy Control
CVsH=CRC - HC, #CRC - Healthy Control
GVsH=GBM - HC, #GBM - Healthy Control
HVsH=Hepatobiliary - HC, #Liver - Healthy Control
LVsH=Lung - HC, #Lung - Healthy Control
PVsH=Pancreas - HC, #Pancreas - Healthy Control
levels=colnames(design))
# Breast vs HC
lrt.brca.hc <- glmLRT(fit.multi.class, contrast = contrast[,"BVsH"])
# CRC vs HC
lrt.crc.hc <- glmLRT(fit.multi.class, contrast = contrast[,"CVsH"])
# GBM vs HC
lrt.gbm.hc <- glmLRT(fit.multi.class, contrast = contras[,"GVsH"])
# HBC vs HC
lrt.hbc.hc <- glmLRT(fit.multi.class, contrast = contrast[,"HVsH"])
# NCSLC vs HC
lrt.lung.hc <- glmLRT(fit.multi.class, contrast = contrast[,"LVsH"])
# Pancreas vs HC
lrt.paad.hc <- glmLRT(fit.multi.class, contrast = contrast[,"PVsH"])
Everything works well except a big problem. In order to test if this method was actually really giving me the single contrasts I asked , I decided to test it on GBM Vs HC. So As shown below I obtained a sub-matrix with only GBM and HC samples, and then applied the contrast and the results are different! How is possible , I thought the operation I made above would give me the same PValue/FDR , I know that above I did a "multi-class" fit but I thought that the contrast was exactly for giving me PValue for each single contrast. I dont know now which should I use for my analysis.
#y is the count matrix with only GBM and HC samples
fit <- glmFit(y, design)
# build the contrast
contrast <- makeContrasts(GBM - HC,levels=colnames(design))
# perform likelihood ratio test for differential expression
lrt <- glmLRT(fit,contrast = contrast)
lrt in the end comes out different from lrt.gbm.hc computed before in terms of PValues/FDR.
First of all thank you for your reply , it is really clear and well explained. Your understanding is correct. The result in terms of classification are better with the first method indeed , I was just scared was not right. The differences are not to much , for instance the top genes are basically the same , but this little differences brings in the end some genes that are not in both. Here is the code for the sub-matrix :
Last question , I do the normalization of the count matrix before actually do the fit (method = "TMM") and of course doing the normalization on the full matrix (with all classes) produce slightly different results than doing it on the sub-matrix (2 classes) , this is normal right?
NP! Your code for subsetting looks correct to me. It makes sense that the differences are not too much. The genes that appear significant in the second method but not in the first may have high variance. For the TMM normalization, I think it should be normal because TMM uses a scaling factor computed on the mean between each pair of samples