I was doing a differential gene expression analysis on a microarray dataset with three controls and 7 case samples. I used the limma package in RStudio to do the analysis. Wrote the following code based on Bioconductor vignettes and online tutorials:
library(limma)
edata<-exprs(group) #EXPRESSION MATRIX
mod<-model.matrix(~group_pheno$status) #group_pheno contains data on the case/control phenotype of the samples
fit<-lmFit(edata,mod)
ebay<-eBayes(fit)
res<-topTable(ebay,sort.by = "none",number = dim(edata)[1])
I got around 91 differentially expressed genes from the above analysis (adj.P.val < 0.05).
However, when I tried to do the same analysis in the GEO2R option in the GEO database, I got no DEGs at the significance level adj.P.Val < 0.05. The lowest reported adj.P.value was around 0.079. I checked the Rscript given in the webpage for this analysis (given below)
# load series and platform data from GEO
gset <- getGEO("GSE34526", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
# group names for all samples
gsms <- "0001111111"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }
# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
(qx[6]-qx[1] > 50 && qx[2] > 0) ||
(qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }
# set up the data and proceed with analysis
sml <- paste("G", sml, sep="") # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)
tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
write.table(tT, file=stdout(), row.names=F, sep="\t")
It seems that here they have used contrast matrix designs to calculate the DEGs. Can you please explain the correct approach to this problem and Where the mistake is in my code?
Thanks in Advance!
Thank you, Kevin!
The group_pheno is a data frame (which I created manually) with two columns, the first one being the sample name and the other is the disease status (case or control).
I tried this as well by adding the argument coef ="group_pheno$statusPolycystic ovary Syndrome" which is the second column in mod ---
res<-topTable(ebay,sort.by = "B",number = dim(edata)[1],coef = "group_pheno$statusPolycystic ovary Syndrome",adjust.method = "fdr")
Still, I am getting the same result of 91 DEGs as opposed to what I got from the GEO2R (0 DEG). Any ideas on where it might have gone wrong?
Thanks for pasting that data, but how does your design matrix relate to the
G1
andG0
of the bottom code chunk? On face value, it looks like you are comparing 2 different traits in your top and bottom code chunks. That will explain the difference in end results.Just take your time, and go through it step by step.
By the way, based on topTable's default behaviour, I think that your results are those of a differential expression comparison between Polycystic Ovary Syndrome status
1
versus0
, i.e., the final column in your design matrix,mod
. However, in the bottom code chunk, this is not what is being compared.PS -
group_pheno$statusPolycystic ovary Syndrome
is an ugly name for something like this. You should rename either the columns of the design matrix, or the original values in group_pheno.Then topTable would be:
After that, please do not even compare the results to those of the bottom code chunk, because different traits are being compared.
Thank you for the suggestion, Kevin! Will make that change in the colnames.
My ultimate goal is to identify the genes that are differentially expressed in the seven POS case compared to the three normal healthy controls, so is the code chunk (top) I used does this comparison? or should I go with the GEO2Rs (bottom chunk) for my analysis?
I am too struggling with understanding how they can be related. I understand in the bottom chunk that they are assigning a G0/G1 (control/case) factor for the samples in the Expression Set and later use that to compare the groups via contrast matrix. But it does not seem straight forward in how the groups are being compared...
Hello Kevin!
Could you kindly please clarify my doubt on which code chunk to use for the type of analysis I am working on.
I'm quite new to this and really clueless about which is the right approach for my work...
Thank you
Hey Dude / Dudette. I would use the first code chunk. The second one, which is [I believe] automatically generated by GEO, can be misleading and sometimes incorrect / unsuitable for the study.
I believe that should work.