Hi all
I am trying to analyze a expression profiling dataset from GEO, which was detected by Affymetrix microarray. Its data processing method was described as "The data were analyzed with Microarray Suite GCOS using Affymetrix default analysis settings and global scaling as normalization method. The trimmed mean target intensity of each array was arbitrarily set to 100."
I performed simply log2 transformation and then differential expression analysis, but got very large FC value(beyond 100). It was so weird. And I used the online tool GEO2R to analyze the same dataset and the result looked much more normal. And I looked at its R script(showed below). It seems that transformations are performed only under certain conditions.
I wondered if this was reasonable and usual to do like that. Can I just use the result or script to do my research?
Looking forward to your reply. Thank you~
# log2 transformation
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) || (qx[6]-qx[1] > 50 && qx[2] > 0)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }
You need to share full relevant code to diagnose the issue. This code snippet here does thy to figure out whether the data Geo2R is using are already on log scale or not. If the values are high then it applies log transformation.
Thank you for your reply. I just used the online tool in the GEO database. I understand it applied transformation when the value too high. But I am just curious about whether it is make sense if I use this result in my assay
Here is the full R script
<h6>#</h6>Differential expression analysis with limma
library(GEOquery) library(limma) library(umap)
load series and platform data from GEO
gset <- getGEO("GSE26051", GSEMatrix =TRUE, AnnotGPL=TRUE) if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1 gset <- gset[[idx]]
make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
group membership for all samples
gsms <- "111111XXXXXXXXXXXXXXXXX0000000XXXXXXXXXXXXXXXX" sml <- strsplit(gsms, split="")[[1]]
filter out excluded samples (marked as "X")
sel <- which(sml != "X") sml <- sml[sel] gset <- gset[ ,sel]
log2 transformation
ex <- exprs(gset) qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T)) LogC <- (qx[5] > 100) || (qx[6]-qx[1] > 50 && qx[2] > 0) if (LogC) { ex[which(ex <= 0)] <- NaN exprs(gset) <- log2(ex) }
assign samples to groups and set up design matrix
gs <- factor(sml) groups <- make.names(c("lesion","non-lesion")) levels(gs) <- groups gset$group <- gs design <- model.matrix(~group + 0, gset) colnames(design) <- levels(gs)
gset <- gset[complete.cases(exprs(gset)), ] # skip missing values
fit <- lmFit(gset, design) # fit linear model
set up contrasts of interest and recalculate model coefficients
cts <- paste(groups[1], groups[2], sep="-") cont.matrix <- makeContrasts(contrasts=cts, levels=design) fit2 <- contrasts.fit(fit, cont.matrix)
compute statistics and table of top significant genes
fit2 <- eBayes(fit2, 0.01) tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)
tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","GB_ACC","SPOT_ID","Gene.Symbol","Gene.symbol","Gene.title")) write.table(tT, file=stdout(), row.names=F, sep="\t")