Question

In R scripts of GEO2R which line is responsible for background correction and replacing replicated probes with the mean?

5.4 years ago

Sib ▴ 70

The R script below, produced by GEO2R, is what I use to analyze my microarray data. I want to know which lines are responsible for (1) replacing replicated probes with their mean and (2) background correction.

#   Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)

# load series and platform data from GEO

gset <- getGEO("GSE116959", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL17077", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
gsms <- "00010000000100000000000000000000011110000010000000000000000011010010"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }

# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0) ||
          (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }

# set up the data and proceed with analysis
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

R Differential expression analysis • 2.6k views

score 3 · Accepted Answer · 2020-01-04

5.4 years ago

Ahill ★ 2.0k

From a quick look at the script and at the GEO record for GSE116959, I'd say no lines in this code perform either replacement of replicated probes or background correction. The source GEO data appear to be log2-scale, quantile-normalized Agilent array intensities, so I expect LogC to be false in your script and the log2 transform to be skipped. The final 10 lines of the script set up a linear model, fit it, and compute differential expression between your two factor levels. If replacement of replicated probes or background correction is being done, it is neither in this script nor mentioned in the GEO data processing description. Something like background correction might be done as part of the primary Agilent data reduction, but I can't tell that from the above alone.
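
If you do want to collapse replicated probes yourself, a minimal sketch on toy data could look like the following. The matrix `ex_toy` and the vector `probe_id` are made up for illustration. It averages rows that share an identifier using base R only; on a real dataset, `limma::avereps(ex, ID = ...)` does the same kind of averaging. Background correction is a different matter: `limma::backgroundCorrect()` is designed for raw intensity data, so it would not be appropriate for values that are already log2-scale and normalized.

```r
# Toy log2 expression matrix (hypothetical values): 4 probes x 3 samples.
ex_toy <- matrix(c(1, 3, 5, 7,
                   2, 4, 6, 8,
                   3, 5, 7, 9),
                 nrow = 4,
                 dimnames = list(NULL, c("s1", "s2", "s3")))

# Hypothetical probe identifiers; the first two rows are replicates of "A".
probe_id <- c("A", "A", "B", "C")

# Sum the rows for each identifier, then divide by the number of replicates.
sums <- rowsum(ex_toy, group = probe_id)
n_rep <- as.vector(table(probe_id)[rownames(sums)])
ex_avg <- sums / n_rep

print(ex_avg)
```

The result has one row per unique identifier, so any downstream `lmFit()` call would test each identifier once instead of once per replicate.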

Thanks for your answer. Can we say that whenever we obtain a dataset with the getGEO function, the data are already background corrected and duplicates have been replaced with their mean? In other words, if we run our analysis in R (without GEO2R) on data retrieved through this function, is there no need for background correction or for replacing duplicates with the mean?

ADD REPLY • link 5.4 years ago by Sib ▴ 70

It would not be safe to say that background correction and duplicate substitution have been done. That would be determined by the original authors who submitted the data to GEO. GEO does not require specific data processing methods as a practice. For many common expression platforms background subtraction is likely done; and for some GEO submissions, duplicate processing might have been done. But you would need to read the associated manuscript or the data processing description on the GEO record to determine with certainty if that processing has been done for any specific GEO dataset.
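
One way to check this from R is to look at the sample annotation that `getGEO()` returns. The sketch below assumes the submitter's processing notes sit in columns whose names start with `data_processing`, which is what I would expect in the phenotype table of a series matrix, but verify the column names on your own object. `processing_notes()` and the toy table `pheno_toy` are hypothetical names used only for illustration.

```r
# Hypothetical helper: collect the distinct processing descriptions from a
# sample annotation table (e.g. the data frame returned by Biobase::pData()).
processing_notes <- function(pheno) {
  cols <- grep("^data_processing", colnames(pheno), value = TRUE)
  if (length(cols) == 0) return(character(0))
  unique(unlist(lapply(pheno[cols], as.character), use.names = FALSE))
}

# Toy stand-in for pData(gset); the text is invented for illustration.
pheno_toy <- data.frame(
  title = c("sample 1", "sample 2"),
  data_processing = c("quantile normalized, log2", "quantile normalized, log2"),
  stringsAsFactors = FALSE
)

print(processing_notes(pheno_toy))

# With a real ExpressionSet this would be called as:
# processing_notes(Biobase::pData(gset))
```

An empty result only means the expected column is missing, not that no processing was done, so the GEO record and the paper remain the reference.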

ADD REPLY • link 5.4 years ago by Ahill ★ 2.0k

score 3 · Accepted Answer · 2020-01-04

I think GEO2R assumes that the data pulled via getGEO() are already normalized (correct me if I am wrong). That means it uses whatever the authors uploaded when submitting the data. This is the problem, and the reason why I never use this application: you have essentially no control over how the data have been processed or which probes have been removed (e.g. control probes). This can have quite an impact, because control probes can be numerous (thousands on some arrays), and if they are not removed they increase the multiple-testing burden. If possible, I always start from raw CEL files. If these are not available, be very careful with the results.
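
To illustrate the point about control probes and multiple testing, here is a small sketch on made-up numbers. It assumes the feature annotation has a column that flags controls (Agilent annotations often call it `CONTROL_TYPE`, but check `fData(gset)` for the actual name and values); the table `feat_toy` and its p-values are hypothetical.

```r
# Toy feature table (hypothetical): 3 gene probes and 7 control probes.
feat_toy <- data.frame(
  ID = paste0("p", 1:10),
  CONTROL_TYPE = c(rep("FALSE", 3), rep("pos", 4), rep("neg", 3)),
  P.Value = c(0.01, 0.02, 0.03, seq(0.4, 1.0, by = 0.1)),
  stringsAsFactors = FALSE
)

# Assumption: non-control probes are labelled "FALSE" in CONTROL_TYPE.
is_control <- feat_toy$CONTROL_TYPE != "FALSE"

# Benjamini-Hochberg adjustment with all probes included ...
adj_all <- p.adjust(feat_toy$P.Value, method = "BH")

# ... and after dropping the control probes before the adjustment.
adj_filtered <- p.adjust(feat_toy$P.Value[!is_control], method = "BH")

print(data.frame(ID = feat_toy$ID[!is_control],
                 adj_all = adj_all[!is_control],
                 adj_filtered = adj_filtered))
```

In a real analysis the same filter would be applied to the expression object before `lmFit()`, so that `topTable()` adjusts over gene probes only.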