In the R script generated by GEO2R, which lines are responsible for background correction and for replacing replicated probes with the mean?
Sib ▴ 60 · 5.0 years ago

The commands below are the R script used to analyze my microarray data. I want to know which lines are responsible for: 1) replacing replicated probes with the mean, and 2) background correction.

#   Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)

# load series and platform data from GEO

gset <- getGEO("GSE116959", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL17077", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group assignment for all samples: each character ("0"/"1") labels one sample
gsms <- "00010000000100000000000000000000011110000010000000000000000011010010"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }

# log2 transform (applied only if the values do not already look log-scaled)
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=TRUE))
LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0) ||
          (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)  # heuristic: large values or a wide range suggest raw intensities
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }

# set up the design and proceed with the analysis
sml <- paste("G", sml, sep="")    # set group names (G0, G1)
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)   # one column per group, no intercept
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)   # contrast of interest: G1 vs G0
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)   # top 250 probes by B statistic, FDR-adjusted p-values
Ahill ★ 2.0k · 5.0 years ago

From a quick look at the script and at the GEO record for GSE116959, I'd say no lines in this code perform either replacement of replicated probes or background correction. The source GEO data appear to be log2-scale, quantile-normalized Agilent array intensities, so in your script I expect LogC will be FALSE. The final ten lines of the script set up a linear model, fit it, and compute differential expression between your two factor levels. If replacement of replicated probes or background correction was done, it is not in this script, nor is it mentioned in the GEO data-processing description. Something like background correction may have been done as part of the primary Agilent data reduction, but that can't be determined from the script alone.
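
For context: if those steps were needed, they would normally be applied to the raw data with limma before any linear modeling. A minimal sketch, assuming raw single-channel Agilent files (the file names here are placeholders, not from this dataset):

library(limma)
files <- c("array1.txt", "array2.txt")                  # placeholder raw Agilent files
x <- read.maimages(files, source="agilent", green.only=TRUE)
y <- backgroundCorrect(x, method="normexp")             # background correction
y <- normalizeBetweenArrays(y, method="quantile")       # between-array normalization
y <- avereps(y, ID=y$genes$ProbeName)                   # replace replicated probes with their mean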

Thanks for your answer. Can we say that whenever we obtain a dataset with the getGEO function, the data have already been background corrected and duplicate probes replaced with their mean? And if we run our own analysis in R (without GEO2R) but fetch the data through this function, is there then no need to perform background correction and duplicate averaging ourselves?

It would not be safe to assume that background correction and duplicate substitution have been done. That is determined by the original authors who submitted the data to GEO; GEO does not mandate specific data-processing methods. For many common expression platforms background subtraction has likely been done, and for some submissions duplicate probes may have been averaged, but you would need to read the associated manuscript or the data-processing description on the GEO record to determine with certainty whether that processing was applied to any specific dataset.
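
If you want to see what the submitters report having done, getGEO() keeps their data-processing description in the phenoData of the returned object. A quick sketch (the exact column name can vary between series):

library(GEOquery)
gset <- getGEO("GSE116959", GSEMatrix=TRUE)[[1]]
unique(pData(gset)$data_processing)   # submitter-supplied processing description, one entry per sample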

ATpoint 86k · 5.0 years ago

I think GEO2R assumes that the data are already normalized when pulling them via getGEO() (correct me if I am wrong). That is, it uses whatever the authors uploaded when submitting the data. This is actually the problem, and the reason I never use this application: you have essentially no control over how the data were processed and which probes were removed (e.g. control probes). This can have quite an impact, because control probes can be numerous (thousands on some arrays), which then influences things like multiple-testing correction (if not removed, they increase the multiple-testing burden). If possible I always start from the raw CEL files. If these are not available, be very careful with the results.
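
To illustrate the control-probe point: if the platform annotation in the getGEO() object carries a control-type column (inspect fvarLabels(gset); CONTROL_TYPE is typical for Agilent platforms, but that is an assumption here), the controls can be dropped before fitting:

ctrl <- fData(gset)$CONTROL_TYPE            # assumed column name; check fvarLabels(gset) first
gset <- gset[is.na(ctrl) | ctrl %in% c(0, "false", "FALSE"), ]   # keep only regular probes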
