I couldn't find a clear solution how to prepare microarray data to analyze it. I have 2 examples:
1. install.packages("plsgenomics")
library("plsgenomics")
data(Colon)
data<-t(Colon$X)
rownames(data)<-Colon$gene.names
grp<-Colon$Y
grp = as.factor(grp)
If we check boxplot we see IQR's of each 62 samples do not match.
boxplot(data)
To solve this problem I tried following code, is it correct?
data = log(data)
data = scale(data)
If we check the side by side boxplot again we can observe IQR and medians match. We can also check mean and sd of each samples:
apply(data, 2,mean)
apply(data, 2, sd)
All means are around 0 and sd's are around 1.
2. source("http://bioconductor.org/biocLite.R")
biocLite("golubEsets")
library('golubEsets')
two_classes <- TRUE
# The training data set.
data('Golub_Train')
x <- t(exprs(Golub_Train))
y <- as.vector(pData(Golub_Train)$ALL.AML)
if(!two_classes) {
y[pData(Golub_Merge)$T.B.cell == "B-cell"] <- "ALL-B"
y[pData(Golub_Merge)$T.B.cell == "T-cell"] <- "ALL-T"
}
golub_train <- list(x = x, y = factor(y))
# The test data set.
data('Golub_Test')
x <- t(exprs(Golub_Test))
y <- as.vector(pData(Golub_Test)$ALL.AML)
if(!two_classes) {
y[pData(Golub_Test)$T.B.cell == "B-cell"] <- "ALL-B"
y[pData(Golub_Test)$T.B.cell == "T-cell"] <- "ALL-T"
}
golub_test <- list(x = x, y = factor(y))
# The combined training and test data sets.
data('Golub_Merge')
x <- t(exprs(Golub_Merge))
y <- as.vector(pData(Golub_Merge)$ALL.AML)
if(!two_classes) {
y[pData(Golub_Merge)$T.B.cell == "B-cell"] <- "ALL-B"
y[pData(Golub_Merge)$T.B.cell == "T-cell"] <- "ALL-T"
}
golub <- list(x = x, y = factor(y))
data<-t(golub$x)
grp<-golub$y
I want to use a variance filter first and select top 2000 genes.
RowVar <- function(x) {
rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)
}
rvar = RowVar(data)
qt <- quantile(rvar,probs = 1-2000/dim(data)[1],na.rm = T)
data <- data[apply( data , 1 , function(x) var(x) > qt[1] ),]
Again let's check side by side boxplot we observe IQR and medians mismatch again. We can't take the log directly as there are negative values. I use following idea:
data = data - min(data) + 0.01
data = log(data)
data = scale(data)
We can't see the match of IQR and match of medians like we did in the first example, but results are much better than raw expressions.
Do you think they are correct or have any better-preprocessing suggestions?
You could check
limma
package.