Today's topic is the importance of array quality control. Many articles and websites related to microarray analysis have demonstrated that array quality control is privotal process involved in improvement of identification of differential gene expression. I introduce one example associated with array quality through arrayQualityMetrics.
# Open R
# Import ArrayExpress data (E-GEOD-3419) into R
source("http://bioconductor.org/biocLite.R")
biocLite("ArrayExpress")
library(ArrayExpress)
AEset<-ArrayExpress("E-GEOD-3419")
# Normalization of E-GEOD-3419
library(affy)
AEsetnorm<-rma(AEset)
# Array Quality Control through arraQualityMetrics
biocLite("arrayQualityMetrics")
library(arrayQualtiyMetrics)
# arrayQualityMetrics of AEset (RAW data of E-GEOD-3419)
arrayQualityMetrics(expressionset=AEset,outdir="Report_for_E-GEOD-3419",force=TRUE,do.logtransform=TRUE)
# arrayQualityMetrics of AEsetnorm (normalized data of E-GEOD-3419) - rma normalization convert expression value into log-transformed expression value, so we do ont input "do.logtransform".
arrayQualityMetrics(expressionset=AEsetnorm,outdir="Report_for_nE-GEOD-3419",force=TRUE)
Comparison AEset with AEsetnorm
- These tables showed change of outlier arrays before and after normalization.
- Through these tables and plots, I identified outlier arrays (#12 and #16 array) and analyzed DEGs in all arrays and subsets removed outlier arrays through limma
# DEGs analysis (limma) at AEsetnorm dataset removed outlier arrays
library(Biobase)
library(limma)
fvarLabels(AEsetnorm)<-make.names(fvarLabels(AEsetnorm))
gsms<-"01000101110X110X" # first X is array # 12 and second X is array #16
sml<-c()
for(i in 1:nchar(gsms)) { sml[i]<-substr(gsms,i,i) }
sel<-which(sml !="X")
sml<-sml[sel]
gset<-AEsetnorm[,sel]
sml<-paste("G",sml,sep="")
fl<-as.factor(sml)
gset$description<-fl
design<-model.matrix(~ description + 0,gset)
colnames(design)<-levels(fl)
fit<-lmFit(gset,design)
cont.matrix<-makeContrasts(G1-G0, levels=design)
fit2<-contrasts.fit(fit,cont.matrix)
fit2<-eBayes(fit2, 0.01)
tT<-topTable(fit2, adjust="fdr",sort.by="B",number=250)
class(tT)
'data.frame': 250 obs. of 6 variables:
$ logFC : num 1.58 -1.6 -2.16 1.67 -3.12 ...
$ AveExpr : num 5.95 7.35 8.25 8.52 8.74 ...
$ t : num 12.49 -9.36 -9.1 9.06 -8.87 ...
$ P.Value : num 1.87e-09 9.66e-08 1.40e-07 1.48e-07 1.96e-07 ...
$ adj.P.Val: num 4.18e-05 8.25e-04 8.25e-04 8.25e-04 8.74e-04 ...
$ B : num 10.62 7.62 7.31 7.27 7.04 ...
tT<-subset(tT,select=c("AveExpr","adj.P.Val","P.Value","t","B","logFC"))
write.table(tT,file="GSE3419.txt",row.names=T,sep="\t")
# DEGs analysis (limma) at AEsetnorm dataset (all arrays - 16 arrays)
fvarLabels(AEsetnorm)<-make.names(fvarLabels(AEsetnorm))
gsms_1<-"0100010111001100"
sml_1<-c()
for(i in 1:nchar(gsms_1)) { sml_1[i]<-substr(gsms_1,i,i) }
sml_1<-paste("G",sml_1,sep="")
fl<-as.factor(sml_1)
AEsetnorm$description<-fl
design<-model.matrix(~ description + 0,AEsetnorm)
colnames(design)<-levels(fl)
fit1<-lmFit(AEsetnorm,design)
cont.matrix<-makeContrasts(G1-G0,levels=design)
fit2<-contrasts.fit(fit1,cont.matrix)
fit2<-eBayes(fit2,0.01)
tT<-topTable(fit2,adjust="fdr",sort.by="B",number=250)
tT<-subset(tT,select=c("AveExpr","adj.P.Val","P.Value","t","B","logFC"))
write.table(tT,file="GSE3419_2.txt",row.names=T,sep="\t")
Venndiagram of 2-fold change DEGs between AEsetnorm dataset removed outlier arrays and AEsetnorm all arrays
I identified the clear difference before and after outlier removal. In conclusion, I think that outlier removal is pivotal process to find DEGs correctly.
Thank you majuang66,helpful and nice Tutorial