Question

Using R To Perform Statistical Tests On Microarray Data And Cluster The Results

6

Entering edit mode

14.3 years ago

Walter Jessen ▴ 90

In the past, I've used GeneSpring to analyze microarray data. Now I'm teaching myself how to do it all (normalization, statistical testing and clustering) with R. I've worked out everything up to the very end and have several specific questions. Also, please let me know if there's an easier or better way to do something.

I started off with a simple experiment: 25 CEL files - 9 normal samples, 16 Parkinson's disease samples.

http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE7621

My goal is to RMA the data, perform a t-test and cluster the results at a given statistical cutoff.

Here's the code I've been using:

library(simpleaffy)
raw.data<-read.affy("sampledesc.txt")
rma.data<-call.exprs(raw.data,"rma")
results<-pairwise.comparison(rma.data,"class",c("normal","pd"),raw.data)
results.f<-pairwise.filter(results,tt=0.1,fc=1)
hmap.pc(results.f,rma.data,col="bwr",scluster=standard.pearson,pcluster=standard.pearson))

I can't figure out how to cluster the samples using euclidian distance. I can't find any documentation on the different types of metrics available or how to call them (scluster=standard.euclidian doesn't work). Also, I'd like to reference the expression changes to the normal samples such that the normals all show expression around 0 and the PD samples reflect changes up or down from normal. I'm not sure how to do this in R.

For a more complex data set, I couldn't figure out how to use the genefilter package to do an anova (hence the use of the stats package). For this example, I'm using an experiment with 48 CEL files (3 samples at each dose and day) divided into 4 doses (control, doses 1-3) collected on 4 days.

http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18753

My goal is to RMA the data, perform a one-way anova to identify genes differentially expressed between days and/or doses, and cluster the results at a given statistical cutoff.

Here's the code I've been using:

library(affy)
library(simpleaffy)
library(stats)
pd<-read.AnnotatedDataFrame("sampledesc.txt",sep="\t",header=T)
raw.data<-ReadAffy(filenames=pData(pd)$FileName,phenoData=pd)
rma.data<-call.exprs(raw.data,"rma")
lm.coef<-function(y) lm(y ~ factor(Day) * factor(Dose), data=pData(pd))$coefficients
eff<-esApply(rma.data,1,lm.coef)

Everything works up to this point. However, I can't figure out how to filter the anova results (say, to only look at genes <0.001) and cluster them in a heat map.

Any examples or advice would be greatly appreciated!

r affymetrix gene heatmap • 8.8k views

ADD COMMENT • link updated 9.4 years ago by Biostar 20 • written 14.3 years ago by Walter Jessen ▴ 90

score 6 · Answer 1 · 2011-05-02

The standard dist() function in R uses Euclidian distance by default; to cluster the columns of a matrix, you could use

 d = dist( t(M) ) 
 h = hclust(d)
 plot(h)

For your second question, it sounds like what you want to do is standardize your data set against one set ("normal" samples) on a gene-by-gene basis. I suppose you could calculate the means of just those samples and then mean-center all samples against that mean. To do this with rowMeans, assuming the first 9 samples are your normals, do something like

normal.means = rowMeans(M[,1:9])
M.normals.standardized = M - normal.means

For log2 data, this would convert your results to log change from mean normal; if your normals are homogenous, that should bend the plot the way you want. You would have to explain pretty carefully what you did.

For your filtering question, you can grab the p-values out of a linear model with the summary() function. Example:

x = c(1,2,3,4,5)
y = c(1,2,5,7,5)
summary(lm(x~y))$coefficients[2,4]  # 0.07, P value of fit for Y

You should be able to adapt that to filter a more complex model. Remember to think about multiple testing issues.