Hi, all
I have a question about microarray data processing. Here is what I have done:
We used the Illumina HumanHT-12 microarray to profile gene expression changes in ~100 samples. After normalizing with the lumi R package (background adjusted, variance stabilized, and normalized with "ssn"), I randomly selected two sets of genes (about 100 genes in each set) from the data matrix (15301 genes; 141 samples), took the median of each gene set within each sample, and then plotted the per-sample medians of one set against those of the other. To my surprise, I found a correlation between the two randomly selected gene sets. Could anyone explain this?
# dat is the expression matrix (genes in rows, samples in columns)
# generate two random sets of row indices
set1.index <- sample(seq_len(nrow(dat)), 100)
set2.index <- sample(seq_len(nrow(dat)), 100)
set1.dat <- dat[set1.index, ]
set2.dat <- dat[set2.index, ]
# take the median over the genes of each set, separately for every sample
set1.med <- apply(set1.dat, 2, median)
set2.med <- apply(set2.dat, 2, median)
# combine into a 2 x n.samples matrix for plotting
medi.dat <- rbind(set1.med, set2.med)
# plot one set's per-sample medians against the other's
plot(medi.dat[1, ], medi.dat[2, ])
Many thanks.
Thanks for your reply, Devon Ryan. I think you mean that the 100 randomly selected genes can represent the whole set of ~20000 genes. That is reasonable, but it is not always the case: I have tried the code in this post on other independent data sets, and in some cases it shows no correlation at all.
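To make the "random genes represent the whole" idea concrete, here is a minimal simulated sketch, not the real data. Everything in it is an assumption for illustration: the matrix sizes, the noise levels, the hypothetical per-sample offset `sample.shift`, and the helper `median.cor`. It only shows that when every gene in a sample shares a sample-wide shift, the medians of two random gene sets track each other across samples, and that they need not when no such shift exists.

```r
set.seed(1)
n.genes   <- 2000   # assumed sizes, much smaller than the real matrix
n.samples <- 50

# per-gene baseline expression plus independent gene-by-sample noise
gene.base <- rnorm(n.genes, mean = 8, sd = 1)
noise     <- matrix(rnorm(n.genes * n.samples, sd = 0.3), n.genes, n.samples)

# hypothetical sample-wide offset shared by all genes of a sample
sample.shift <- rnorm(n.samples, sd = 0.5)

dat.noshift <- gene.base + noise                          # no shared sample effect
dat.shift   <- sweep(dat.noshift, 2, sample.shift, "+")   # shared sample effect added

# correlation between the per-sample medians of two random gene sets
median.cor <- function(dat, n = 100) {
  idx1 <- sample(seq_len(nrow(dat)), n)
  idx2 <- sample(seq_len(nrow(dat)), n)
  cor(apply(dat[idx1, ], 2, median), apply(dat[idx2, ], 2, median))
}

cor.shift   <- median.cor(dat.shift)
cor.noshift <- median.cor(dat.noshift)
print(c(with.shift = cor.shift, without.shift = cor.noshift))
```

In this toy setup the correlation is high with the shared shift and close to zero without it, which would be consistent with some data sets showing the effect and others not. Whether a sample-wide effect is what drives the real data is not something this sketch can tell.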
Actually, this question came out of an interesting hypothesis in my research. We observed that oxidative phosphorylation function was disturbed in our case samples, so we hypothesized that the expression profile of the oxidative phosphorylation (OXPHOS) genes must differ from the 'overall expression profile' (we use randomly selected genes to represent this overall expression profile, which is in accordance with your reply ~_~). To my surprise, in my data there is always a high correlation between the expression of the OXPHOS genes and the expression profile of randomly selected genes. So I checked this in other independent data sets, and it turns out that the phenomenon holds in some data sets but not in others.
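One way to frame the comparison is to ask whether the OXPHOS-versus-random correlation differs from what two random sets give against each other. The sketch below runs on simulated data only. The matrix, the sample-wide shift, the placeholder `oxphos.index` (the first 80 rows, not a real gene list), and the helper `set.profile` are all assumptions made for illustration.

```r
set.seed(2)
n.genes   <- 2000   # assumed sizes for the simulation
n.samples <- 50

# simulated expression matrix with a hypothetical sample-wide shift
dat <- matrix(rnorm(n.genes * n.samples, mean = 8), n.genes, n.samples)
dat <- sweep(dat, 2, rnorm(n.samples, sd = 0.5), "+")

# placeholder rows standing in for an OXPHOS gene list (hypothetical)
oxphos.index <- 1:80
rest.index   <- setdiff(seq_len(nrow(dat)), oxphos.index)

# per-sample median of a gene set
set.profile <- function(index) apply(dat[index, , drop = FALSE], 2, median)
oxphos.med  <- set.profile(oxphos.index)

# baseline: random set versus random set, repeated 200 times
null.cor <- replicate(200, cor(set.profile(sample(rest.index, 100)),
                               set.profile(sample(rest.index, 100))))

# comparison of interest: placeholder OXPHOS set versus random sets
obs.cor <- replicate(200, cor(oxphos.med,
                              set.profile(sample(rest.index, 100))))

print(summary(null.cor))
print(summary(obs.cor))
```

Here the placeholder set behaves like any other random set by construction, so the two distributions overlap. With real data, a high OXPHOS-versus-random correlation would only be informative if it departs from the random-versus-random baseline.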