Combining Gene Expression Microarray Datasets
14.0 years ago
Saman ▴ 260

Hi, I am trying to combine several microarray datasets downloaded from GEO, all generated on the same platform (GPL96) and normalized with the same algorithm (RMA). I thought these similarities would make them statistically comparable, but it seems I was wrong.

A simple hierarchical clustering based on Euclidean distance shows that the samples from each dataset cluster together!

I have read about methods such as DWD (Distance Weighted Discrimination) for combining datasets, but I am having a hard time using it, mainly because it doesn't have an R implementation.

Any suggestions here?

Thanks in advance

--Saman

microarray gene data meta • 13k views

I think you should not use Euclidean distance in this case. Pearson-based distances would be a better choice.


I am not quite sure what you mean. Not using Euclidean distance for what?


For the clustering. You are using Euclidean distance for the clustering, but there are other possible ways to measure the distance between two profiles. See the Wikipedia article on "Euclidean distance" for more details.
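To illustrate the difference between the two distances, here is a minimal sketch in plain Python. The expression values are made up, and the "batch effect" is assumed to be a constant shift applied to a whole profile, which is the simplest possible case. Real batch effects are often gene-specific, so a correlation-based distance will not necessarily remove them.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation; insensitive to a shift or rescaling of a profile."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)

# Made-up log2 expression values for five probe sets in one sample.
profile = [7.1, 8.4, 5.9, 10.2, 6.5]
# The same profile with a constant offset (assumed, idealized batch shift).
shifted = [v + 1.5 for v in profile]

print(euclidean_distance(profile, shifted))  # clearly non-zero (about 3.35)
print(pearson_distance(profile, shifted))    # essentially zero
```

With Euclidean distance the shifted copy looks like a different sample, while the Pearson-based distance treats the two profiles as identical in shape.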


Thanks to both of you, I had already forgotten my post! So you mean that if I use Pearson correlation as the distance, I wouldn't see that effect? I can check that. I will let you know whether it makes a difference or not.


Can I take some CEL files for disease1 from experiment1 and some CEL files for the same disease1 from experiment2, and then normalize the raw data together? Is that a good idea?


Hi Saman

I am keen to combine multiple GEO datasets (all run on Affymetrix U133 plus 2.0) and came across your thread. I was wondering what approach you ended up using in order to combine your datasets? I would appreciate any help.

Thanks

14.0 years ago

If a lab generates 100 aliquots of RNA from 100 subjects and runs the same aliquots four months apart at the same core facility, I would be unsurprised to see the two runs cluster separately. Batch effects are introduced even with that level of replication; if you take two different experiments, run by two different labs, and do not renormalize the data, it would be very surprising if you didn't see them.

To start from a more basic point:

You haven't said anything about the experiments you're using as raw data. Are the experiments purportedly measuring the same thing? (e.g. lung adenocarcinomas from early stage tumors, mouse skin treated with UV radiation, whatever) This is the biggest issue. There may be very good biological reasons why the experiments cluster separately, even aside from technical batch effects. Combining other people's data without studying the individual data sets and knowing something about the biological context can be very misleading. I'm not assuming that is what you are doing, but you haven't said anything about this.

As a practical suggestion, I would renormalize the combined datasets together from the CEL files and use a tool such as ComBat to adjust for the known between-experiment batch effects. If you don't have the CEL files, I suggest that you at least use ComBat.
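To give a rough idea of what a batch adjustment does, here is a minimal sketch in plain Python. It is not ComBat: ComBat additionally uses empirical Bayes shrinkage of the batch parameters and can take biological covariates into account. This sketch only standardizes one gene within each batch and maps it back to the overall mean and pooled within-batch standard deviation. The function name, values, and batch labels are all made up for illustration.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def location_scale_adjust(values, batches):
    """Adjust one gene's values so every batch shares the same mean and spread.

    values  -- log2 expression of ONE gene across all samples
    batches -- batch label of each sample (same order as values)
    Assumes every batch has at least two samples.
    """
    groups = defaultdict(list)
    for index, label in enumerate(batches):
        groups[label].append(index)

    grand_mean = mean(values)

    # Pooled within-batch standard deviation of this gene.
    sum_sq = 0.0
    for indices in groups.values():
        batch_vals = [values[i] for i in indices]
        batch_mean = mean(batch_vals)
        sum_sq += sum((v - batch_mean) ** 2 for v in batch_vals)
    pooled_sd = math.sqrt(sum_sq / (len(values) - len(groups)))

    adjusted = [0.0] * len(values)
    for indices in groups.values():
        batch_vals = [values[i] for i in indices]
        batch_mean = mean(batch_vals)
        batch_sd = stdev(batch_vals)
        for i in indices:
            # Standardize within the batch, then rescale to the common scale.
            z = (values[i] - batch_mean) / batch_sd if batch_sd > 0 else 0.0
            adjusted[i] = grand_mean + z * pooled_sd
    return adjusted

# Made-up example: batch B sits about 4 log2 units above batch A.
values = [5.0, 6.0, 7.0, 9.0, 10.0, 11.0]
batches = ["A", "A", "A", "B", "B", "B"]
print(location_scale_adjust(values, batches))
```

After the adjustment both batches are centered on the same mean. Note that any real biological difference that is confounded with the batch is removed as well, which is why knowing the biological context of each experiment matters.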


I'm not sure that normalising these together is a great idea, though. I think you should normalise them separately and then use an appropriate meta-analysis method to analyse them. At this level, you probably don't even need to use ComBat: you can treat them as combined but separate experiments, rather than attempting to push them all through one giant normalisation/batch-effect-removal step.


Thanks for your fast response. All the microarray samples come from breast cancer patients with more or less the same conditions. My main purpose is to learn a better model by using a wider range of training samples. I actually downloaded the raw files for each dataset and normalized them separately. I thought, and still think, that normalizing different datasets together is not a good idea, aside from the problem that normalizing 1000 samples in R needs more than 8 GB of memory! Is there any reason to believe that normalizing them together is a good idea?

Thanks again


Saman, there is already a discussion about CEL file normalisation with large numbers of chips here: Normalize Large Number Of Cel Files


Thanks. I read it; the main issue in that thread is the memory limitation. My main concern is the validity of normalizing several datasets together. Are there any references on this?


Another approach would be to replace "normalize together" with "median-center the datasets".
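As a minimal sketch of what median-centering could look like, assuming each dataset is held as a mapping from probe set to its log2 values across the samples of that dataset (the probe names and numbers below are made up):

```python
from statistics import median

def median_center(dataset):
    """Subtract each probe set's within-dataset median from its values.

    dataset -- dict mapping probe set name to a list of log2 values,
               one value per sample, for ONE dataset only
    """
    centered = {}
    for probe, vals in dataset.items():
        probe_median = median(vals)
        centered[probe] = [v - probe_median for v in vals]
    return centered

# Two made-up datasets in which the same probe set sits at different levels.
study1 = {"probe_1": [8.0, 9.0, 10.0], "probe_2": [12.0, 12.5, 13.0]}
study2 = {"probe_1": [10.5, 11.5, 12.5], "probe_2": [11.0, 11.5, 12.0]}

centered1 = median_center(study1)
centered2 = median_center(study2)

# Only after centering each dataset on its own are the samples concatenated.
combined = {p: centered1[p] + centered2[p] for p in centered1}
print(combined["probe_1"])
```

Each dataset is centered separately before the samples are pooled, so a constant per-probe offset between studies disappears. This only handles location shifts; it does nothing about differences in variance between datasets.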


I have seen some studies that tried median-centering the data in every possible order (whole dataset first, then each gene separately, then the dataset again, ...). The bottom line was that there was no improvement.

I have tried converting each gene/probe set to a z-score within each dataset and observed that it makes no real difference to prediction accuracy.

14.0 years ago
User 59 13k

First of all, I endorse David's answer entirely. ComBat.R is the R implementation you want to use to remove dataset bias in this case.

The paper claims that the DWD approach allows you to combine datasets, but what it really does is adjust for systematic bias, which is much the same thing ComBat.R does. I realise the authors of the paper argue that you can combine different array platforms using this technique, but it doesn't look like a traditional meta-analysis approach.

Combining ComBat.R with a dedicated meta-analysis package from the Bioconductor arsenal may be the way to go: GeneMeta, metaArray or RankProd might suit you.
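To show the general idea behind a meta-analysis, as opposed to merging the expression matrices, here is a minimal sketch of textbook fixed-effect, inverse-variance pooling for a single gene. It is not a description of what GeneMeta, metaArray or RankProd do internally; those packages implement more elaborate methods. The function name, effect sizes and variances are made up.

```python
import math

def fixed_effect_meta(effects, variances):
    """Pool per-study effect sizes for one gene by inverse-variance weighting.

    effects   -- effect size of the gene in each study (e.g. log2 fold change)
    variances -- estimated variance of each of those effect sizes
    Returns (pooled effect, its standard error, z-score).
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return pooled, standard_error, pooled / standard_error

# Made-up numbers: one gene measured in three independent studies.
effects = [0.8, 1.2, 1.0]
variances = [0.04, 0.04, 0.02]

pooled, standard_error, z = fixed_effect_meta(effects, variances)
print(pooled, standard_error, z)
```

Each study is analysed on its own, and only the per-study summaries are combined, with more precise studies weighted more heavily. The raw expression values from different studies are never placed on a common scale.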


I have seen ComBat.R, but for some reason I didn't try it. I will try it and let you know how it works. Thanks.

11.1 years ago
avi4you ▴ 20

Hello everyone, I am a master's student in genetics working on differential gene expression in chickens infected with avian influenza virus, which we measured with microarrays. I want to analyze the raw data of two different microarray datasets available from public databases and compare them with my own data. As a beginner, I don't know how to normalize the raw data so that the datasets can be compared, or how to deal with batch effects. Can anyone help me with this? Thank you.

13.9 years ago
Timtico ▴ 330

Or one could use an internal control. Calculate ratios against genes that you know should be expressed equally in all of the datasets?


I think actually picking something sensible as a housekeeping baseline is very hard indeed. Every time I look at a classic 'housekeeping' gene in a microarray dataset, I'm surprised just how variable they can be.


That can indeed be an issue, but on our arrays the actin levels are equal in many cases, so we can express the level of a gene as a percentage of actin (%-actin) when comparing different arrays.
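A minimal sketch of such a %-actin calculation, assuming the array values are on the log2 scale (as RMA output is), so that a ratio becomes a difference of log2 values. The function name and the numbers are made up, and the result is only as trustworthy as the assumption that the control gene is stable across arrays.

```python
def percent_of_control(log2_gene, log2_control):
    """Express a gene's signal as a percentage of a control gene's signal.

    Both inputs are log2 intensities from the SAME array, so the ratio on
    the linear scale is 2 ** (difference of the log2 values).
    """
    return 100.0 * 2.0 ** (log2_gene - log2_control)

# Made-up values: array 2 is globally about 1 log2 unit brighter than array 1.
array1 = {"gene": 9.0, "actin": 12.0}
array2 = {"gene": 10.0, "actin": 13.0}

print(percent_of_control(array1["gene"], array1["actin"]))  # 12.5
print(percent_of_control(array2["gene"], array2["actin"]))  # 12.5
```

Because the overall brightness difference affects the gene and the control equally, both arrays give the same %-actin value even though the raw signals differ.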
