Hello,
I've read a lot of posts about integrating microarray datasets but couldn't find a clear answer for this specific scenario. I have 4 microarray datasets (GSE accessions) that were generated on different platforms. Their biological groups are as follows:
- GSE1: TypeA, TypeB, TypeC, TypeD
- GSE2: TypeA, TypeE
- GSE3: TypeE, TypeF
- GSE4: TypeF, TypeG
As you can see, each dataset has one biological group in common with the next dataset. (There would be no way to distinguish batch effect from biological effect if there were no groups in common.)
I can't analyse each dataset separately because, for example, I want to perform a differential expression analysis between TypeB and TypeF. So I need to merge all these datasets into one.
Here are my questions:
- Is it possible to merge these datasets into one? (considering I want to perform a differential expression analysis (DEA))
- Should I normalize them before merging? If yes, what normalization techniques should I use?
- How do I merge them and remove the batch effect? (for example, if I want to use limma for DEA)
Any help would be greatly appreciated.
Thanks in advance
Don't do that. Your results will be confounded because the datasets come from different platforms. You can only remove a batch effect if every group has replicates in every batch, which apparently is not the case here, so there is no way to identify the batch effect and distinguish it from the biological effect. There are limits to what data analysis can and cannot do. Combining independent datasets at will while still producing valid results is, imho, not something it can do. What you ask most likely cannot be done with the available data. I know it is frustrating; I have had the same issue many times: high-quality data available for download but not suited for what I wanted to do with them. Forcing them into the wrong analysis would probably only produce artefactual results without deeper meaning.
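To make the confounding concrete, here is a minimal sketch that only tabulates the layout given in the question (no expression data involved). The helper names `datasets_containing` and `shared_datasets` are made up for illustration. It checks which groups were ever measured side by side in the same dataset:

```python
from itertools import combinations

# Study layout exactly as listed in the question (dataset -> biological groups).
design = {
    "GSE1": {"TypeA", "TypeB", "TypeC", "TypeD"},
    "GSE2": {"TypeA", "TypeE"},
    "GSE3": {"TypeE", "TypeF"},
    "GSE4": {"TypeF", "TypeG"},
}

def datasets_containing(group):
    """Return the sorted list of datasets in which a group was measured."""
    return sorted(name for name, groups in design.items() if group in groups)

def shared_datasets(group_1, group_2):
    """Return datasets containing both groups (where a within-batch comparison exists)."""
    return sorted(set(datasets_containing(group_1)) & set(datasets_containing(group_2)))

all_groups = sorted(set().union(*design.values()))
all_pairs = list(combinations(all_groups, 2))

# Pairs of groups that are never measured together in the same dataset.
never_together = [(a, b) for a, b in all_pairs if not shared_datasets(a, b)]

print("TypeB measured in:", datasets_containing("TypeB"))
print("TypeF measured in:", datasets_containing("TypeF"))
print("Datasets with both TypeB and TypeF:", shared_datasets("TypeB", "TypeF"))
print(len(never_together), "of", len(all_pairs), "group pairs never share a dataset")
```

TypeB and TypeF never appear in the same dataset, so any TypeB vs TypeF difference in a merged matrix coincides with a difference in dataset and platform.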
Thanks for your reply. Wouldn't having a biological group in common be of any help?
No, the differences in platform are generally impossible to overcome in any meaningful way in this case. If you had the same samples on each and every platform, then maybe you could get something that's mildly believable after a lot of hassle, but as @ATpoint said, even a case such as that would be looked at with a lot of skepticism.
Thanks for your reply. Another question: for example, I want to check whether a gene is significantly upregulated from TypeB to TypeE. I perform DEA on GSE1 for TypeB vs TypeA and a separate DEA on GSE2 for TypeA vs TypeE. If there is an upregulation from TypeB to TypeA and an upregulation from TypeA to TypeE, could I conclude that there is a significant upregulation from TypeB to TypeE?
You could maybe use it to justify doing qPCR to validate that, but realistically? No. Definitely not with any valid statistical backing.
If the platforms were the same, would this analysis be possible to perform? (Based on the groups I mentioned.)
Sure, if the samples were all run on the same platform, that'd eliminate the majority of your issues. You could compare the samples any which way you might want. You might still have to deal with batch effects, but there are established methods to help deal with that.
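As a rough illustration of why batch handling works when the design allows it, here is a toy sketch with made-up numbers for a single gene. It assumes all samples come from the same platform, that every batch contains both groups being compared, and that the batch effect is a constant shift. The difference is estimated inside each batch and then averaged, so the shift cancels. This is not what any particular package does internally; with limma the usual route is to include batch as a term in the design matrix.

```python
from statistics import mean

# Hypothetical log2 expression values for one gene (made-up numbers).
# Each batch contains BOTH groups, which is what makes the batch shift separable.
samples = [
    # (batch, group, log2_expression)
    ("batch1", "TypeA", 5.0), ("batch1", "TypeA", 5.2),
    ("batch1", "TypeB", 6.0), ("batch1", "TypeB", 6.2),
    ("batch2", "TypeA", 7.0), ("batch2", "TypeA", 7.2),  # batch2 is shifted by +2
    ("batch2", "TypeB", 8.0), ("batch2", "TypeB", 8.2),
]

def group_mean(batch, group):
    """Mean expression of one group within one batch."""
    return mean(v for b, g, v in samples if b == batch and g == group)

batches = sorted({b for b, _, _ in samples})

# Within-batch differences: the constant batch shift cancels in each one.
within_batch = {b: group_mean(b, "TypeB") - group_mean(b, "TypeA") for b in batches}
adjusted = mean(within_batch.values())

# Confounded comparison: TypeB taken only from batch2, TypeA only from batch1.
# The result mixes the biological difference with the batch shift.
confounded = group_mean("batch2", "TypeB") - group_mean("batch1", "TypeA")

print("within-batch differences:", within_batch)
print("batch-adjusted estimate:", round(adjusted, 3))
print("confounded estimate:", round(confounded, 3))
```

The confounded estimate is what you get when the two groups only exist in different batches: there is no within-batch difference to fall back on.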
Not necessarily. I am not much of a microarray guy, but in RNA-seq, from what I've seen myself, results are strongly confounded by the library preparation method even if run on the same Illumina platform. I guess in the array world you also have choices on how you isolate RNA, how you make cDNA, and how you do the PCR enrichment. That comes down to the same problem: not having replicates to separate batch from true effect leaves a lot of uncertainty that might skew your analysis.
Can I consider these platforms as the same: Affymetrix Human Genome U133A+U133B and Affymetrix Human Genome U133 plus?
It doesn't matter:
As I said before, I personally vote for not doing what you want to do because you cannot control the batch effect. It is not only the platform but also the library prep, RNA extraction, etc. And no, they probably are not the same, otherwise the company would not have released them as three products.