Hi! I'm new to bioinformatics and I'm working with 6 different high-throughput RNA-seq studies from GEO: 3 of the GSEs contain gene expression for tumors and the other 3 contain healthy tissue.
I'm going to do batch correction, and I'm wondering: do I merge all the datasets together first and then do normalization and batch correction on everything at once? Or do I merge the 3 tumor GSEs and do normalization and batch correction on that merged set, separately merge the 3 healthy-tissue GSEs and do normalization/batch correction there, and only then combine them all, if that makes sense?
You must make sure that all of the datasets follow the same protocol and library preparation steps. Only then can you apply batch correction. Otherwise your data will give completely unreliable output, and you may never even realize it.
No, you can't. You cannot randomly collect datasets and expect some statistical magic to make them comparable. You need identical wet-lab processing for a fair comparison; otherwise batch effects obscure the results. And here you cannot correct for them at all, because each batch (= each study) is completely confounded with the condition (tumor/normal). This is a very common problem, and the only way around it is to either find a study that produced cases and controls in one go, or generate the data yourself with a proper study design. With the data above you have a fully confounded design, and there is nothing you can do about it.
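If it helps to see what "fully confounded" means in practice, here is a minimal Python sketch (the GSE labels and sample counts are made up, not your actual studies) that cross-tabulates batch against condition. When every study contributes only one condition, each row of the table has a zero in one column, and no batch-correction method can separate the batch effect from the biology.

```python
# Minimal sketch of checking batch/condition confounding before attempting
# correction. The GSE accessions below are placeholders, not real studies.
import pandas as pd

# One row per sample: which study (batch) it came from and its condition.
samples = pd.DataFrame({
    "batch":     ["GSE_T1"] * 4 + ["GSE_T2"] * 4 + ["GSE_T3"] * 4
               + ["GSE_N1"] * 4 + ["GSE_N2"] * 4 + ["GSE_N3"] * 4,
    "condition": ["tumor"] * 12 + ["normal"] * 12,
})

# Cross-tabulate batch against condition. If every batch contains only one
# condition (a zero in every row), batch and condition are fully confounded:
# any "batch" term in a model absorbs the tumor/normal difference as well.
print(pd.crosstab(samples["batch"], samples["condition"]))
```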
Oh, I see you asked this before and the answer was the same:
Yeah, that limits things a lot, unfortunately. I would discuss with your supervisor and narrow the project to something for which proper data exist and that has not been shown before. After all, a computational analysis is just a starting point and needs experimental validation. If you start with batch-confounded data, it is very likely that you stack up uncertain results and in the end you might just be investigating batch effects rather than genuine biology. There is probably no magical way to make the data you need usable, simply because cases and controls always come from different batches. It is what it is. Be careful not to spend too much time on suboptimal data; rather, try to adjust the topic if possible.