Question

Is it acceptable to combine RNA-seq datasets?

6 months ago

imraandixon ▴ 10

Hi,

To get some preliminary data I would like to look at online databases like GEO for RNA-seq data. If I combine different datasets to increase the sample size, would this be okay? For instance, if I wanted to compare expression between different brain regions, could I combine datasets obtained from the surrounding tissue resected during epilepsy surgeries?

If I can, what are the caveats, and what measures would I need to take to ensure the results are reliable? Can I work from the raw reads or the count data? Otherwise, is it just not worth it to combine data like that?

Thanks in advance!

RNA-seq Data-mining • 409 views


score 4 · Accepted Answer · 2024-06-19

The general answer is that you typically would not combine independent datasets in a single quantitative analysis. Each dataset is only valid within itself, because RNA-seq is a relative rather than an absolute measure, and dataset-specific batch effects would overshadow the true biological signal.

Generally, what you could do is analyze each dataset separately (if that is applicable here) and then run some sort of meta-analysis to see whether the individual findings (even if they come from underpowered individual studies) hold true across studies.
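To make the "analyze separately, then meta-analyze" idea concrete, here is a minimal sketch of one common way to pool results: inverse-variance (fixed-effect) weighting of the per-study log2 fold changes for a single gene. This is just one option and not necessarily the best one for your data. The function name `combine_fixed_effect` and all the numbers are made up for illustration; in practice the estimates and standard errors would come from your per-dataset differential expression results.

```python
import math


def combine_fixed_effect(log2fcs, std_errors):
    """Pool per-study log2 fold changes for one gene (fixed-effect model).

    Each study is weighted by 1 / SE^2, so more precise studies count more.
    Returns (pooled estimate, pooled SE, z-score, two-sided p-value).
    """
    if not log2fcs or len(log2fcs) != len(std_errors):
        raise ValueError("need exactly one standard error per effect size")
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * x for w, x in zip(weights, log2fcs)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p_two_sided


# Hypothetical results for one gene from three independently analyzed datasets.
study_log2fc = [0.8, 1.1, 0.6]
study_se = [0.5, 0.6, 0.4]

pooled, pooled_se, z, p = combine_fixed_effect(study_log2fc, study_se)
print(f"pooled log2FC = {pooled:.3f} (SE {pooled_se:.3f}), z = {z:.2f}, p = {p:.4f}")
```

Note that a fixed-effect model assumes all studies estimate the same underlying effect. If the studies differ a lot (different cohorts, tissues, protocols), a random-effects model or a simple check of whether the direction of change agrees across studies may be more appropriate.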

Very generally speaking, what you can quantitatively compare is data that were produced in the same lab, at the same time, in the same batch, with the same kit, setup, sequencing... (everything), and then processed identically in silico. Everything else, even if people often ignore this, has a high chance of yielding many "significant" hits that are entirely technical/batch-driven and have no biological meaning. That is actually a basic property of any experiment in any context, but it is so often ignored by researchers that it's worrying.

Classical example: control on day 1 and treatment on day 2, followed by quantitative analysis. Any result can be entirely technical, and the fact that the one gene you know must be differential shows up among the top hits does not qualify as a validation or as a test of the robustness of the assay. It merely tells you that the experiment was not done completely and absolutely terribly wrong, but that's it.
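A quick way to spot the day 1 / day 2 problem before running any statistics is to cross-tabulate condition against batch from the sample sheet. The sketch below is a simplified check with hypothetical helper names (`conditions_by_batch`, `is_fully_confounded`) and made-up sample labels. It only flags the extreme case where no batch contains more than one condition, so that batch and condition cannot be separated at all.

```python
from collections import defaultdict


def conditions_by_batch(samples):
    """Map each batch to the set of conditions observed in it.

    `samples` is an iterable of (condition, batch) pairs, one per sample.
    """
    table = defaultdict(set)
    for condition, batch in samples:
        table[batch].add(condition)
    return dict(table)


def is_fully_confounded(samples):
    """True if no batch contains more than one condition."""
    table = conditions_by_batch(samples)
    if not table:
        raise ValueError("no samples given")
    return all(len(conditions) == 1 for conditions in table.values())


# The classical bad design: all controls on day 1, all treatments on day 2.
confounded_design = [("control", "day1")] * 3 + [("treatment", "day2")] * 3

# A design where each day contains both conditions.
balanced_design = [
    ("control", "day1"), ("treatment", "day1"),
    ("control", "day2"), ("treatment", "day2"),
]

print(is_fully_confounded(confounded_design))
print(is_fully_confounded(balanced_design))
```

When you pull datasets from GEO, "batch" is effectively the dataset itself (plus whatever batches exist inside it), so if each brain region comes from a different dataset, this check will flag the combined design as fully confounded.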