Hi everyone,
I'm working on a meta-analysis involving more than 60 RNA-seq and NanoString array datasets with normalized gene expression values. However, the normalization methods vary: some use TPM, others FPKM or even CPM.
Currently, I'm merging all these datasets into a single expression matrix and applying quantile normalization on the merged matrix before performing DGE analysis.
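For concreteness, a minimal sketch of what quantile normalization on a merged matrix does (the data and column names here are illustrative; real pipelines would typically use `preprocessCore::normalize.quantiles` or `limma::normalizeQuantiles` in R):

```python
import numpy as np
import pandas as pd

# Hypothetical merged genes x samples matrix with mixed units (TPM/FPKM/CPM).
rng = np.random.default_rng(0)
merged = pd.DataFrame(
    rng.lognormal(mean=2.0, sigma=1.0, size=(6, 4)),
    index=[f"gene_{i}" for i in range(6)],
    columns=["tpm_s1", "tpm_s2", "fpkm_s1", "cpm_s1"],
)

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Force every column onto the same empirical distribution:
    the row-wise mean of the column-sorted values."""
    ref = np.sort(df.to_numpy(), axis=0).mean(axis=1)   # reference distribution
    ranks = df.rank(method="first").astype(int).to_numpy() - 1
    return pd.DataFrame(ref[ranks], index=df.index, columns=df.columns)

qn = quantile_normalize(merged)
# After this, every column has identical sorted values -- which is exactly
# why it is debatable whether it can fix cross-platform differences.
```

Note that this forces all samples onto one distribution regardless of whether the underlying units (TPM vs. FPKM vs. CPM) or platforms are comparable to begin with.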
Is this approach scientifically sound, or should I consider a different approach for integrating these datasets?
I also have access to the FASTQ files, so I could reprocess the data from scratch if that is the more appropriate approach.
Thanks for your help in advance!
Isn't the point of a meta-analysis to analyze each dataset individually, precisely because of the heterogeneity across datasets, and then combine the results into a meta-ranking that reveals the consistent, and therefore reproducible, trends? I would never combine such heterogeneous data into a single quantitative analysis.
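The per-dataset-then-combine strategy described above can be sketched as follows. This is only an illustration with simulated p-values; in practice each dataset would first be analyzed on its own (e.g. with limma or DESeq2), and Fisher's method is just one of several ways to combine the resulting per-gene p-values:

```python
import numpy as np
from scipy.stats import combine_pvalues

rng = np.random.default_rng(0)

# Hypothetical per-dataset DGE results: one p-value per gene per dataset,
# simulated here in place of real limma/DESeq2 output.
n_genes, n_datasets = 5, 3
genes = [f"gene_{i}" for i in range(n_genes)]
pvals = rng.uniform(0.001, 0.9, size=(n_genes, n_datasets))

# Combine each gene's per-dataset p-values with Fisher's method,
# then rank genes by the combined evidence (the "meta-ranking").
combined = {}
for i, g in enumerate(genes):
    _stat, p = combine_pvalues(pvals[i], method="fisher")
    combined[g] = p

meta_ranking = sorted(combined, key=combined.get)
```

Each dataset is modeled on its own terms, and only the summary statistics are merged, so the raw expression values never need to be quantitatively comparable across platforms.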
Could you please elaborate on why you would never combine them into a single table? There are several papers that do this, e.g. this one, which states: "Gene expression data from all eligible datasets were combined into a single table, quantile normalized and scaled to 1000."
Because the data are heterogeneous, on entirely different scales, assay (vastly) different numbers of genes, and are inherently incomparable quantitatively. Just because a paper does something does not mean it's good or meaningful. That's my take on it; you are free to disagree.