Question

Should normalization of expression data be performed over the entire dataset or only over the used subset of data?

1

5.0 years ago

n,n ▴ 370

I have an expression matrix derived from an RNA-seq pipeline. The matrix contains about 50,000 genes (rows) and 10,000 samples (columns). I've been taking subsets of this data (1000 random samples) to perform some basic downstream analysis (mostly exploratory co-expression analysis). However, this is my first time working with expression data and I'm unsure how to handle data normalization.

I'm using quantile normalization as part of my pre-processing before calculating correlations and the like, but I'm currently normalizing after taking the subset of 1000 samples. I initially dismissed normalizing the whole dataset before subsetting without much thought, because the size of the matrix makes it computationally difficult. Recently, however, I've become bothered by the idea that it makes more sense to normalize the whole dataset first and then take my random sample (which is intended as a representative subset of the whole dataset).

Whether normalization should be done before or after subsetting may depend on the specific downstream analysis, but for my case normalizing after subsetting doesn't seem to make much sense. Still, I feel I need an expert opinion from someone who has worked with similar data before. Any help is greatly appreciated.

RNA-Seq quantile normalization co-expression • 2.6k views

updated 5.0 years ago by Gordon Smyth ★ 7.7k • written 5.0 years ago by n,n ▴ 370

0

IMO your question should be about sampling the data vs using the whole dataset. Why are you subsetting?

5.0 years ago by Mark ★ 1.6k

score 2 · Answer 1 · 2019-11-29

2

5.0 years ago

Gordon Smyth ★ 7.7k

In principle, you should normalize all the samples together. However, any random sample of 1000 columns will give almost the same row means (computed after sorting each column, which is the reference distribution that quantile normalization maps every sample onto) as the whole set of 10,000 columns. Hence quantile normalizing a random subset of 1000 columns will give almost the same result as if all 10,000 samples were quantile normalized together.

On the other hand, I don't see the computational issue. Quantile normalizing a 50,000 x 10,000 matrix takes only a couple of minutes on my PC:

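# Simulate a 50,000 x 10,000 matrix (about 4 GB stored as doubles)
x <- matrix(rnorm(50000*10000),50000,10000)
library(limma)
# Quantile normalize all columns (samples) together
y <- normalizeQuantiles(x)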
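
The claim that a random subset gives almost the same normalized values as the full dataset can be checked on a scaled-down simulation. The following is only a rough sketch under stated assumptions: the matrix dimensions, the sample-specific shifts, and the helper `quantile_normalize()` are all illustrative and not from the thread. The helper is a plain base-R version of quantile normalization without the tie handling that limma's `normalizeQuantiles()` applies by default, so it stands in for that function rather than reproducing it exactly.

```r
# Hypothetical base-R quantile normalization (no special treatment of ties).
quantile_normalize <- function(x) {
  # Reference distribution: mean of the k-th smallest value across columns.
  ref <- rowMeans(apply(x, 2, sort))
  # Rank of every value within its own column.
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Replace each value by the reference value at its rank.
  out <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(1)
n_genes <- 2000    # assumed, scaled down from 50,000
n_samples <- 500   # assumed, scaled down from 10,000
x <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples)
# Add an arbitrary shift per sample so that normalization has work to do.
x <- sweep(x, 2, rnorm(n_samples, sd = 0.5), "+")

# Random subset of samples (10% of the columns, as in the question).
idx <- sample(n_samples, 50)

# Option A: normalize everything, then keep the subset.
full_then_subset <- quantile_normalize(x)[, idx]
# Option B: keep the subset, then normalize it on its own.
subset_only <- quantile_normalize(x[, idx])

# Compare the two results.
max_abs_diff <- max(abs(full_then_subset - subset_only))
correlation <- cor(as.vector(full_then_subset), as.vector(subset_only))
print(c(max_abs_diff = max_abs_diff, correlation = correlation))
```

Within each sample the ranks are identical under both options, so any difference comes only from the two reference distributions. With simulated data like this they should differ mainly in the extreme tails, where the mean of an order statistic over 50 columns is noisier than over 500.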

0

Thank you for your advice. The real problem I have with the data is that I can't load it all at once: I get a memory error every time I try with read.table().

5.0 years ago by n,n ▴ 370

1

What version of R are you using? Make sure it's 64-bit. Also, use fread() from the data.table package to read in your file.
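
As a rough illustration of the fread() suggestion, the sketch below reads only a random subset of sample columns from a tab-separated file, which may be enough to avoid the memory error when the full matrix is not needed at once. The file layout (a `gene` column followed by one column per sample), the toy file itself, and all variable names are assumptions made for this example; the real file may be organized differently.

```r
library(data.table)

# Hypothetical toy file standing in for the real expression matrix:
# first column holds gene IDs, the remaining columns are samples.
path <- tempfile(fileext = ".tsv")
toy <- data.frame(gene = paste0("g", 1:100),
                  matrix(rpois(100 * 20, lambda = 10), 100, 20))
names(toy)[-1] <- paste0("s", 1:20)
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# A 64-bit build of R reports a pointer size of 8 bytes.
pointer_bytes <- .Machine$sizeof.pointer

# Read only the header (nrows = 0) to learn the sample names.
header <- names(fread(path, nrows = 0))
samples <- setdiff(header, "gene")

# Draw a random subset of samples and read just those columns.
set.seed(1)
keep <- sample(samples, 5)
sub <- fread(path, select = c("gene", keep), data.table = FALSE)

# Convert to a numeric matrix with genes as row names.
expr <- as.matrix(sub[, keep])
rownames(expr) <- sub$gene
```

The same pattern should carry over to a 1000-sample subset of the real file. Reading all 10,000 columns still needs enough RAM for the whole matrix (roughly 4 GB as doubles for 50,000 x 10,000), regardless of which reader is used.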

5.0 years ago by Mark ★ 1.6k