I have an expression matrix derived from an RNA-seq pipeline; it contains about 50,000 genes (rows) and 10,000 samples (columns). I've been taking subsets of this data (1,000 random samples) to perform some basic downstream analysis, mostly exploratory co-expression analysis. However, this is my first time working with expression data, and I'm unsure how to handle normalization.
I'm using quantile normalization as part of my pre-processing, before calculating correlations and so on, but I'm currently normalizing after taking the subset of 1,000 samples. I never initially considered normalizing the whole dataset before subsetting, because the size of the matrix makes that computationally difficult, and I discarded the possibility without much thought. Recently, though, it has started to bother me: it seems to make more sense to normalize the whole dataset first and then take my random sample, which is intended to be a representative subset of the whole dataset.
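To make the two options concrete, here is a minimal sketch on toy data (not my real pipeline). It assumes NumPy, a genes x samples array, and a simple rank-mean quantile normalization with no special tie handling; `quantile_normalize` and the variable names are just illustrative, and the random lognormal matrix is only a stand-in for real expression values.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile-normalize the columns (samples) of a genes x samples array.

    Simplified: ties are broken by position rather than averaged.
    """
    # Order statistics of every sample
    sorted_expr = np.sort(expr, axis=0)
    # Reference distribution: mean across samples at each rank
    reference = sorted_expr.mean(axis=1)
    # Rank (0-based) of each gene within its own sample
    ranks = expr.argsort(axis=0).argsort(axis=0)
    # Replace each value with the reference value at its rank
    return reference[ranks]

rng = np.random.default_rng(0)
# Toy stand-in for the real matrix: 500 genes x 100 samples
expr = rng.lognormal(mean=2.0, sigma=1.0, size=(500, 100))
# Random subset of samples (columns), drawn without replacement
cols = rng.choice(expr.shape[1], size=10, replace=False)

# Option A: subset first, then normalize
subset_then_norm = quantile_normalize(expr[:, cols])
# Option B: normalize everything, then subset
norm_then_subset = quantile_normalize(expr)[:, cols]

# Largest absolute difference between the two options
max_diff = np.abs(subset_then_norm - norm_then_subset).max()
print(f"max abs difference: {max_diff:.4f}")
```

In both outputs every sample ends up with the same distribution, and the gene ranks within a sample are identical either way. What differs is the reference distribution mapped onto those ranks: the mean of the 10 subset samples in option A versus the mean of all 100 samples in option B.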
Whether normalization should be done before or after subsetting may depend on the specific downstream analysis, but for my case I think normalizing after subsetting doesn't make much sense. Even so, I'd like an opinion from someone who has worked with similar data before. Any help is greatly appreciated.
IMO your question should be about sampling the data vs. using the whole dataset. Why are you subsetting?