Dear Biostars users,
I would like to ask a question about z-score normalization (standardization) of gene-expression data.
As you may have gathered from the title, I would like to know which is the better way to normalize gene-expression data.
When I look at gene-expression examples on the internet, people usually use sample-wise normalization; however, in machine-learning examples and elsewhere, people usually use feature-wise normalization.
What exactly is the difference between these two methods?
So let's say we have a DataFrame like this:
sample_0 sample_1 sample_2 sample_3
gene0 5.1 3.5 1.4 0.2
gene1 4.9 3.0 1.4 0.2
gene2 4.7 3.2 1.3 0.2
gene3 4.6 3.1 1.5 0.2
gene4 5.0 3.6 1.4 0.2
... ... ... ... ...
gene145 6.7 3.0 5.2 2.3
gene146 6.3 2.5 5.0 1.9
gene147 6.5 3.0 5.2 2.0
gene148 6.2 3.4 5.4 2.3
gene149 5.9 3.0 5.1 1.8
This is the sample-wise z-score normalization (subtract the mean of each sample, i.e. each column, from the data and divide by that sample's standard deviation):
sample_0 sample_1 sample_2 sample_3
gene0 -0.900681 1.019004 -1.340227 -1.315444
gene1 -1.143017 -0.131979 -1.340227 -1.315444
gene2 -1.385353 0.328414 -1.397064 -1.315444
gene3 -1.506521 0.098217 -1.283389 -1.315444
gene4 -1.021849 1.249201 -1.340227 -1.315444
... ... ... ... ...
gene145 1.038005 -0.131979 0.819596 1.448832
gene146 0.553333 -1.282963 0.705921 0.922303
gene147 0.795669 -0.131979 0.819596 1.053935
gene148 0.432165 0.788808 0.933271 1.448832
gene149 0.068662 -0.131979 0.762758 0.79067
And this is the feature-wise z-score normalization (subtract the mean of each feature (gene), i.e. each row, from the data and divide by that gene's standard deviation):
sample_0 sample_1 sample_2 sample_3
gene0 1.351023 0.503322 -0.609285 -1.245060
gene1 1.431365 0.354298 -0.552705 -1.232958
gene2 1.358472 0.491362 -0.606977 -1.242858
gene3 1.358655 0.452885 -0.513270 -1.298270
gene4 1.311925 0.562254 -0.615801 -1.258377
... ... ... ... ...
gene145 1.370869 -0.742554 0.514076 -1.142391
gene146 1.321102 -0.792661 0.597972 -1.126413
gene147 1.311682 -0.662893 0.578268 -1.227057
gene148 1.208577 -0.596232 0.692918 -1.305264
gene149 1.195060 -0.582209 0.704779 -1.317631
150 rows × 4 columns
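For reference, here is a minimal sketch of how the two kinds of z-score could be computed with pandas. It assumes genes are rows and samples are columns, as in the tables above, and uses the population standard deviation (`ddof=0`), which appears to match the feature-wise values shown. Only six of the 150 rows are used as toy data, so the sample-wise output will not match the table above (column means and standard deviations depend on all rows), while the feature-wise output for these genes should.

```python
import pandas as pd

# Toy subset of the table above: genes as rows, samples as columns.
# Only six of the 150 rows are included here for illustration.
df = pd.DataFrame(
    {
        "sample_0": [5.1, 4.9, 4.7, 6.7, 6.3, 6.5],
        "sample_1": [3.5, 3.0, 3.2, 3.0, 2.5, 3.0],
        "sample_2": [1.4, 1.4, 1.3, 5.2, 5.0, 5.2],
        "sample_3": [0.2, 0.2, 0.2, 2.3, 1.9, 2.0],
    },
    index=["gene0", "gene1", "gene2", "gene145", "gene146", "gene147"],
)

# Sample-wise z-score: each column (sample) gets mean 0 and std 1.
# ddof=0 (population std) is an assumption; pandas defaults to ddof=1.
sample_wise = (df - df.mean(axis=0)) / df.std(axis=0, ddof=0)

# Feature-wise z-score: each row (gene) gets mean 0 and std 1.
# axis=0 in sub/div aligns the per-gene statistics with the row index.
row_mean = df.mean(axis=1)
row_std = df.std(axis=1, ddof=0)
feature_wise = df.sub(row_mean, axis=0).div(row_std, axis=0)

print(sample_wise.round(6))
print(feature_wise.round(6))
```

The only difference between the two is the axis along which the mean and standard deviation are taken: sample-wise makes samples comparable to each other, while feature-wise puts every gene on the same scale regardless of its absolute expression level.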
I think you need to be clear on the difference between normalisation and standardisation. The z-score is not a good way to normalise gene expression data. However, it can be useful in some circumstances to standardise data that has already been normalised. There is no single recommended way to standardise (or even agreement on whether to standardise at all); it depends on the purpose of your analysis.
Side note: the word is "subtract", not "substract"; there's no s in the middle. I've corrected the word in your post.