I have a set of RNA-seq gene expression data (normalized by the RPM method) from different samples. I need to rescale the gene expression values so that they lie between 0 and 1.
Let:
MIN is minimum gene expression across all genes and all samples
MAX is maximum gene expression across all genes and all samples
MIN_X is minimum expression of gene X across all samples
MAX_X is maximum expression of gene X across all samples
The normalized expression of gene X can be calculated in one of the following ways:
1: This results in very small gene expression values:
normalized gene expression X = (gene expression X - MIN) / (MAX - MIN)
2: This loses the information about the ratio of expression values between genes:
normalized gene expression X = (gene expression X - MIN_X) / (MAX_X - MIN_X)
Is there a way to normalize gene expression values in the range [0,1] by avoiding the above issues?
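To make the two options concrete, here is a minimal R sketch on a made-up matrix. The names rpm, geneA, geneB and the values are assumptions for illustration only, not data from the question.

```r
# Toy RPM matrix (hypothetical values): rows are genes, columns are samples.
rpm <- matrix(c(1000, 2000, 4000,
                   2,    4,    8),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Option 1: global min-max, one MIN and one MAX for the whole matrix.
global_minmax <- (rpm - min(rpm)) / (max(rpm) - min(rpm))

# Option 2: per-gene min-max, MIN_X and MAX_X computed row by row.
# A vector with one entry per row is recycled down the columns,
# so each row is shifted and divided by its own values.
# Note: a gene that is constant across samples gives 0/0 = NaN here.
row_min <- apply(rpm, 1, min)
row_max <- apply(rpm, 1, max)
gene_minmax <- (rpm - row_min) / (row_max - row_min)

print(round(global_minmax, 4))
print(round(gene_minmax, 4))
```

In this toy case geneB is squeezed close to 0 under option 1, while under option 2 both genes end up with identical rows even though geneA is expressed 500 times higher, which is the trade-off described above.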
Why don't you normalize with Z-scores? That will indicate, for each gene, how much each sample deviates from the mean of all samples for that gene.
Transform your counts to log2 and then do
t(scale(t(log2matrix)))
I think Z-score normalization will lose the expression ratio between genes. Won't it?
Yes, that is exactly the point. Each gene can then be visualized on the same scale. If you normalize between zero and one using the global minimum and maximum, the scale will be dominated by highly expressed genes, so it is difficult to see differences between samples for genes that are not highly expressed.
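A small sketch of the per-gene Z-score on log2 values, using hypothetical toy data. The pseudocount of 1 is an assumption added here so that zero values do not become -Inf under log2; it is not something specified in the thread.

```r
# Toy RPM matrix (hypothetical values): rows are genes, columns are samples.
rpm <- matrix(c(1000, 2000, 4000,
                   2,    4,    8),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# log2 transform with a pseudocount (assumption: +1) to avoid log2(0) = -Inf.
log2matrix <- log2(rpm + 1)

# Per-gene Z-score: center and scale every row (gene) separately.
z <- t(scale(t(log2matrix)))

print(round(z, 2))
print(rowMeans(z))       # each gene is centered at (numerically) zero
print(apply(z, 1, sd))   # each gene has unit standard deviation
```

After this step the weakly expressed geneB spans the same range as geneA, so its sample-to-sample differences are just as visible, at the cost of no longer being able to compare expression levels between the two genes.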
What if I do the Z-score normalization with the mean and standard deviation of the expression of all genes in all samples? Does that have any benefit over using gene-specific mean and standard deviation values? Could you also please explain the t and scale functions in t(scale(t(log2matrix)))?
You want to center and divide by the gene-specific mean and SD: since you have thousands of genes, the global mean and SD are not representative of any single gene. The t stands for transpose. Since scale operates column-wise by default, we have to transpose the matrix first so that genes become columns, and then, after the job is done, transpose it back so genes are again rows and samples are again columns.
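The difference between the two approaches can be checked on a tiny example. This is only a sketch: the matrix values and the names z_gene, z_manual and z_global are made up for illustration.

```r
# Toy log2 expression matrix (hypothetical values): genes in rows, samples in columns.
log2matrix <- matrix(c(10, 11, 12,
                        1,  2,  3),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Per-gene Z-score: scale() works on columns, so transpose to put genes in
# columns, scale, then transpose back to genes-in-rows.
z_gene <- t(scale(t(log2matrix)))

# The same computation written out by hand, row by row.
z_manual <- (log2matrix - rowMeans(log2matrix)) / apply(log2matrix, 1, sd)
print(all.equal(z_gene, z_manual, check.attributes = FALSE))

# Global Z-score: a single mean and a single SD for the whole matrix.
z_global <- (log2matrix - mean(log2matrix)) / sd(log2matrix)

print(round(z_gene, 2))
print(round(z_global, 2))
```

With the global version, all of geneA's values are positive and all of geneB's are negative, so the result mostly reflects which gene is expressed more highly; the sample-to-sample differences within each gene are compressed compared with the per-gene version.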