Question

Questions Regarding Normalization and Standardization in RNA-seq Differential Analysis

5 weeks ago

wdasda122 ▴ 10

Dear All,

I have been trying to learn about RNA-seq differential analysis and have encountered some questions regarding normalization and standardization. My questions primarily focus on the following points:

1) Is standardization (also known as Z-score normalization) a subset of normalization? Are the two concepts related in such a way that normalization encompasses standardization?
2) In differential expression analysis, or more broadly in bioinformatics, do normalization and standardization have essential differences?
3) In the DESeq2 literature, the term "normalization" is frequently used. However, after closely examining its steps and principles, I feel that it aligns more with the meaning of standardization. Is it because DESeq2 does not use Z-score standardization, hence the use of the term "normalization" for the sake of accuracy in the article's narrative?
4) My current understanding is that standardization (not limited to Z-score here) aims to eliminate systematic errors/bias, while conventional normalization techniques, such as logarithmic transformation or min-max normalization, primarily serve to scale the data. Is this understanding correct?
5) I am still struggling to distinguish when to use normalization versus standardization in daily bioinformatics analysis. How can I determine which method to apply in different scenarios?
6) Can normalization and standardization be used simultaneously? If so, is there a specific order for applying them? Are there theoretical foundations guiding whether to use one before the other?

During my studies, I have consulted many resources, but the more I read, the more confused I become. Therefore, I am seeking some assistance.

I appreciate any insights or clarifications you can provide!

Thank you!

RNA-seq Normalization • 403 views

updated 5 weeks ago by Ram 44k • written 5 weeks ago by wdasda122 ▴ 10

score 0 · Answer 1 · 2024-10-11

In bioinformatics, the term normalization describes the process of correcting data (for example RNA-seq raw counts) for technical biases, mainly sequencing depth (how many reads were sequenced per sample) and library composition. Let me refer you to this great StatQuest for an extensive explanation of how tools (in this case DESeq2) handle this:

1) Yes, standardization means subtracting each gene's mean from its values and dividing by that gene's standard deviation. This is also called the Z-score. It only makes sense if the data have already been normalized for depth and composition, and it is typically done after log2-transforming these normalized counts.

2) I am not aware of approaches that do DE analysis on Z-scored counts.

3) No, it's normalization, see 2).

4) Normalization eliminates biases, and standardization scales the data. As a matter of fact, the R Z-scoring function is called scale().

5) You always normalize your data. Standardization makes sense for some analyses, such as clustering and heatmaps.

6) Yes, standardization comes after normalization, so you use both.
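
To make the order in 1) and 6) concrete, here is a minimal Python sketch of the sequence: normalize, log2-transform, then Z-score each gene. The counts and size factors are made-up numbers, the function names are hypothetical, and the size factors are simply assumed to come from a normalization tool such as DESeq2. It is an illustration of the idea, not anyone's actual pipeline.

```python
import math
import statistics

# Toy raw counts (made-up): one list of per-sample counts for each gene.
raw_counts = {
    "geneA": [100, 220, 90, 410],
    "geneB": [10, 18, 12, 44],
    "geneC": [500, 1000, 520, 2100],
}

# Assumed per-sample size factors, e.g. as estimated by a normalization tool.
size_factors = [1.0, 2.0, 1.0, 4.0]


def normalize(counts, factors):
    """Step 1: divide each sample's count by that sample's size factor."""
    return [c / f for c, f in zip(counts, factors)]


def log2_transform(values, pseudocount=1.0):
    """Step 2: log2-transform, adding a pseudocount so zeros are defined."""
    return [math.log2(v + pseudocount) for v in values]


def z_score(values):
    """Step 3: per gene, subtract the mean and divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # n - 1 denominator, as in R's scale()
    return [(v - mean) / sd for v in values]


z_scores = {
    gene: z_score(log2_transform(normalize(counts, size_factors)))
    for gene, counts in raw_counts.items()
}
```

After the last step every gene has mean 0 and standard deviation 1 across samples, which is why Z-scores are convenient for heatmaps and clustering but are not what goes into the DE test.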

score 0 · Answer 2 · 2024-10-11

I think it's fair to say that standardization is one form of normalisation that applies in particular cases.

While the term "normalization" may imply that it is about transforming a dataset to fit the normal distribution, even if this was once the case, it is no longer so, at least not in applied statistics disciplines like bioinformatics. Normalisation refers to the process of making two datasets comparable by removing systematic effects that we are not interested in.

We often talk about normalising for something. E.g. "Normalising for the effect of variation in library size" in RNA-seq or "Normalising for the variation in starting material" by using a housekeeping gene in qPCR.
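
As a rough illustration of the qPCR case, the sketch below subtracts a housekeeping gene's Ct from the target gene's Ct in each sample (often called delta-Ct). The Ct values and variable names are hypothetical and only there to show the arithmetic.

```python
# Hypothetical Ct values. sample2 had less starting material, so both of
# its genes cross the threshold 2 cycles later than in sample1.
ct_target = {"sample1": 24.0, "sample2": 26.0}
ct_housekeeping = {"sample1": 18.0, "sample2": 20.0}

# Normalise for starting material: target Ct minus housekeeping Ct.
delta_ct = {s: ct_target[s] - ct_housekeeping[s] for s in ct_target}
```

The raw target Ct values differ between the two samples, but the normalised values are equal, so the apparent difference was due to starting material and not to the gene itself.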

Standardization is a technique that takes data that are, or are assumed to be, normally (Gaussian) distributed and transforms them onto a standard normal distribution by centring the data on a mean of 0 and scaling them to a standard deviation of 1.

Normalisation methods may change many or few moments of the distribution of the data, although most often they change only the mean. This is the case with DESeq2 normalisation, which does in fact calculate factors such that the log fold changes across genes are centred on 0. It doesn't systematically change the standard deviation of the log fold change across genes, but it will usually reduce the standard error of the log fold change estimate for any one gene.
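
A small Python sketch of this point, using a simplified median-of-ratios calculation in the spirit of DESeq2's size factors. This is not DESeq2's actual code: the counts are made up, the function names are hypothetical, and the toy data contain no zero counts, which a real implementation has to handle.

```python
import math
import statistics

# Toy counts (made-up): rows are genes, columns are two samples.
# Sample 2 was sequenced about twice as deeply; gene 4 truly differs.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [10, 80],
    [400, 800],
]


def size_factors(table):
    """Median-of-ratios size factors (simplified; assumes no zero counts)."""
    # Log of each gene's geometric mean across samples.
    log_geo_means = [statistics.mean([math.log(c) for c in row]) for row in table]
    factors = []
    for j in range(len(table[0])):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(table, log_geo_means)]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


def log2_fold_changes(table):
    """Per-gene log2 fold change of sample 2 over sample 1."""
    return [math.log2(row[1] / row[0]) for row in table]


factors = size_factors(counts)
normalised = [[c / f for c, f in zip(row, factors)] for row in counts]

lfc_raw = log2_fold_changes(counts)
lfc_norm = log2_fold_changes(normalised)
```

Comparing `lfc_raw` with `lfc_norm` shows the centre of the log fold changes moving to 0 (here the median, since the sketch uses a median of ratios), while `statistics.stdev` of the two lists is the same: the distribution is shifted, not rescaled.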

In summary, it is best to think of standardization as one very specific example of a normalisation technique. It is most applicable when you want to take a number of variables that have a Gaussian distribution and put them on a comparable scale without caring about the source of their differing distributions.