Hi, I am new at biostatistics and I am interested to understand some stuff on the quantile normalization technique.
Let's assume that I have a microarray dataset that contains a control set (x4 repeats) , a treated condition (x4 repeats) and i want to look for diff expressed genes.
So the most common first step i saw on the net is to normalize data with quantile normalization technique. Before normalization data density functions looks like this :
By assuming now that there are no differentially expressed genes, we can apply the quantile normalization method to all of our data and make them to have the same distribution.
So here is where I'am getting a little bit confused. Till now i knew that if I'am going to run a t-test for a specific gene between these two conditions (control VS treated) I'm going to get a t-value on the null hypothesis distribution and then calculate the p-value. If that p-value is enough smaller from the a=5% threshold, then i could say that there is such a probability to get a difference like this or more extreme between these two conditions if my null hypothesis is true. In many cases in such situation we say that we can reject the null hypothesis and to support that this difference is due to the treatment that actually has its own (different) distribution compared to the control's.
So how is this going to work if we turn the two distributions into one with quantile normalization ?
Is right the way I'm thinking of it or am I missing something ?
If I am missing something then another way to represent variability except the density function of the conditions will probably prove that variability still exists between those two conditions. So, is there any other way to represent such information (variability) ?
Thank you.
This paper will really help your understanding of when to use quantile normalisation: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0679-0
Thanks. I'm gonna read it.
I am not sure, but I think that you apply the quantile normalization to each distribution, and the you get in output as many distribution as you had in the input. See e.g. https://en.wikipedia.org/wiki/Quantile_normalization
You mean that you apply the quantile normalization on control samples and treatment samples separately ?
I think this is incorrect. Quantile normalisation creates a reference distribution from all your samples which you use to normalise each of them separately. This is appropriate when there is small variability across groups. When there are large differences in the distributions between your groups (i.e. samples from different tissues), then quantile normalisation can mask real differences. Instead you can use smooth quantile normalisation.