The reason for the upper quartile normalisation is proportionality and sequencing real-estate issues.
Consider two samples with the following number of transcripts per gene:
|        | A  | B   |
|--------|----|-----|
| gene 1 | 10 | 10  |
| gene 2 | 10 | 10  |
| gene 3 | 10 | 10  |
| gene 4 | 70 | 170 |
Now if we take 1 million reads from each sample we'll get the following read counts:
|        | A    | B    |
|--------|------|------|
| gene 1 | 100k | 50k  |
| gene 2 | 100k | 50k  |
| gene 3 | 100k | 50k  |
| gene 4 | 700k | 850k |
That is, the increase in expression of the highly expressed gene 4 has sucked sequencing real estate away from genes 1-3, even though they haven't actually increased in expression. This is not a freak accident: gene expression levels tend to be log-normally distributed, so the top few genes take up a large fraction of the reads in any experiment, and even a small change in their expression can have major effects on the reads left for the other genes. By excluding the most highly expressed genes when we calculate our normalisation factors, we partially avoid this effect.
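To make this concrete, here is a minimal sketch of the toy example in base R (my own illustration, not any package's implementation; `type = 1` makes `quantile()` pick an actual observation rather than interpolating towards the outlier, which only matters because this toy has just 4 genes):

```r
# True transcript abundances per gene in the toy example above.
true_A <- c(10, 10, 10, 70)
true_B <- c(10, 10, 10, 170)

# Sequencing hands out a fixed budget of reads in proportion to abundance.
depth   <- 1e6
reads_A <- depth * true_A / sum(true_A)    # 100k 100k 100k 700k
reads_B <- depth * true_B / sum(true_B)    #  50k  50k  50k 850k

# Total-count normalisation: both libraries already sum to 1 million reads,
# so genes 1-3 appear to halve in B even though they haven't changed.
reads_B / reads_A                          # 0.50 0.50 0.50 1.21

# Upper quartile normalisation: scale each sample by its 75th percentile
# count, which is untouched by the highly expressed gene 4.
uq_A <- quantile(reads_A, 0.75, type = 1)  # 100k
uq_B <- quantile(reads_B, 0.75, type = 1)  #  50k
(reads_B / uq_B) / (reads_A / uq_A)        # 1.00 1.00 1.00 2.43
```

The upper-quartile-scaled ratios recover the true changes: genes 1-3 unchanged, gene 4 up 170/70 ≈ 2.43-fold.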
This argument is possibly best laid out in Robinson et al., although they propose a different solution to upper quartile normalisation, one that only works in a differential expression context. Anders et al. also go through it, again with their own conclusion on the best normalisation method. As far as I can tell, the first reference for UQ normalisation in RNA-seq is Bullard et al.
Here is the page from GDC for FPKM-UQ.
Thank you for your response. My problem is with the biological concept behind the formula. Why is it that "the total protein-coding read count is replaced by the 75th percentile read count value for the sample"?
My understanding is that, from a statistical point of view, normalising the sample by the 75th percentile read count value makes it less affected by outliers.
But that is just a statistical definition. Why is it related to protein-coding genes?
They introduced that part (the 'UQ' part) in order to "facilitate cross-sample comparison and differential expression analysis".
[source: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification]
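To spell out what that replacement does to the formula, here is a sketch of the two variants as I read the quoted description (an illustration of the idea, not the GDC code itself; details such as which genes enter the quantile, e.g. whether zero-count genes are dropped, are assumptions here):

```r
# FPKM vs FPKM-UQ, following the quoted GDC description.
# counts: raw read counts for protein-coding genes; len_bp: gene lengths in bp.
fpkm <- function(counts, len_bp) {
  counts * 1e9 / (sum(counts) * len_bp)   # scale by the total read count
}
fpkm_uq <- function(counts, len_bp) {
  uq <- quantile(counts[counts > 0], 0.75, type = 1)
  counts * 1e9 / (uq * len_bp)            # scale by the 75th percentile count
}
```

Both versions divide out gene length; only the library-size term differs, and the 75th percentile count is far less sensitive to a handful of extremely highly expressed genes than the total count is.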
I know that many lectures show that FPKM/RPKM are not good. My impression is that FPKM/RPKM is not really a normalisation that works across samples, only within each sample. Right?
If you are just looking at a single sample, i.e., n=1, use of RPKM/FPKM units is generally fine. When you have n>1, the problem is that the normalisation method that produces RPKM/FPKM will normalise each sample differently, and the main parameter that affects this is the depth of coverage at which each sample was sequenced. So, an RPKM/FPKM expression value of 200 in one sample is not equivalent to 200 in another sample.
In theory, we can sequence 2 samples to the same target depth of coverage to overcome this; however, in practice, biases always exist and they will be sequenced at different depths.
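As a toy illustration of the non-equivalence (gene 1 from the tables above, with a made-up gene length):

```r
len_bp <- 2000   # hypothetical gene length
# Gene 1 is unchanged in truth, but its read count differs between the
# two libraries because gene 4 claimed more of the reads in sample B.
100000 * 1e9 / (1e6 * len_bp)   # RPKM in sample A: 50000
50000  * 1e9 / (1e6 * len_bp)   # RPKM in sample B: 25000
```

Even at identical depth the values disagree two-fold for a gene that hasn't changed; with different depths and real-world biases it only gets murkier.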
Hope that this makes sense.
My opinion is the same as yours. I have used RMA in the past, and later FPKM and others. I think I must first adjust for the sequencing issue, then adjust for batch effects across samples. Right? Thank you.
If you follow what Ian (i.sudbery) is saying and decide to use DESeq2, then you can adjust for batch effects by just including the batch variable in the design formula (see the sketch at the end of this answer).
I also used RMA in the past for microarrays. It has taken some time for adequate normalisation methods for RNA-seq to be developed. DESeq2 and EdgeR are very popular, though.
Take a look here in order to get started with DESeq2: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#can-i-use-deseq2-to-analyze-paired-samples (batch is mentioned at the beginning, under Quick start)
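For example, a minimal sketch of the batch adjustment mentioned above (assuming a raw count matrix `cts` and a sample table `coldata` with `batch` and `condition` factor columns, as in the vignette):

```r
library(DESeq2)

# cts: gene-by-sample matrix of raw counts
# coldata: data.frame with factor columns 'batch' and 'condition'
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData   = coldata,
                              design    = ~ batch + condition)
dds <- DESeq(dds)    # normalisation and model fitting
res <- results(dds)  # condition effect, adjusted for batch
```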