Can you please explain the core problem with RPKM normalization (as a measure of relative abundance), using a simple example, and why TPM solves it? Different explanations I have seen for why the RPKM unit is bad are: (a) it uses length normalization, (b) it normalizes to total library size, (c) the average RPKM value is not maintained across samples, (d) a few transcripts dominate the total, or (e) it assumes that total RNA is the same across samples. Which is the main correct explanation?
1. This source gives an example of two equal-length genes. In condition 1, Gene A has 50,000 reads and Gene B has 0. In condition 2, Gene A has 50,000 reads and Gene B has 10,000 reads. The average RPKM is the same in condition 1 and condition 2, so (c) isn't violated. The source says Gene A has a different RPKM in condition 1 and condition 2. Why is that a problem? If reads are sampled in proportion to length and expression, Gene A could not have 50,000 reads in both conditions. Is this really a good example of RPKM giving the wrong answer?
2. The Wagner paper also says the problem is the normalization by the total number of reads. They say it leads to different average RPKMs per sample: "The reason for the inconsistency of RPKM across samples arises from the normalization by the total number of reads." In the example above this is not a problem: the average RPKM is the same in both conditions. Only when the gene lengths differ do you get a different average:
Condition 1: Gene A: 50,000 reads, Gene B: 0 reads
Condition 2: Gene A: 50,000 reads, Gene B: 10,000 reads
If the length of A is 1000 and the length of B is 10, then the average RPKM is 0.5 million for condition 1 and 8.75 million for condition 2 (using RPKM = 10^9 × reads / (total reads × length in bp)), so the average RPKM is different.
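The arithmetic for this two-gene example can be checked with a short sketch. It assumes the usual definitions (RPKM = 10^9 × reads / (total reads × length in bp); TPM = 10^6 × (reads / length) / sum of reads / length over all genes). The function names `rpkm`, `tpm` and `mean` are made up for illustration and do not come from any package.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads, for one sample."""
    total_reads = sum(counts)
    return [1e9 * c / (total_reads * l) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [1e6 * r / total_rate for r in rates]


def mean(values):
    return sum(values) / len(values)


# Gene A is 1000 bp, Gene B is 10 bp (the lengths used in the example above).
lengths = [1000, 10]
cond1 = [50_000, 0]        # reads for Gene A, Gene B
cond2 = [50_000, 10_000]

mean_rpkm_1 = mean(rpkm(cond1, lengths))
mean_rpkm_2 = mean(rpkm(cond2, lengths))
mean_tpm_1 = mean(tpm(cond1, lengths))
mean_tpm_2 = mean(tpm(cond2, lengths))

print("mean RPKM:", mean_rpkm_1, mean_rpkm_2)  # differs between conditions
print("mean TPM: ", mean_tpm_1, mean_tpm_2)    # identical by construction
```

Because TPM values sum to 10^6 in every sample, their mean is fixed at 10^6 divided by the number of genes, whereas the mean RPKM depends on how the reads are distributed over genes of different lengths.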
3. The same source as #1 above says that RPKM is dominated by a few highly expressed genes since most of read counts come from those. How is that different from TPM?
4. This says the problem with RPKM is the assumption of the same total RNA per sample. It gives the example: "Suppose you have two RNA populations A and B sequenced at same depth. A and B are identical except half of genes in B are unexpressed in A. Only half of reads from B come from shared gene set. Estimates for shared genes differ by factor of ~2". Can someone clarify this with an example?
Lior Pachter, the guy who introduced FPKM after Ali Mortazavi introduced RPKM, gave a very nice talk about exactly this at the Cold Spring Harbor meeting some time ago (I reckon it was 2013). Here is the filmed talk starting at the appropriate time.
There he explains very carefully and very nicely why RPKM / FPKM (which are in fact the same thing, just for single-end / paired-end reads) is not the best unit to use when comparing RNA-seq experiments.
IMHO it's worth watching the whole video ;)
There's more than one core problem with RPKMs (as you seem to have noticed), but the deal-breaker will depend on your goals. It should also be noted that for standard gene-level differential expression, conversion to RPKM also loses precision information, so you can't as accurately weight samples when you're trying to estimate parameters (e.g., in a linear model).
The link for the EBI training presentation by Ernest Turro in #4 is broken. Here is the corrected link: https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2012/121029_HTS/ernest_turro_normalising_rna-seq_data.pdf