I have a dataset of 17 cancer cell samples which have been RNA-sequenced. The provider has normalised the data with both FPKM and TPM normalisations. However, when I plot the global expression values (just the control samples), it produces a curve like the one in the image attached.
Intuitively, I think the normalisations should follow the red line of unity?
How might I explain why the normalised expression values diverge at higher expression values?
Please show the functions you used to calculate this. Both transformations are simple division and multiplication operations involving the column sums of the count matrix and the gene lengths. There is no non-linear element in either, so the plot is odd. The only difference between the two is the order of operations: TPM first divides out the gene length and then applies the column-sum scaling, whereas FPKM first divides by the column sum (scaled to per million) and then divides by the gene length.
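To make the order-of-operations point concrete, here is a minimal sketch of both calculations for a single sample. The counts and gene lengths are made-up toy values, and the function names are only illustrative, not taken from any particular package.

```python
# Toy data for one sample: raw read counts and gene lengths in base pairs (made-up values).
counts = [10, 200, 3500, 42000, 90]
lengths_bp = [500, 1500, 2000, 4000, 1200]


def fpkm(counts, lengths_bp):
    """Scale by library size (per million reads) first, then by gene length (per kb)."""
    total = sum(counts)
    return [c / total * 1e6 / (l / 1e3) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Divide by gene length (per kb) first, then scale so the sample sums to one million."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


f = fpkm(counts, lengths_bp)
t = tpm(counts, lengths_bp)

# Within one sample, TPM is FPKM times a single constant (1e6 / sum of FPKM),
# so the per-gene ratio is the same for every gene.
ratios = [ti / fi for ti, fi in zip(t, f)]
print(ratios)
```

Because the ratio is constant within a sample, plotting TPM against FPKM on matching scales should give a straight line through the origin, not a curve.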
Regardless, as has been shown many times before, simple operations like these fail to correct for library composition effects, see for example here or TMM-Normalization. Just because everyone uses these metrics mostly wrong does not mean people should keep doing that. The normalization efficiency of these methods is often poor, and in a differential analysis with tools like limma they're problematic because the length correction distorts the normal mean-variance trend of the observed sequenced counts. That trend is a key point of these sorts of analysis methods: they aim to learn from the entire dataset how much variance to expect for a gene at a given expression level, and then decide whether the observed differences between groups are large enough to qualify the gene as differentially expressed, rather than being a likely artifact of the normal expected variance.
Looking at the range of the data, it seems your input FPKM is very likely on the log scale, i.e. your input is log(FPKM). The TPM data instead is on the linear scale. I think that is the cause of the apparent massive discrepancy.
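A quick way to see how this would produce a curve: a minimal sketch with made-up numbers, assuming the transform was log2(FPKM + 1) (the provider's actual transform is not known) and using an arbitrary constant factor between TPM and FPKM. It compares the local slope on log-log axes when both inputs are linear with the slope when one of them has already been logged.

```python
import math

# Hypothetical FPKM values spanning several orders of magnitude.
fpkm_values = [1.0, 10.0, 100.0, 1000.0, 10000.0]
# Within one sample TPM is a constant multiple of FPKM; the factor 1.3 is made up.
tpm_values = [1.3 * v for v in fpkm_values]
# Assumed transform: log2(FPKM + 1).
log_fpkm = [math.log2(v + 1) for v in fpkm_values]


def loglog_slopes(x, y):
    """Local slope between consecutive points as it would appear on log-log axes."""
    return [
        (math.log10(y[i + 1]) - math.log10(y[i]))
        / (math.log10(x[i + 1]) - math.log10(x[i]))
        for i in range(len(x) - 1)
    ]


# Both inputs linear: every slope is 1, i.e. a straight line parallel to y = x.
linear_slopes = loglog_slopes(fpkm_values, tpm_values)
# One input already logged: the slope keeps increasing with expression, i.e. a curve.
mixed_slopes = loglog_slopes(log_fpkm, tpm_values)

print(linear_slopes)
print(mixed_slopes)
```

If the provider can confirm the transform, back-transforming the FPKM values before plotting should bring the points onto a straight line.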
Good catch, that makes sense. As I said above, mathematically I don't see a reason for a non-linear relationship if both are on the linear scale, and I also don't agree with the answer from Istvan, as all that differs between the metrics is the order of operations.
I think both are logarithmic axes. One just uses exponential notation and the other does not.
Yes, the axes are both logarithmic, the difference is in the input data with FPKM being log'd and TPM being raw.
Go with TPM. When you use TPM, the sum of all TPMs in each sample is the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.
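As a small illustration of the per-sample sums, here is a sketch with two made-up samples that share the same gene lengths. The sample names, counts and lengths are hypothetical.

```python
# Gene lengths in kilobases and raw counts for two toy samples (made-up values).
lengths_kb = [0.5, 1.5, 2.0, 4.0]
samples = {
    "sample_A": [100, 300, 400, 200],
    "sample_B": [100, 300, 400, 8000],  # one long gene dominates this library
}


def fpkm(counts):
    """Library-size scaling (per million) followed by length scaling (per kb)."""
    total = sum(counts)
    return [c / total * 1e6 / l for c, l in zip(counts, lengths_kb)]


def tpm(counts):
    """Length scaling (per kb) followed by rescaling to a total of one million."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


fpkm_sums = {name: sum(fpkm(c)) for name, c in samples.items()}
tpm_sums = {name: sum(tpm(c)) for name, c in samples.items()}

print(fpkm_sums)  # totals differ between the two samples
print(tpm_sums)   # both totals are one million, up to floating-point rounding
```

Note that a fixed total only makes the values proportions within each sample. It does not by itself correct for the library composition effects mentioned earlier in the thread.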
Thank you for your answer! Do you have any sense of why the expression values diverge at higher-expressed genes? I can't see anything in the maths that suggests this should be the case.
I don't have a better mathematical explanation, but FPKMs are not reliable for gene expression studies. There are many discussions about this topic everywhere.