TPM vs FPKM divergence at high values
11 days ago

I have a dataset of 17 cancer cell samples which have been RNA-sequenced. The provider has normalised the data with both FPKM and TPM normalisations. However, when I plot the global expression values against each other (just the control samples), it produces a curve like the one in the image attached.

Intuitively, I think the normalisations should follow the red line of unity?

How might I explain why the normalised expression values diverge at higher expression-values?

scatterplot of normalised gene expression plotted against each other

tpm fpkm normalisation • 778 views

Please show the functions used to calculate this. Both transformations are simple division and multiplication operations involving the column sums of the count matrix and the gene length. There is no non-linear element in this, so the plot is odd. The only difference between the two is the order of operations: TPM first divides out the gene length and then normalises by the column sum, whereas FPKM first divides by the column sum (scaled to a million) before dividing by the gene length.
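The two orders of operations described above can be sketched as follows. This is a minimal illustration for a single sample, not the provider's actual pipeline; the counts and gene lengths are made up, and the function names are hypothetical.

```python
def fpkm(counts, lengths_bp):
    """FPKM for one sample: scale by library size first, then by gene length."""
    per_million = sum(counts) / 1e6
    return [c / per_million / (l / 1e3) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM for one sample: scale by gene length first, then by the sum of rates."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    per_million = sum(rates) / 1e6
    return [r / per_million for r in rates]


# Made-up counts and gene lengths (bp), spanning several orders of magnitude.
counts = [10, 500, 20000, 1500000]
lengths_bp = [800, 2000, 1500, 3000]

f = fpkm(counts, lengths_bp)
t = tpm(counts, lengths_bp)

# Within one sample, TPM_i = FPKM_i * 1e6 / sum(FPKM), so the ratio is the
# same for every gene, regardless of expression level.
ratios = [ti / fi for ti, fi in zip(t, f)]
print(ratios)
print(sum(t))  # TPM values sum to one million (up to rounding)
```

Since the ratio is constant within a sample, a TPM vs FPKM scatterplot on linear or log-log axes should be a straight line if both inputs are on the linear scale.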

Regardless, as has been shown many times before, simple operations like that fail to correct for library composition effects, see for example here or TMM-Normalization. Just because these metrics are widely misused does not mean people should keep doing so. Normalization efficiency of these methods is often poor, and in a differential analysis with tools like limma they're problematic because the length correction distorts the normal mean-variance trend based on the observed sequenced counts. That trend is a key point of these sorts of analysis methods: they aim to learn from the entire dataset how much variance to expect at a given expression level for a given gene, and then decide whether the observed differences between groups are large enough to qualify the gene as differentially expressed, rather than being a likely artifact of the normal expected variance.


Looking at the range of the data, it seems your input FPKM is very likely on the log scale, i.e. your input is log(FPKM). The TPM data instead is on the linear scale. I think that is the cause of the apparent massive discrepancy.
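One way to check this is to look at the per-gene ratio between the two metrics, which should be constant within a sample when both are on the linear scale. The sketch below uses made-up values and assumes a `log2(x + 1)` transform purely for illustration; the actual transform applied to the data, if any, is not known.

```python
import math

# Made-up linear-scale values for one sample, with TPM = k * FPKM.
fpkm_linear = [0.5, 5.0, 50.0, 500.0, 5000.0]
k = 1.8  # arbitrary constant, for illustration only
tpm_linear = [k * x for x in fpkm_linear]

# Assumption: the supplied "FPKM" was log2(FPKM + 1) transformed.
fpkm_logged = [math.log2(x + 1) for x in fpkm_linear]


def ratio_spread(y, x):
    """Largest per-gene ratio y/x divided by the smallest (1.0 means constant)."""
    ratios = [yi / xi for yi, xi in zip(y, x)]
    return max(ratios) / min(ratios)


spread_linear = ratio_spread(tpm_linear, fpkm_linear)  # close to 1.0
spread_logged = ratio_spread(tpm_linear, fpkm_logged)  # grows with expression

print(spread_linear, spread_logged)
```

A spread far from 1 suggests the two columns are not on the same scale. Another quick check is the value range: linear FPKM for highly expressed genes typically reaches the thousands, whereas logged values rarely exceed a few tens.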


Good catch, that makes sense. As I said above, mathematically I don't see a reason for a non-linear relationship if both are on the linear scale, and I also don't agree with the answer from Istvan, as all that differs between the metrics is the order of operations.


I think both are logarithmic axes. One is just labelled in exponential notation and the other is not.


Yes, the axes are both logarithmic, the difference is in the input data with FPKM being log'd and TPM being raw.


Go with TPM. When you use TPM, the sum of all TPMs is the same in every sample. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads may differ between samples, which makes it harder to compare samples directly.
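A small sketch of the point about sums, using two made-up samples over the same three genes (all numbers and names are hypothetical):

```python
lengths_kb = [1.0, 2.0, 4.0]  # made-up gene lengths in kb
samples = {
    "A": [100, 200, 700],  # made-up read counts
    "B": [600, 300, 100],
}


def fpkm(counts):
    per_million = sum(counts) / 1e6
    return [c / per_million / l for c, l in zip(counts, lengths_kb)]


def tpm(counts):
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


fpkm_sums = {name: sum(fpkm(c)) for name, c in samples.items()}
tpm_sums = {name: sum(tpm(c)) for name, c in samples.items()}

print(fpkm_sums)  # differs between samples (375,000 vs 775,000 here)
print(tpm_sums)   # one million in both samples, up to rounding
```

Both samples have the same total read count here, yet the FPKM sums differ because the reads fall on genes of different lengths.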


Thank you for your answer! Do you have any sense of why the expression values diverge for more highly expressed genes? I can't see anything in the maths that suggests this should be the case.


I don't have a better mathematical explanation, but FPKMs are not reliable for gene expression studies. There are many discussions of this topic elsewhere.

8 days ago

I admit I had no firm idea what a TPM vs FPKM plot might look like.

I had never computed both, but this motivated me enough to take a salmon quant.sf and compute FPKM from it (salmon provides the TPM). When I plot TPM vs FPKM, I get this (the code is ChatGPT-generated but looks legit :-) ):

plot of TPM vs FPKM computed from the salmon quant.sf

(Edit: I also removed my previous opinion that the divergence is caused by the metrics themselves, since that explanation applies to situations where the transcript size changes between conditions, not to values within one sample.)
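The salmon-based comparison described above can be sketched roughly as follows. This is not the code used for the plot, which was not posted. It parses a tiny made-up table with the same column names as a salmon quant.sf (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`) and assumes FPKM is computed from `NumReads` and `EffectiveLength`, which is the length salmon's own TPM is based on.

```python
import csv
import io

# Tiny made-up stand-in for a salmon quant.sf (tab-separated, same columns).
QUANT_SF = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "t1\t1000\t800\t100000.0\t80\n"
    "t2\t2000\t1800\t300000.0\t540\n"
    "t3\t500\t300\t600000.0\t180\n"
)

rows = list(csv.DictReader(io.StringIO(QUANT_SF), delimiter="\t"))

reads = [float(r["NumReads"]) for r in rows]
eff_kb = [float(r["EffectiveLength"]) / 1e3 for r in rows]
tpm = [float(r["TPM"]) for r in rows]

# FPKM: reads per kilobase of effective length per million mapped reads.
per_million = sum(reads) / 1e6
fpkm = [n / per_million / l for n, l in zip(reads, eff_kb)]

# If both metrics are on the linear scale, this ratio is the same for
# every transcript in the sample.
ratios = [t / f for t, f in zip(tpm, fpkm)]
for row, f, ratio in zip(rows, fpkm, ratios):
    print(row["Name"], f, ratio)
```

For a real file, replacing `io.StringIO(QUANT_SF)` with an open file handle on quant.sf should be enough. Using `Length` instead of `EffectiveLength` would make the ratio vary slightly between transcripts, but not in the strongly curved way shown in the question.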
