Question

Calculate TPM from cuffnorm count data

2

Entering edit mode

6.9 years ago

galozs ▴ 20

Hi,

I am working with RNA-seq data and normalizing with Cufflinks' Cuffnorm, whose output is count and FPKM matrices. since I'm comparing gene expression between samples, I wanted to use TPM rather than FPKM. I was wondering about few issues: 1- Can I calculate TPM values directly from the FPKM values of Cuffnorm? 2- Can I calculate TPM from Cuffnorm's count table? Is it raw enough? 3- What is unique in Cuffnorm's FPKM vs. regular FPKM?

Thanks!

RNA-Seq cuffnorm TPM FPKM cufflinks • 4.5k views

ADD COMMENT • link updated 6.9 years ago by Matteo Schiavinato ★ 3.6k • written 6.9 years ago by galozs ▴ 20

score 0 · Answer 1 · 2018-01-15

0

Entering edit mode

6.9 years ago

Matteo Schiavinato ★ 3.6k

To calculate TPM from FPKM use the formula proposed by Pachter et al, 2011, here:

$enter image description here$

To fully understand this formula: What the FPKM?

ADD COMMENT • link 6.9 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Thanks! Do you know if it is OK to do it with the FPKM output values of Cuffnorm? I used the formula that you mentioned and got very wired results when calculated correlation matrix. So I thought that maybe Cuffnorm's FPKM are already manipulated in a way...

ADD REPLY • link 6.9 years ago by galozs ▴ 20

0

Entering edit mode

From every software output you compute the FPKM, you should retrieve the same number, if you start from the same counts, otherwise that's some great bug in data science. The sum of the FPKM on the denominator of the fraction is belonging to the sample, not the whole data set. Does that help? Did you multiply by 10 to the power of 6? Many times I forget to do that...

ADD REPLY • link 6.9 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

That what I was thinking though the end values where very strange. (I checked that all the steps were done correctly according to the formula) these are pearson correlations of the log2 FPKM data:

and these are the pearson correlations of the log2 TPM: as you can see the TPM looks realy bad. Do you have an idea? where is my mistake?

ADD REPLY • link 6.9 years ago by galozs ▴ 20

0

Entering edit mode

TPM is not just a rearrangement of FPKM, it contains a sample-specific factor. Therefore, it might mean that your two samples are not that similar after all. However, you should still remove low values (scale the threshold from FPKM=1 to TPM=<val> for each of your samples).

ADD REPLY • link 6.9 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

What do you mean by "scale the threshold from FPKM=1 to TPM=<val> for each of your samples" ? both correlation matrix I attached were calculated separately on the FPKM / TPM log2 values. Do you mean that the noise threshold in is different for each one of the samples? For example if I used log2(FPKM) = 5 as a threshold than in TPM I will have a different one for each sample?

Thanks!

ADD REPLY • link 6.9 years ago by galozs ▴ 20

0

Entering edit mode

Exactly.

Please follow me through the formula that I posted "upstairs":

If you look carefully, you'll see that the fraction includes the FPKM value of gene i as numerator, and the sum of all the FPKM values of sample j as denominator.
This means that, for each sample, you'll have a different sum at the denominator

This being said, to obtain a sample-specific threshold, you will apply 5 as numerator and then calculate the sum of the FPKMs for each sample to divide 5 with. The fraction is then multiplied by 10 to the power of 6, giving you the TPM version of FPKM=5, relativized to the total sample expresssion, for each sample.

ADD REPLY • link 6.9 years ago by Matteo Schiavinato ★ 3.6k