Using transcripts per million (TPM)

13

10.2 years ago

Tom Harrop ▴ 170

Hi BioStars,

I have two questions about using TPM (transcripts per million). I've read some papers on the calculation and some blog and forum posts so I have some understanding of what it is. The true analysis for this experiment was with raw counts and vst expression values, and I'm basically just having a look at TPM out of interest.

My questions:

1. Is it valid to calculate TPM from DESeq2's normalised counts, i.e. counts(dds, normalized = TRUE), or do I have to use the raw counts? I tried both and there didn't seem to be much difference (in fact, for the genes I've looked at, the TPM values aren't that different from the normalised counts in either case), but I haven't tested it thoroughly.

2. I understand why one shouldn't compare TPM between samples, since the total expression rates, rRNA component etc. vary from sample to sample. I'm just wondering if this would be less of a problem in the case where data from three biological replicates were available?

Thanks for reading and have a nice Friday,

Tom

RNA-Seq deseq2 tpm • 48k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 10.2 years ago by Tom Harrop ▴ 170

9

10.2 years ago

karl.stamm 4.1k

For question 1) TPM is not a read count. Normalized read counts correct for each sample's sequencing depth, whereas TPM is about transcripts: it is inferred by a model that accounts for long genes getting more reads and that uses spliced reads to infer isoform usage. In that way it's like Cufflinks' Cuffnorm for FPKM. The only tool I know that makes TPM is RSEM.

For question 2) comparing different kinds of samples will suffer bias if the distribution of mRNAs is very different, but biological replicates are as close as possible, so that IS the appropriate place to compare values.

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by karl.stamm 4.1k
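The composition effect behind question 2 can be illustrated with a small sketch. The data below are made up for illustration (the gene names, lengths and counts are hypothetical): two samples have identical counts for genes A to C, but the second carries a much larger rRNA-like component. Because TPM always sums to one million within a sample, the extra component pushes down the TPM of every other gene even though their counts did not change.

```r
# Hypothetical toy data: identical counts for genes A-C in both samples,
# but sample s2 has a much larger rRNA-like component.
len_kb <- c(A = 2, B = 1, C = 4, rRNA = 2)
counts <- cbind(
  s1 = c(A = 200, B = 100, C = 400, rRNA = 100),
  s2 = c(A = 200, B = 100, C = 400, rRNA = 1300)
)

# TPM per sample: length-normalise, then scale each column to sum to 1e6.
tpm <- function(counts, len) {
  rate <- counts / len                      # row i is divided by len[i]
  sweep(rate, 2, colSums(rate), "/") * 1e6  # column-wise rescaling
}

tpm_values <- tpm(counts, len_kb)
round(tpm_values)
colSums(tpm_values)   # each sample sums to one million
```

Within a sample the ratios between genes are preserved; it is the absolute TPM of A, B and C that shifts between s1 and s2. Replicates with similar composition are affected much less by this.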

1

Hi Karl,

Thanks for the reply.

I'm sorry if my first question wasn't clear. I realise TPM is not read count—I manually calculated TPM from normalised read count (and, separately, from raw read count) using the gene lengths from my GTF file. I don't know whether it's valid to use the normalised counts instead of the raw counts in the TPM calculation.

Tom

ADD REPLY • link 10.2 years ago by Tom Harrop ▴ 170
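On the raw-versus-normalised question, a minimal sketch may help. It assumes the normalisation is a single scaling factor per sample, which is what DESeq2 size factors are; the counts, lengths and `size_factors` below are invented stand-ins, not output from a real `dds` object. Under that assumption the per-sample factor cancels when each column is rescaled to one million, so both inputs give the same TPM.

```r
# Hypothetical raw counts: 4 genes x 3 samples.
raw <- matrix(
  c(120, 30, 400, 15,
     90, 45, 380, 20,
    150, 25, 510, 10),
  nrow = 4,
  dimnames = list(paste0("g", 1:4), paste0("s", 1:3))
)
len <- c(1500, 800, 2200, 500)        # hypothetical gene lengths in bases
size_factors <- c(0.8, 1.0, 1.3)      # stand-ins for sizeFactors(dds)

# Size-factor normalisation divides each sample (column) by one number.
normalised <- sweep(raw, 2, size_factors, "/")

tpm <- function(counts, len) {
  rate <- counts / len
  sweep(rate, 2, colSums(rate), "/") * 1e6
}

# The per-sample factor cancels, so the two results agree.
all.equal(tpm(raw, len), tpm(normalised, len))
```

This would not hold if gene-specific normalisation factors were used instead of one factor per sample, since those do not cancel in the column-wise rescaling.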

2

Hi,

Could you please tell me which is the formula that you use to manually calculate TPM?

I'm getting a little bit confused since I'm trying to find an "unambiguous" one and I found these 3 links, that don't say exactly the same thing.

	# Script to compare Reads per Kilobase per Million mapped reads (RPKM) to Transcripts per Million (TPM) for gene expression count data
	# Wagner et al. 2012 "Measurement of mRNA abundance using RNA-seq data: RPKM measure
	# is inconsistent among samples" Theory Biosci. 131:281-285

	library(plyr)


	## Worked example from http://blog.nextgenetics.net/?e=51

	X <- data.frame(gene=c("A","B","C","D","E"), count=c(80, 10, 6, 3, 1),
	length=c(100, 50, 25, 5, 1))
	X

	Y <- data.frame(gene=c("F","G","H","I","J"), count=c(20, 20, 10, 50, 400),
	length=c(100, 50, 25, 5, 1))
	Y



	## Calculate RPKM

	# RPKM = (Rg * 10^6) / (T * Lg)
	# where
	# Rg: number of reads mapped to a particular transcript g = count
	# T: total number of mapped reads in the run (here: the sum of all counts)
	# Lg: length of transcript g (kilobases)

	RPKM <- function(Rg, Lg, T) {
	rpkm <- (Rg * 1e6)/(T * Lg)
	return(rpkm)
	}

	T <- sum(X$count)

	RPKM(Rg=X$count[1],Lg=X$length[1],T=T)
	RPKM(Rg=X$count[2],Lg=X$length[2],T=T)
	RPKM(Rg=X$count[3],Lg=X$length[3],T=T)

	# Calculate RPKM using ddply
	rpkm.X <- ddply(X, .(gene), summarize, rpkm = (count * 1e6) / (sum(X$count) * length))
	rpkm.X
	mean(rpkm.X$rpkm)

	rpkm.Y <- ddply(Y, .(gene), summarize, rpkm = (count * 1e6) / (sum(Y$count) * length))
	rpkm.Y
	mean(rpkm.Y$rpkm)

	## Calculate TPM

	# TPM = (Rg * 10^6) / (Tn * Lg)
	# where
	# Tn = sum of all length normalized transcript counts

	(Tn.X <- sum(ddply(X, .(gene), summarize, Tn = count/length)[2]))

	TPM <- function(Rg, Lg, Tn) {
	tpm <- (Rg * 1e6)/(Tn * Lg)
	return(tpm)
	}

	TPM(Rg=X$count[1],Lg=X$length[1],Tn=Tn.X)
	TPM(Rg=X$count[2],Lg=X$length[2],Tn=Tn.X)
	TPM(Rg=X$count[3],Lg=X$length[3],Tn=Tn.X)

	# Great - corresponds to example results!

	# Calculate TPM using ddply
	tpm.X <- ddply(X, .(gene), summarize, tpm = (count * 1e6) / (Tn.X * length))
	tpm.X
	mean(tpm.X$tpm)


	(Tn.Y <- sum(ddply(Y, .(gene), summarize, Tn = count/length)[2]))

	tpm.Y <- ddply(Y, .(gene), summarize, tpm = (count * 1e6) / (Tn.Y * length))
	tpm.Y
	mean(tpm.Y$tpm)

http://lynchlab.uchicago.edu/publications/Wagner,%20Kin,%20and%20Lynch%20%282012%29.pdf

https://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf

I used RSEM to calculate expression, but I need a TPM estimate for a gene that I can't take from the RSEM output (don't ask, it's complicated :) )

In particular, using the formula from the Dewey presentation, (10^6 * Z * (C_i / L'_i * N)), I'm trying to understand what exactly Z stands for. It should be a normalization parameter, so it should be the same for all transcripts (am I right?). But when I try to back-calculate its value from the TPM values in the RSEM output (basically Z = TPM_value / (10^6 * c_i / L_i * N)), I get different results for Z: the values oscillate a little around a constant number.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 10.0 years ago by biola ▴ 20

1

Hi, the formula I used in R was lifted from here (it's the same as the Wagner paper).

ADD REPLY • link 10.0 years ago by Tom Harrop ▴ 170

0

Thanks @biola. I need to normalize my htseq-count data to TPM. I read your code, but in my case I have 20000 genes (rows) and 259 samples (columns). How can I apply your TPM function to that matrix?

ADD REPLY • link 5.4 years ago by modarzi ▴ 170
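For a genes-by-samples matrix, the same formula can be applied column-wise without `ddply`. The following is only a sketch under stated assumptions: `tpm_matrix` and `gene_length` are hypothetical names, the lengths (in bases) have to come from your own annotation, and the small matrix stands in for the real 20000 x 259 table. If the table still contains htseq-count's summary rows (for example `__no_feature`), they would need to be dropped first.

```r
# Sketch: TPM for a count matrix with genes in rows and samples in columns.
# `gene_length` is a named numeric vector of lengths in bases (assumed input).
tpm_matrix <- function(counts, gene_length) {
  stopifnot(!is.null(rownames(counts)),
            all(rownames(counts) %in% names(gene_length)))
  len_kb <- gene_length[rownames(counts)] / 1000  # align lengths to rows
  rate <- counts / len_kb                         # counts per kilobase
  sweep(rate, 2, colSums(rate), "/") * 1e6        # scale each sample to 1e6
}

# Small stand-in for the real matrix.
counts <- matrix(
  c(10, 0, 250, 40,
    12, 3, 300, 35),
  nrow = 4,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"), c("s1", "s2"))
)
# Lengths deliberately given in a different order to show the name matching.
gene_length <- c(geneD = 900, geneA = 2000, geneC = 3500, geneB = 1200)

tpm <- tpm_matrix(counts, gene_length)
colSums(tpm)
```

Matching lengths by gene name rather than by position guards against the length vector and the count matrix being sorted differently.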

0

Sorry, a related question: if I have extracted a list of differentially expressed genes with edgeR, does it make sense to use Transcripts Per Million (TPM) normalized data for co-expression analysis? That is, I first defined the DE genes from raw read counts with edgeR, but since I also had a TPM file, I extracted the edgeR-defined DE genes from the TPM file and used those values for network construction.

ADD REPLY • link 7.2 years ago by zizigolu ★ 4.3k

6

9.0 years ago

SP ▴ 300

Just for the sake of putting TPM formula in readable format:

TPM = ((tag count for transcript n * read length) / length of transcript n) * 1 million / normalizing term

normalizing term = sum over all transcripts of ((tag count for transcript n * read length) / length of transcript n)

For better understanding read RPKM inconsistencies with example and

This might also be useful

ADD COMMENT • link 8.9 years ago by SP ▴ 300
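As a quick check of the formula above, here is a small sketch using the counts and lengths of sample X from the script earlier in the thread. The function name `tpm_with_read_length` and the read lengths tried are illustrative assumptions. Because the read length appears in both the numerator and the normalizing term, it cancels as long as it is the same for every transcript, so the result matches the count/length version of TPM.

```r
# Counts and lengths of sample X from the script earlier in the thread.
count <- c(A = 80, B = 10, C = 6, D = 3, E = 1)
len   <- c(A = 100, B = 50, C = 25, D = 5, E = 1)

# TPM with an explicit read-length term, following the formula above.
tpm_with_read_length <- function(count, len, read_length) {
  x <- (count * read_length) / len   # per-transcript term
  x * 1e6 / sum(x)                   # divide by the normalizing term
}

tpm_50  <- tpm_with_read_length(count, len, read_length = 50)
tpm_100 <- tpm_with_read_length(count, len, read_length = 100)

# Same formula without the read-length term.
tpm_plain <- (count / len) * 1e6 / sum(count / len)

all.equal(tpm_50, tpm_100)
all.equal(tpm_50, tpm_plain)
```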
