Calculating TPM from featureCounts output

17

Entering edit mode

9.6 years ago

nash.claire ▴ 510

Hi all,

Have a simple question but just want to double check I'm not doing something stupid.

I have paired-end RNA-seq data for which I have used featureCounts to quantify raw counts. I now want to normalize using the TPM formula. I read this blog :-

http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

which simply says, divide read counts by gene length in kilobases to give reads per kilobase (RPK), sum all the RPK values and divide by a million for a per million scaling factor and then divide all RPK values by this scaling factor.

So taking my output from featureCounts which looks something like this : -

Geneid      Chr    Start          End            Strand      Length   sample.bam
NM_032291   chr1   66999639 etc   67000051 etc   + etc etc   10934    25

I use the values in the "Length" column as reads per kilobase or do I have to convert this to per kilobase first? It didn't say in the Subread manual much about this length value. And the value in my "sample.bam" column is definitely the read count value I need?

RNA-Seq • 67k views

ADD COMMENT • link updated 22 months ago by camillab. ▴ 160 • written 9.6 years ago by nash.claire ▴ 510

0

Entering edit mode

The paper to cite seems to be Bo Li et al. 2010: https://academic.oup.com/bioinformatics/article/26/4/493/243395 - see page 494 eq. 2, though it references a paper within, and the RNAseq review paper Conesa et al. 2016 references Pachter 2011: https://arxiv.org/pdf/1104.3889.pdf.

ADD REPLY • link 6.5 years ago by janzrimec • 0

23

Entering edit mode

9.6 years ago

Kamil ★ 2.3k

Consider using the tximport R package:

https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html

Below, I'm sharing a function I have used to convert counts to TPM.

See here for a script that gets the number of coding bases per gene.

See here for a script that gets the mean fragment length for each library.

	#' Convert counts to transcripts per million (TPM).
	#'
	#' Convert a numeric matrix of features (rows) and conditions (columns) with
	#' raw feature counts to transcripts per million.
	#'
	#' Lior Pachter. Models for transcript quantification from RNA-Seq.
	#' arXiv:1104.3889v2
	#'
	#' Wagner, et al. Measurement of mRNA abundance using RNA-seq data:
	#' RPKM measure is inconsistent among samples. Theory Biosci. 24 July 2012.
	#' doi:10.1007/s12064-012-0162-3
	#'
	#' @param counts A numeric matrix of raw feature counts i.e.
	#' fragments assigned to each gene.
	#' @param featureLength A numeric vector with feature lengths.
	#' @param meanFragmentLength A numeric vector with mean fragment lengths.
	#' @return tpm A numeric matrix normalized by library size and feature length.
	counts_to_tpm <- function(counts, featureLength, meanFragmentLength) {

	# Ensure valid arguments.
	stopifnot(length(featureLength) == nrow(counts))
	stopifnot(length(meanFragmentLength) == ncol(counts))

	# Compute effective lengths of features in each library.
	effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) {
	featureLength - meanFragmentLength[i] + 1
	}))

	# Exclude genes with length less than the mean fragment length.
	idx <- apply(effLen, 1, function(x) min(x) > 1)
	counts <- counts[idx,]
	effLen <- effLen[idx,]
	featureLength <- featureLength[idx]

	# Process one column at a time.
	tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
	rate = log(counts[,i]) - log(effLen[,i])
	denom = log(sum(exp(rate)))
	exp(rate - denom + log(1e6))
	}))

	# Copy the row and column names from the original matrix.
	colnames(tpm) <- colnames(counts)
	rownames(tpm) <- rownames(counts)
	return(tpm)
	}

view raw counts_to_tpm.R hosted with ❤ by GitHub

ADD COMMENT • link 7.8 years ago by Kamil ★ 2.3k

1

Entering edit mode

What's the meaning of "meanFragmentLength", could you give an example or more description?

ADD REPLY • link 6.5 years ago by Shixiang ▴ 100

1

Entering edit mode

Fragment Length usually corresponds to the size of the DNA or RNA fragments that have been sequenced. Such information is usually obtained with a Bioanalyzer. It can also be infered from the distance between pairs when performing paired-end sequencing.

ADD REPLY • link 6.5 years ago by Gautier Richard ▴ 370

0

Entering edit mode

Simple and clean, thanks for the function.

As a suggestion, shouldn't there be a parameter to keep feature names? As in:

counts_to_tpm <- function(counts, featureLength, meanFragmentLength, featureName)
# Ensure valid arguments.
stopifnot(length(featureLength) == nrow(counts))
stopifnot(length(meanFragmentLength) == ncol(counts))

# Compute effective lengths of features in each library.
 effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) {
 featureLength - meanFragmentLength[i] + 1
}))

# Exclude genes with length less than the mean fragment length.
idx <- apply(effLen, 1, function(x) min(x) > 1)
counts <- counts[idx,]
effLen <- effLen[idx,]
featureLength <- featureLength[idx]
###  featureNameAdjusted <- featureName[idx]

# Process one column at a time.
tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
rate = log(counts[,i]) - log(effLen[,i])
denom = log(sum(exp(rate)))
exp(rate - denom + log(1e6))
}))

# Copy the column names from the original matrix add feature names to rownames (or extra column?)
colnames(tpm) <- colnames(counts)
###rownames(tpm) <- featureNameAdjusted
return(tpm)
}

ADD REPLY • link 8.7 years ago by Fernando Rossello • 0

1

Entering edit mode

Thanks for the suggestion! I modified the function for you.

ADD REPLY • link 8.7 years ago by Kamil ★ 2.3k

0

Entering edit mode

Great Kamil! Thanks.

ADD REPLY • link 8.7 years ago by Fernando Rossello • 0

0

Entering edit mode

Hi so I am confused I want to calculate the TPM from features counts and you wrote a very nice script but do I have to calculate the mean fragment length necessary to get the TPM or can I just run instead:

tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate) * 1e6
}

ADD REPLY • link 22 months ago by camillab. ▴ 160

0

Entering edit mode

Hi! From the link to the program for calculating fragment length I don't seem to find that exact metric. Instead, I found the read length and the insert length, do you use here insert as synonym of fragment? I understand that ideally only the insert is aligned and in that case they can be the same, but what happens if part of the adapter is also involved? I am a little confused about this, and why TPM accounts for the fragment in stead of the insert which is the actual DNA of interest... Can you give me a hand on this? Thanks!

ADD REPLY • link 6.2 years ago by dganiewich ▴ 130

7

Entering edit mode

9.6 years ago

Rob 7.1k

First, let me suggest that you'd probably be better off using a tool for explicitly estimating relative abundance than processing the output of a tool like featureCounts (see e.g. this recent manuscript by Soneson et al.). To answer your question, that's a very round-about way of computing TPM, which seems to introduce some arbitrary scaling factors for no real reason (e.g. why compute RPK rather than just reads per base?). So, anyway, you can compute TPM in the following manner. Let N_i be the number of reads coming from gene i and L_i be the length of gene i. Then TPM_i is given by the following formula:

TPM_i = 10^6 * (N_i / L_i) / [Σ_j (N_j / L_j)]

That is, you first compute the rate of fragments per base for each gene (given by N_i / L_i for gene i), then you divide by the sum of this quantity for all genes --- this normalizes the measurements so they sum to 1. Now, you have a measurement with a non-negative value for each gene that sums to 1 (is a partition of unity); this is called the "transcript fraction" (or, in your case, "gene fraction"). You simply multiply this by 1 million to convert it to "transcripts per million" (TPM).

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by Rob 7.1k

0

Entering edit mode

Thank you for this explanation of how to calculate TPM. I was previously using the method described on the website linked in the question above - it gave me quite a different answer. I was wondering - is there a publication which lists the method to calculate TPM in the way you have in your answer? I've had a look but keep finding different methods to calculate it. Thanks for letting me know.

ADD REPLY • link 6.5 years ago by CC ▴ 50

0

Entering edit mode

It looks like CPM is computed here.

TPM for a gene should be computed using the TPMs computed from transcripts.

It looks like a big confusion is going on regarding CPM, TPM for genes, and TPM for transcripts.

ADD REPLY • link 6.2 years ago by enxxx23 ▴ 310

0

Entering edit mode

2.1 years ago

gernophil ▴ 130

Is there any way, how I can calculate TPM from count data from a publicly available dataset, if I don't have the meanFragmentLength? I can't find it anywhere.

ADD COMMENT • link 2.1 years ago by gernophil ▴ 130

Login before adding your answer.