Get TPM from RNA counts and gene length?
2
2
Entering edit mode
21 months ago
cthangav ▴ 110

Hello, I am working with an RNA-seq FeatureCounts output file that supplies the counts for a given ENSG gene ID, as well as the gene length(according to documentation this is in base pairs, not kilobases). Is there a way to obtain approximate TPMs/FPKMs or something similar to TPM's using these columns?

geneID  geneLength  read counts
ENSG00000121410 4006    3
ENSG00000268895 2793    3
ENSG00000148584 9603    11
ENSG00000175899 6384    1

I've read that it's possible to do the following to get TPMs:

  1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
  2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
  3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.

Alternatively, it seems that some people use the rpkm function EdgeR to do this (however I need TPMs or FPKM's specifically). What would be the best way to calculate TPMs?

EdgeR RNA counts TPM • 7.2k views
ADD COMMENT
1
Entering edit mode

Best way to get TPMs is to use RSEM instead of featureCounts.

The notion of gene length is kinda iffy. Does it mean the length of the entire gene (introns+exons)? Does it mean the lengths of exons? What if one transcript of a gene is expressed very highly while a shorter one of the same gene is not expressed at all -- do we account for that?

ADD REPLY
0
Entering edit mode

I'm not sure how those lengths are determined. Ensembl doesn't think any transcripts of that first gene are 4kb long.

ADD REPLY
0
Entering edit mode

It's gene length, not transcript length. It is the total number of bases covered by any annotated exon for that gene. It merges all the annotated transcripts.

ADD REPLY
0
Entering edit mode

But would you want to use that value for TPM? Or anything?

ADD REPLY
1
Entering edit mode

It works well for RPKM or FPKM. It is the appropriate length measure here because it matches how the genewise read counts are computed.

ADD REPLY
2
Entering edit mode

What must be clear to everyone using RPKM is that simple summing annotated exons is a comparably naive way of getting the gene length as it does not respect in which proportions isoforms are being expressed. Hence, if you want to make statements on expression levels with RPKM this is not accurate. The concern is not theoretical, for example the transcription factor CEBPa (a known driver in hematopoietic malignency, and the isoforms having known and different roles in it) expresses two isoforms with ~25% difference in tx length. It then makes quite a difference whether conditionA expresses 95% isoform-1 and 5% isoform-B but conditionB expresses a 10/90% ratio. featureCounts (please correct if wrong) in my understanding will output the same gene length for all conditions. If this unequal ratio is nested by experimental conditions then RPKMs on naive exon sum is not accurate and one needs dedicated approaches like salmon or rsem to estimate effective length based on transcript expression composition. I would therefore treat featureCounts length output with care if @OP you really want to use that. Whether this affects the big picture is questionable and would need to be checked, it definitely can affect individual genes.

ADD REPLY
2
Entering edit mode

Yes, I am aware of the approximations made by FPKM. (I've been aware of them for 15 years, although people do love to explain it to me again.)

IMO FPKM will probably work fine for OP, just as it did for the original authors of the PECA method that OP is attempting to reproduce. The PECA method is from Wing Wong's group, which is a very strong statistical bioinformatics Lab at Stanford.

This is not the time for a long discussion, but I do wish that people who answer questions on Biostars would treat TPM with a similar level of scepticism. In my experience, FPKM seldom gives a misleading idea of expression changes for more than 1-2 genes per dataset. On the other hand, transcript TPM methods like RSEM or Salmon give TPMs that are wildly wrong for tens of thousands of transcripts for almost every dataset. That is not a criticism of those methods, just an observation on what can be reliably measured from available data.

ADD REPLY
1
Entering edit mode

Dear Professor Smyth,

Can you show me referenced benchmarks that FPKM is misleading for "1-2 genes per dataset" while TPMs "are wildly wrong for tens of thousands of transcripts for almost every dataset".

There are many papers (w/ benchmarks) arguing the limitations of FPKMs and the union-exon approach to gene length (and why TPM is preferable) -- Zhao et al., 2015, the tximport paper, Trapnell et. al 2012, and the blog post (w/ included CSHL talk) from the person who first formulated FPKM which, as you are aware, is a "statistical bioinformatics lab" at Berkeley (now Caltech).

ADD REPLY
0
Entering edit mode

Hello Ram and ATPoint, thank you for helping me repeatedly with my questions. I am now finalizing the paper, and if there is a way to acknowledge you, I would be happy to do so. Thanks again!

ADD REPLY
6
Entering edit mode
21 months ago
Gordon Smyth ★ 7.7k

Assuming your data are paired-end, then FPKMs are what edgeR::rpkm() produces. You seem to be saying that FPKMs are a possible solution for you, so this is most likely what you need. This is the standard and most widely used method of adjusting for gene length when working with RNA-seq data quantifed at the gene level, which is what you have.

TPMs are something else. In the context of a gene-level analysis as you are doing, TPMs are not really a meaningful concept and, worse, they are actively bad because they destroy the TMM normalization that edgeR has presumably so carefully done. It is hard to think of any downstream analysis that you might want to do for which TPMs would be required.

ADD COMMENT
0
Entering edit mode

I tried using edgeR::rpkm() on the feature counts file like this post (unfortunately the link you supply there to your RNA case study isn't working for me)

#read in input feature counts file with ENSGID, counts, and gene length columns
file <- read.delim("D0-Fibroblast_rep1 GSM4553868_19-03967_CCTAAGAC.txt", header = TRUE, sep = "\t", row.names = 1)

#create DGEList object containing the counts and gene IDs as row names
counts <- DGEList(counts=file$X19.03967_CCTAAGAC, genes=rownames(file))

#Calculate scaling factors to convert raw library sizes into effective library sizes
counts <- calcNormFactors(counts)

#calculate rpkm from counts in DGEList object
rpkm <- rpkm(counts, gene.length=file$geneLength)

#add geneIDs as rownames to rpkm object
rownames(rpkm) <- rownames(file)

Is this the correct way to use the rpkm function? And would the output essentially be FPKMs?

For context, downstream I'm trying to use this RNA data for Paired Chromatin Expression Analysis which states it needs FPKM/TPM values. Thank you for your help.

ADD REPLY
0
Entering edit mode

For reference, the DGEList object looks like this:

> head(counts, n=10)
An object of class "DGEList"
$counts
   Sample1
1        1
2        0
3        0
4        0
5        0
6        0
7        0
8        0
9       20
10       0

$samples
        group lib.size norm.factors
Sample1     1  2368446            1

$genes
             genes
1  ENSG00000121410
2  ENSG00000268895
3  ENSG00000148584
4  ENSG00000175899
5  ENSG00000166535
6  ENSG00000256661
7  ENSG00000256904
8  ENSG00000256069
9  ENSG00000128274
10 ENSG00000118017

The rpkm object I get looks like this:

> head(rpkm, n=20)
              Sample1
ENSG00000121410 0.1053963
ENSG00000268895 0.0000000
ENSG00000148584 0.0000000
ENSG00000175899 0.0000000
ENSG00000166535 0.0000000
ENSG00000256661 0.0000000
ENSG00000256904 0.0000000
ENSG00000256069 0.0000000
ENSG00000128274 2.8693019
ENSG00000118017 0.0000000
ENSG00000094914 5.9179872
ENSG00000081760 0.4986101
ENSG00000250420 0.0000000
ENSG00000114771 0.0000000
ENSG00000197953 0.0000000
ENSG00000188984 0.0000000
ENSG00000204518 0.0000000
ENSG00000109576 0.4249089
ENSG00000103591 8.9224210
ENSG00000115977 0.7637599
ADD REPLY
0
Entering edit mode

Is this the correct way to use the rpkm function?

If file$geneLength contains the gene lengths from featureCounts, then yes. Just read the help page ?rpkm.

would the output essentially be FPKMs?

The output is FPKM, exactly what was used in the original PECA papers.

ADD REPLY
1
Entering edit mode
21 months ago

The T means the total length of the transcript, so you would need to divide by the sum of the transcripts and 1 million to get TPM, long story short consult this:

https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

ADD COMMENT

Login before adding your answer.

Traffic: 1642 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6