Question

Calculating FPKM from HTSeq counts table

1

Entering edit mode

7.4 years ago

ann081993 ▴ 10

Hello,

I'm new-bie to RNA-Seq and I'm trying to set up RNA-Seq analysis pipeline. Now, I've got stuck on calculating FPKM from HTSeq counts table.

I've read some articles about FPKM like this and questions on Biostars. But, I' not sure that did I correctly understand because can not reproduce FPKM with public data.

I tried calculation in R and used data from GDC data portal, 551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts and 551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt as an example. (you can get it here)

count <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts",
                stringsAsFactors = FALSE,
                col.names = c("id", "counts"))
count.fpkm <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt",
                     stringsAsFactors = FALSE,
                     col.names = c("id", "fpkm"))

Gene lengths are extracted with scripts from this article.

mrg <- merge(count, exonic.gene.sizes, by = "id") 
head(mrg)
                  id counts   size
1 ENSG00000000003.13   1407  12883
2  ENSG00000000005.5      1  15084
3 ENSG00000000419.11   2280  23689
4 ENSG00000000457.12   1525  44637
5 ENSG00000000460.15    992 192074
6 ENSG00000000938.11   1237  23214

N <- sum(mrg$counts)
mrg.fpkm <- transform(mrg, fpkm = counts / size / N * 10^9)

head(mrg.fpkm)
                  id counts   size        fpkm
1 ENSG00000000003.13   1407  12883 1.522140655
2  ENSG00000000005.5      1  15084 0.000923977
3 ENSG00000000419.11   2280  23689 1.341423203
4 ENSG00000000457.12   1525  44637 0.476159595
5 ENSG00000000460.15    992 192074 0.071981482
6 ENSG00000000938.11   1237  23214 0.742672623

plot(x = mrg.fpkm$fpkm, y = count.fpkm$fpkm)

The result is totally different from what I expected. What's wrong with my calculation or what's wrong in my understanding?

RNA-Seq R HTSeq fpkm • 8.7k views

ADD COMMENT • link 7.4 years ago by ann081993 ▴ 10

2

Entering edit mode

FPKM = (1e+06*Readcount_Gene) / (Gene_Length_in_kb * total_read_depth)

So basically the read count per gene, divided by its length and the total read count. This results in a tiny number. Hence, one multiplies with 1mio to make the number more intuitive to use. Still, FPKM is no longer considered a good choice for RNA-seq. Rather than that TPM seems to be more reasonable (google for Lior Pachters blog together with "TPM" for some details on that).

Some suggestions: First, use a published package for RNA-seq analysis, such as edgeR, DESeq, kallisto-sleuth, etc. These tools will do the calculation for you, including proper statistics and FDR corrections. Second, when plotting data of this kind, use log2-transformed values to avoid excessive axis limits, e.g. log2(readcount+1), +1 as log2 of 0 is undefined.

ADD REPLY • link 7.4 years ago by ATpoint 86k

0

Entering edit mode

Thanks for your suggestions, and you were right. I found the relationship between fpkm value from data portal and the I calculated. So from now, based on what I understood, I'm going to learn about TPM and other pre-built packages, as you suggested. Thanks. enter image description here

ADD REPLY • link 7.4 years ago by ann081993 ▴ 10

0

Entering edit mode

Simply logging something may make it look nice but it does not imply that the distribution is reflective of the underlying 'biology' being measured. Please read:

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

ADD REPLY • link 6.3 years ago by Kevin Blighe 88k

1

Entering edit mode

Video 1: (seriously consider spending 45 minutes to watch this, it's extremely educative)

Reading 1: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

Reading 2: http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

Reading 3: https://liorpachter.wordpress.com/2014/04/30/estimating-number-of-transcripts-from-rna-seq-measurements-and-why-i-believe-in-paywall/