Hello,
I'm new-bie to RNA-Seq and I'm trying to set up RNA-Seq analysis pipeline. Now, I've got stuck on calculating FPKM from HTSeq counts table.
I've read some articles about FPKM like this and questions on Biostars. But, I' not sure that did I correctly understand because can not reproduce FPKM with public data.
I tried calculation in R and used data from GDC data portal, 551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts and 551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt as an example. (you can get it here)
count <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts",
stringsAsFactors = FALSE,
col.names = c("id", "counts"))
count.fpkm <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt",
stringsAsFactors = FALSE,
col.names = c("id", "fpkm"))
Gene lengths are extracted with scripts from this article.
mrg <- merge(count, exonic.gene.sizes, by = "id")
head(mrg)
id counts size
1 ENSG00000000003.13 1407 12883
2 ENSG00000000005.5 1 15084
3 ENSG00000000419.11 2280 23689
4 ENSG00000000457.12 1525 44637
5 ENSG00000000460.15 992 192074
6 ENSG00000000938.11 1237 23214
N <- sum(mrg$counts)
mrg.fpkm <- transform(mrg, fpkm = counts / size / N * 10^9)
head(mrg.fpkm)
id counts size fpkm
1 ENSG00000000003.13 1407 12883 1.522140655
2 ENSG00000000005.5 1 15084 0.000923977
3 ENSG00000000419.11 2280 23689 1.341423203
4 ENSG00000000457.12 1525 44637 0.476159595
5 ENSG00000000460.15 992 192074 0.071981482
6 ENSG00000000938.11 1237 23214 0.742672623
plot(x = mrg.fpkm$fpkm, y = count.fpkm$fpkm)
The result is totally different from what I expected. What's wrong with my calculation or what's wrong in my understanding?
FPKM = (1e+06*Readcount_Gene) / (Gene_Length_in_kb * total_read_depth)
So basically the read count per gene, divided by its length and the total read count. This results in a tiny number. Hence, one multiplies with 1mio to make the number more intuitive to use. Still, FPKM is no longer considered a good choice for RNA-seq. Rather than that TPM seems to be more reasonable (google for Lior Pachters blog together with "TPM" for some details on that).
Some suggestions: First, use a published package for RNA-seq analysis, such as edgeR, DESeq, kallisto-sleuth, etc. These tools will do the calculation for you, including proper statistics and FDR corrections. Second, when plotting data of this kind, use log2-transformed values to avoid excessive axis limits, e.g. log2(readcount+1), +1 as log2 of 0 is undefined.
Thanks for your suggestions, and you were right. I found the relationship between fpkm value from data portal and the I calculated. So from now, based on what I understood, I'm going to learn about TPM and other pre-built packages, as you suggested. Thanks.
Simply logging something may make it look nice but it does not imply that the distribution is reflective of the underlying 'biology' being measured. Please read:
An update (6th October 2018):
You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:
Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units
Video 1: (seriously consider spending 45 minutes to watch this, it's extremely educative)
Reading 1: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
Reading 2: http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
Reading 3: https://liorpachter.wordpress.com/2014/04/30/estimating-number-of-transcripts-from-rna-seq-measurements-and-why-i-believe-in-paywall/
Thank you for information.!