Which Expression Units To Use, FPKM Or RPKM?
11.7 years ago
biorepine ★ 1.5k

Dear Biostars! I think this is one of the common problems in RNA-Seq expression analysis: which expression units to use, FPKM or RPKM? People who use Cufflinks end up with FPKM, and people who use ERANGE end up with RPKM. Cufflinks has a nice explanation of why FPKM saves us from the skewed expression values reported by other software, especially with paired-end read data:

They're almost the same thing. RPKM stands for Reads Per Kilobase of transcript per Million mapped reads. FPKM stands for Fragments Per Kilobase of transcript per Million mapped reads. In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. Paired-end RNA-Seq experiments produce two reads per fragment, but that doesn't necessarily mean that both reads will be mappable. For example, the second read is of poor quality. If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value. Thus, FPKM is calculated by counting fragments, not reads.
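To make the quoted definition concrete, here is a minimal Python sketch of the counting difference. It is only an illustration, not how Cufflinks or ERANGE compute their values: the fragment list, the transcript length, and the library totals are all made-up numbers, and `per_kilobase_per_million` is a hypothetical helper name.

```python
def per_kilobase_per_million(count, length_bp, total_mapped):
    """Count per kilobase of transcript per million mapped units.

    Equivalent to count / (length_bp / 1e3) / (total_mapped / 1e6).
    """
    return count * 1e9 / (length_bp * total_mapped)


# Hypothetical paired-end alignments to one transcript:
# (fragment id, mate 1 mapped?, mate 2 mapped?)
fragments = [
    ("f1", True, True),
    ("f2", True, False),   # second mate unmappable, e.g. poor quality
    ("f3", True, True),
    ("f4", False, True),
]

# Counting reads: a fragment contributes 1 or 2 depending on its mates.
read_count = sum(int(m1) + int(m2) for _, m1, m2 in fragments)

# Counting fragments: a fragment contributes 1 if at least one mate mapped.
fragment_count = sum(1 for _, m1, m2 in fragments if m1 or m2)

transcript_length_bp = 2000          # assumed transcript length
total_mapped_reads = 30_000_000      # assumed library total, in reads
total_mapped_fragments = 20_000_000  # assumed library total, in fragments

rpkm = per_kilobase_per_million(read_count, transcript_length_bp, total_mapped_reads)
fpkm = per_kilobase_per_million(fragment_count, transcript_length_bp, total_mapped_fragments)

print(read_count, fragment_count)
print(rpkm, fpkm)
```

Fragments f1 and f3 are counted twice as reads while f2 and f4 are counted once, which is the uneven double-counting the quote describes. With these made-up numbers the two units happen to agree, because the library totals were chosen in the same 3:2 ratio as the counts; real data will not line up so neatly.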

However, after analyzing paired-end, long, polyA+ RNA-Seq datasets from around 10 tissues (mapped with TopHat and Bowtie), I noticed that the same genes that have an FPKM between 0 and 1 have an RPKM of ~200. I think this difference could cause serious problems in defining accurate expression units and in determining the number of expressed, up-regulated, or down-regulated genes.

I would appreciate any answers or comments on using RPKM over FPKM or vice versa. Gracias! :)

fpkm rpkm rna-seq • 116k views

Just to make sure: if I have paired-end reads and one read maps but the other does not, I count that as one fragment? And if both reads map, I also count that as one fragment? (Otherwise I do not understand how we could double-count some fragments when counting raw reads.) Thank you very much for the explanation.


Use neither.

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.


So what should be used?


You could normalise your raw counts using edgeR or DESeq2. If you need to export data for downstream analyses, my preference is always the regularised log or variance-stabilised expression values from DESeq2.
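For intuition only, here is a rough Python sketch of between-sample normalisation of raw counts using median-of-ratios size factors, the general idea behind DESeq-style size factors. It is not a substitute for edgeR or DESeq2, it does not produce regularised-log or variance-stabilised values, and the function name and the toy counts are made up for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts,
            with genes in the same order in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples.
    # Genes with a zero count in any sample are skipped.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor of a sample: median ratio of its counts to the gene-wise geometric means.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy data: sample B was sequenced about twice as deeply as sample A.
raw = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = size_factors(raw)
normalised = {s: [c / factors[s] for c in raw[s]] for s in raw}
print(factors)
print(normalised)
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale, which is the step that RPKM / FPKM alone do not provide.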

11.7 years ago
seidel 11k

I think FPKM is the conceptually cleaner way to go, and thus is the preferred term. The rationale is that one is inferring the expression level of a gene (the concentration of a transcript) based on observations of a fragment from that transcript. Whether the presence of that fragment is quantified from 1 read or 2 reads is simply a technical concern, outside of the unit definition. Granted, you describe a result where software reports different values on the same data set for the two units, but I would argue that's because of messy implementation. A read is evidence of a fragment; 2 paired-end reads are evidence of a fragment. Evidence of a fragment is used to count transcripts. Since both infer fragment counts, I think FPKM is the more general and appropriate term. (That's my opinion, though I'm not sure it helps your particular quandary.)


If this is the case, I think the recent ENCODE paper (http://www.nature.com/nature/journal/v489/n7414/abs/nature11233.html) used RPKM instead of FPKM for their paired-end RNA-Seq data, and I guess the main reason they found that most of the transcriptome is expressed is that they used RPKMs instead of FPKMs. Uffff! What the hell!

11.7 years ago
matted 7.8k

I think there's some confusion in the question and comments here. FPKM are the "fancy" units that Cufflinks uses specifically to report its probabilistic estimates of isoform abundances. They don't map directly from individual reads, though of course they are estimated from the read data. The F instead of R extends the terminology to data from paired-end (and higher-order) reads.

For more on this topic see Meaning Of Fpkm Value Used By Cufflinks and here.

So to me, "should I use FPKM" is more accurately "should I use Cufflinks." RPKM would typically be used by a more "direct" analysis that maps reads to specific single exons and yields an exon-level analysis, rather than a more complicated isoform-level analysis with advanced statistical techniques.

With that said, differences between FPKM and RPKM are most likely due to the complicated procedure that Cufflinks follows to estimate isoform abundance, rather than any paired vs. single counting issue.

Furthermore, I don't think the FPKM vs. RPKM question has any direct bearing on the ENCODE results, as suggested in a comment above.


You're mixing things up here. The quoted Cufflinks explanation in the original question explains very clearly what FPKMs are. This unit is not specific to Cufflinks, and can easily be calculated manually for genes. It is meant to correct a small glitch in the RPKM calculation when using paired-end reads. This is explained in this video of a talk by Lior Pachter (at 34:17).

What is specific to Cufflinks is that it gives FPKM measurements at the transcript level. To do so, it uses a complex methodology to deconvolute the reads mapping to a given gene model into the expression levels of all of its transcripts. FPKM is merely the unit in which the authors chose to report their deconvoluted expression values.

I hope this clarifies things. In summary, "should I use FPKM" is not the same as "should I use Cufflinks".
