FPKM vs raw counts vs RPKM
5
15
9.9 years ago
NHEJ ▴ 360

Could someone please explain to me (in as many layman's terms as possible for someone new to the RNA-seq field) the fundamental differences between FPKM, counts, and RPKM? I have heard from some bioinformatics colleagues that raw counts (DESeq) are becoming more popular than FPKMs (Cufflinks) to analyze transcriptomic data, but I am not sure why (or whether this is always true) other than I heard that FPKMs may "over-normalize" depending on the experiment. Much of the available published literature on these topics is a bit specialized, so I was wondering if someone could "bring it down to Earth" so to speak on how to understand the differences, pros/cons, and (if possible) special use-cases of when one approach is better to use than another?

counts rpkm raw fpkm RNA-Seq • 45k views
ADD COMMENT
0

This workflow helped me a lot in getting familiar with RNA-seq data analysis. It imports your raw counts and then you can analyze them using different packages.

http://www.bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/RNA-seqWorkflow.pdf

ADD REPLY
8
9.9 years ago
iraun 6.2k

TopHat aligns the reads to the reference genome and classifies them according to whether they aligned with or without splice junctions:

  • With splice junctions --> anything that jumps regions must span an intron.
  • Without --> anything that maps unspliced must be an exon.
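As a rough illustration of that split (this is not TopHat's actual code), spliced alignments can be recognized in SAM output by an `N` operation in the CIGAR string, which marks a skipped stretch of the reference such as an intron. The function name and the example CIGAR strings below are made up for illustration.

```python
import re

# A CIGAR string is a series of <length><operation> pairs (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def is_spliced(cigar):
    """Return True if the alignment skips a reference region (CIGAR 'N')."""
    return any(op == "N" for _, op in CIGAR_OP.findall(cigar))

# Hypothetical alignments: one contiguous, one jumping a 1200 bp gap.
print(is_spliced("76M"))          # False
print(is_spliced("40M1200N36M"))  # True
```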

Then Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. Each gene contains one or more transcripts and each transcript has multiple exons, but the transcripts within a given gene share exons, and that's why the reads map probabilistically (does not report read counts). FPKM are the "fancy" units that Cufflinks uses specifically to report its probabilistic estimates of isoform abundances.

FPKM vs RPKM: using "F" in place of "R" only unifies the terminology; they switched from "Reads" to "Fragments" to clear up confusion regarding paired-end reads.

ADD COMMENT
1

+1 for down to Earth answer. But I don't understand your logic when you say "and that's why the reads map probabilistically (does not report read counts)." Could you please expand on this? Are you saying that FPKM is determined probabilistically by some sort of algorithm that estimates which exons will be in which transcripts of, say, gene X? I mean, what if you don't know ahead of time what transcripts a gene makes, not to mention what combinatorial assembly of exons is in each of these transcripts? How can a program predict such vast complexity? It seems like a shot in the dark, especially at a large scale like the genome. Thanks in advance for your insights!

ADD REPLY
0

I'm also a bit confused by your explanation of splice junctions... You're saying that without splice junctions means anything that maps unspliced must be an exon. But what if the read is from a noncoding part of the genome (this would not map to an exon but could potentially span a splice junction)? Perhaps I am misusing the term noncoding here...

ADD REPLY
1

I'll try to explain more clearly (sorry, it is quite difficult).

Suppose we have one exon that is shared between 3 transcripts of the same gene. Since you cannot know which of the three transcripts a read on that exon came from, Cufflinks cannot simply report counts for a transcript. Instead, it reports an estimate of the transcript abundances. If you are looking at transcript FPKMs and the gene in question has alternative transcripts, one of the isoforms could get a zero estimate while another (or several others) would get the reads assigned to it/them.
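To make the shared-exon idea concrete, here is a toy expectation-maximization sketch in Python. It is not Cufflinks' actual model: it assumes all transcripts are equally long, ignores fragment bias, and uses made-up transcript names and read numbers. It only shows how ambiguous reads can be split in proportion to the evidence from unambiguous ones.

```python
def em_abundances(read_classes, transcripts, iterations=200):
    """Estimate relative transcript abundances from ambiguous reads.

    read_classes maps a tuple of compatible transcripts to a read count.
    Simplifying assumption: all transcripts have the same length.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    total = sum(read_classes.values())
    for _ in range(iterations):
        assigned = {t: 0.0 for t in transcripts}
        # E-step: split each read class among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compatible, count in read_classes.items():
            weight = sum(theta[t] for t in compatible)
            for t in compatible:
                assigned[t] += count * theta[t] / weight
        # M-step: new abundance = fraction of all reads assigned to it.
        theta = {t: assigned[t] / total for t in transcripts}
    return theta

# Hypothetical gene with three isoforms sharing one exon.
reads = {
    ("A", "B", "C"): 300,  # reads on the shared exon
    ("A",): 100,           # reads on a region unique to A
    ("B",): 20,            # reads on a region unique to B
}
estimate = em_abundances(reads, ["A", "B", "C"])
for name, value in estimate.items():
    print(name, round(value, 3))
```

In this toy setup, transcript C has no reads of its own, so its estimate is driven toward zero and the shared-exon reads end up divided between A and B roughly 5:1, following their unique reads. That is one way an isoform can end up with an estimate of zero.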

In RNA-Seq experiments, cDNA fragments are sequenced and mapped back to genes and, ideally, individual transcripts. Properly normalized, the RNA-Seq fragment counts can be used as a measure of relative abundance of transcripts, and Cufflinks measures transcript abundances in Fragments Per Kilobase of exon per Million fragments mapped (FPKM), which is analogous to single-read "RPKM".
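Spelled out as arithmetic, the unit is just the fragment count divided by the transcript length in kilobases and by the library size in millions. A minimal sketch with made-up numbers (the function name is only for illustration; for single-end data the same formula with reads gives RPKM):

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    kilobases = length_bp / 1e3
    millions = total_mapped_fragments / 1e6
    return fragments / (kilobases * millions)

# Hypothetical: 500 fragments on a 2,000 bp transcript,
# in a library with 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```

With paired-end data, both mates of a pair come from one fragment and are counted once, which is the reason for the switch from "reads" to "fragments".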

ADD REPLY
0

Regarding the second question, what do you mean by "noncoding part of genome"?

ADD REPLY
0

By noncoding, I mean in the intronic portions of the genome, such as those that may produce lncRNAs.

ADD REPLY
0

Could you please explain how certain transcripts can have FPKM of 0.0000? How does this assignment happen and how does Cufflinks calculate this estimate?

ADD REPLY
0

It appears to me that Cufflinks does report transcript counts. The ".count_tracking" outputs of Cuffdiff report estimated transcript counts. From the output description:

Cuffdiff estimates the number of fragments that originated from each transcript, primary transcript, and gene in each sample. Primary transcript and gene counts are computed by summing the counts of transcripts in each primary transcript group or gene group. The results are output in count tracking files in the format described

ADD REPLY
5
9.9 years ago

RPKM/FPKM are normalised counts. DESeq/edgeR require raw counts as input, as they have their own normalisation methods.

DESeq/edgeR are better for exon/gene expression analysis. Cufflinks is for differential isoform analysis. If you just care about differential genes, go for htseq-count --> edgeR/DESeq. If you are interested in isoform-level analysis, go for programs such as the Cufflinks/Cuffdiff packages.
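As an example of what "their own normalisation" can look like, below is a plain-Python sketch of the median-of-ratios idea behind DESeq's size factors. It is a simplified re-implementation for illustration, not the package's code, and the gene names and counts are invented.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: dict of gene -> list of raw counts, one entry per sample."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) == 0:
            continue  # skip genes with a zero count (no usable geometric mean)
        log_geo_mean = sum(math.log(k) for k in row) / len(row)
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_geo_mean)
    # Size factor = median ratio of a sample's counts to the per-gene
    # geometric mean across samples.
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical data: sample 2 was sequenced twice as deeply,
# and gene4 is strongly differentially expressed.
raw = {
    "gene1": [10, 20],
    "gene2": [100, 200],
    "gene3": [50, 100],
    "gene4": [5, 500],
}
sf = size_factors(raw)
print([round(s, 3) for s in sf])
```

Because the median is used, the one strongly changed gene does not distort the estimate: the second sample still gets a size factor twice that of the first. Dividing raw counts by these factors puts samples on a common scale, which is why the tools want unnormalised counts as input.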

ADD COMMENT
3
9.9 years ago

I think the question has pretty much been answered, but I thought it might also be nice to throw in a link to this blog post that I think provides a nice summary:

https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

ADD COMMENT
1

Note the line in the piece to which you link:

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

That is, it is not recommended to use RPKM / FPKM for cross-sample differential expression. This is also highlighted in the paper linked by Daniel in his answer.
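A small made-up example of why a relative measurement causes trouble across samples: if one gene takes up a much larger share of the reads in one sample, every other gene's RPKM drops even though nothing about those genes changed. All gene names and counts below are hypothetical.

```python
def rpkm(count, length_bp, total_mapped):
    """Reads per kilobase per million mapped reads."""
    return count * 1e9 / (length_bp * total_mapped)

lengths = {"A": 1000, "B": 1000, "C": 1000}
# Both samples are sequenced to the same depth (300 reads).
# Only gene C truly changes (10x up in sample 2); A and B are unchanged,
# but C now soaks up most of the reads.
sample1 = {"A": 100, "B": 100, "C": 100}
sample2 = {"A": 25, "B": 25, "C": 250}

fold = {}
for gene in lengths:
    r1 = rpkm(sample1[gene], lengths[gene], sum(sample1.values()))
    r2 = rpkm(sample2[gene], lengths[gene], sum(sample2.values()))
    fold[gene] = r2 / r1
    print(gene, round(fold[gene], 2))
```

Here A and B appear to drop four-fold and C appears only 2.5-fold up, although the underlying change was a 10-fold increase in C alone. Between-sample normalisation methods are meant to correct for this kind of composition effect.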

ADD REPLY
3
8.6 years ago
Daniel ★ 4.0k

I know this is an old question, but I was recently reading the paper "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis" which I highly recommend. It includes these two graphs, which I think summarise the issues around different normalisations.

Note: TMM is the method used in edgeR

[Figure: RNAseq normalisation methods]

[Figure: False Positive Rate]

ADD COMMENT
0

I thought the paper brought up some interesting points, but I would say these plots are a little confusing for a few reasons:

1) The first figure you show above (Figure 1A in the paper) is for the mouse data, which Table 1 says is miRNA-Seq data. If a small-RNA protocol is being used, then I wouldn't expect to use RPKM values over counts-per-million (equivalent to total-count, TC, above).

I'm guessing they did this because the scale of values will otherwise be different for RPKM, as shown in Figure S1:

http://bib.oxfordjournals.org/content/suppl/2012/10/02/bbs046.DC1/SuppFig1_Boxplot_log2count.jpg

This is partially a good thing, if you want to know the difference between a short gene with a lot of reads and a long gene with a medium level of coverage. Also, the distributions for the human (and non-human, but still RNA-Seq) RPKM values (where you would expect to see RPKM expression values) are more consistent than the miRNA-Seq RPKM values (where the target gene sizes are roughly similar).

2) The second figure shown above (Figure 2A in the paper) is for simulated data, not the real datasets used previously. While I understand that it would be hard to estimate the false positive rate from those datasets, Table 2 indicates decreased power for the RPKM, RawCount, and TC normalizations, but the gene overlap was always pretty good. This makes sense to me, and it would indicate a decrease in power but not a decrease in false positive rate (although that is the opposite of what the simulated data shows in Figure 2). At least in the DESeq portion of the table, this was also true for the human data in Table S6 (although it brings into question what is due to the normalization versus how the p-value is calculated).

3) The simulated datasets (second figure above) vary the proportion of differentially expressed genes from 0-30%. I would typically want to identify a few hundred differentially expressed genes (so, <5%, in a human or mouse RNA-Seq dataset), so I would probably only pay attention to the first few bars.

4) In absolute terms, all estimated false positive rates for the simulated datasets were less than 0.25, which is not that bad. However, if the first value is the 0% differentially expressed gene group (assuming the bars represent 5% increases in differentially expressed genes), then I don't see how it can have a 5% false positive rate.

ADD REPLY
