Question

RNA-seq heatmaps: effective length, RPKM vs. CPM

9 months ago

bioinfo2345 ▴ 40

Hi,

I am calculating RPKM/FPKM to make a heatmap of differentially expressed genes and have a few questions. I have of course done the differential expression analysis starting with raw counts. This is only about visualization.

Question 1: Should I normalize by gene length or by effective length? I think I should use effective length, but I cannot articulate why. Why is effective length preferable?

Question 2: I am using the following code to transform raw counts for visualization only:

data.set.RPKM <- rpkm(y, log=TRUE, prior.count=1, gene.length = y$genes$effective_length)

where y is a DGEList object.

Since I have paired-end reads, can I call this (log2) FPKM directly without doing any conversion?

Question 3: I have tried this visualization with CPM as well. Any reason to prefer one over the other?

edgeR RNA-seq • 602 views

ADD COMMENT • link updated 9 months ago by ATpoint 88k • written 9 months ago by bioinfo2345 ▴ 40

Here are Lior Pachter's thoughts on FPKM: https://www.reddit.com/r/bioinformatics/comments/25yopp/lior_pachter_who_invented_the_fpkm_unit_for/

ADD REPLY • link 9 months ago by biofalconch ★ 1.3k

score 2 · Answer 1 · 2024-10-22

A heatmap that aims to emphasize differences between samples is usually converted to per-gene Z-scores first. Since the Z-score is computed across all samples of the same gene, the length correction (a constant factor per gene, so a constant offset on the log scale) is removed by the centering and does not matter. I think any remaining difference will be negligible. I always use CPM, since this is what can also be used with testing frameworks such as limma-trend, so for my standard workflows this ensures consistency.
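To illustrate why the length correction drops out, here is a minimal base-R sketch. The counts and gene lengths are made up for illustration, and it uses plain log2 values without a prior count, so it is a simplified stand-in for what edgeR computes rather than a reproduction of it.

```r
# Toy data (made up for illustration): 3 genes x 4 samples of raw counts.
counts <- matrix(c(10, 20, 40, 80,
                   500, 400, 300, 200,
                   7, 7, 70, 70),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))
gene_length <- c(gene1 = 500, gene2 = 2000, gene3 = 10000)  # bp, hypothetical

lib_size <- colSums(counts)
cpm_mat  <- t(t(counts) / lib_size) * 1e6   # counts per million
rpkm_mat <- cpm_mat / (gene_length / 1000)  # each row divided by its length in kb

# Row-wise Z-score: center and scale each gene across samples.
row_z <- function(m) t(scale(t(m)))

z_cpm  <- row_z(log2(cpm_mat))
z_rpkm <- row_z(log2(rpkm_mat))

# The log-scale values differ, but the per-gene Z-scores agree.
stopifnot(isTRUE(all.equal(z_cpm, z_rpkm, check.attributes = FALSE)))
```

The same would hold with effective lengths in place of gene lengths, as long as a single length per gene is used for all samples.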

That said, since the rpkm function from edgeR also uses the normalization factors (which makes the per-million scaling more robust than that of a naive RPKM, provided calcNormFactors has been run), I think it is just as fine.
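A rough base-R sketch of the difference between a naive RPKM and one scaled by effective library size (library size times normalization factor). The counts, lengths, and the `norm_factors` vector are hypothetical; in a real analysis the factors would come from y$samples$norm.factors after calcNormFactors, and, as far as I know, edgeR's cpm and rpkm apply them by default when given a DGEList.

```r
# Toy counts plus hypothetical normalization factors and gene lengths.
counts <- matrix(c(100, 200,
                   300, 600,
                   600, 200),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
gene_length  <- c(1000, 2000, 4000)  # bp (could equally be effective lengths)
norm_factors <- c(0.8, 1.25)         # stand-in for y$samples$norm.factors

lib_size     <- colSums(counts)          # raw library sizes
eff_lib_size <- lib_size * norm_factors  # effective library sizes

# Naive RPKM: scale by raw library size only.
naive_rpkm  <- t(t(counts) / lib_size) * 1e6 / (gene_length / 1000)
# Normalized RPKM: scale by effective library size instead.
scaled_rpkm <- t(t(counts) / eff_lib_size) * 1e6 / (gene_length / 1000)

# Within each sample, the two versions differ by exactly that sample's factor.
round(naive_rpkm / scaled_rpkm, 3)
```

Because the factor is constant within a sample but differs between samples, it changes the between-sample comparison, which is what a heatmap displays.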