CPM read count normalization: what does it mean between replicates of same group and within same replicate?
5.9 years ago
salamandra ▴ 550

Hi,

1- In the table 'Normalization method' here, it says that CPM (counts per million) can be used for gene count comparisons between replicates of the same sample group.

1.1 Does that mean we can, for example, compare a gene in one 'control' sample with the same gene in another 'control' sample, but cannot compare a gene in a 'control' sample with the same gene in a 'treatment' sample?

1.2 If so, when looking at a heatmap of CPM values, can we not identify, for example, genes that seem to have higher expression in 'treatment' samples than in 'control' samples? Do we need to use a different normalization method?

2- The same table says that CPM cannot be used for within-sample comparisons.

2.1 Does that mean we cannot compare different genes within the same sample?

2.2 What if, in a CPM heatmap, one gene seems to vary more between 'control' and 'treatment' than another? Can we draw that conclusion if the heatmap plots CPM values?

RNA-Seq Read count normalization • 17k views
5.9 years ago
ATpoint 85k

As recommended in this presentation, I would not use per-million methods for anything, as there are better methods now. Check this video to get an idea of why per-million-based methods are not optimal, and this one on how the normalization in e.g. DESeq2 works.

Towards your questions:

1 - You can use it, but it is not recommended for DE analysis, so it is better not to use it at all.

1.1 - Simply normalize the entire dataset with edgeR or DESeq2 and do comparisons with these values
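To make the difference concrete, here is a minimal sketch in plain Python of a median-of-ratios size factor estimate next to a plain CPM calculation. It is not the DESeq2 or edgeR code, only an illustration of the idea; the function names `cpm` and `size_factors` and the toy counts are made up for this example.

```python
import math
from statistics import median


def cpm(counts):
    """Counts per million: scale each sample by its total read count."""
    n_samples = len(counts[0])
    lib_sizes = [sum(gene[j] for gene in counts) for j in range(n_samples)]
    return [[c / lib_sizes[j] * 1e6 for j, c in enumerate(gene)] for gene in counts]


def size_factors(counts):
    """Median-of-ratios size factors (sketch of the idea, not the library code)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue  # skip genes with a zero count: the log is undefined
        # log of the geometric mean of this gene across samples
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # per sample: median ratio to the per-gene geometric mean
    return [math.exp(median(r)) for r in log_ratios]


# Toy data (rows = genes, columns = samples A and B).
# Genes 1-3 are unchanged; gene 4 is 10x higher in sample B.
counts = [[100, 100], [200, 200], [300, 300], [400, 4000]]

cpm_values = cpm(counts)
factors = size_factors(counts)
normalized = [[c / factors[j] for j, c in enumerate(gene)] for gene in counts]

print("CPM gene 1:", cpm_values[0])
print("size factors:", factors)
print("size-factor normalized gene 1:", normalized[0])
```

In this toy case the one strongly changed gene inflates the library size of sample B, so CPM makes the unchanged genes look lower in B, while the median-based factors are not pulled by that single gene.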

1.2 - Do not use CPM values for a heatmap; use logged/normalized counts, like those produced by the vst or rlog functions in DESeq2. Using non-log counts will bias the heatmap towards highly expressed genes. The video series I linked above also has a video about logs in case you care.
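A small sketch of why the log matters, in plain Python. It uses a simple log2(x + 1) transform as a stand-in, which is not what vst or rlog compute; the gene names and values are invented for illustration.

```python
import math


def log2p1(values):
    """Simple log2(x + 1) transform (a stand-in, not vst or rlog)."""
    return [math.log2(v + 1) for v in values]


# Hypothetical normalized counts: [control, treatment]
normalized = {
    "geneA": [10.0, 40.0],        # low expression, 4-fold change
    "geneB": [10000.0, 12000.0],  # high expression, 1.2-fold change
}

raw_diff = {}
log_diff = {}
for name, (ctrl, trt) in normalized.items():
    raw_diff[name] = trt - ctrl
    log_ctrl, log_trt = log2p1([ctrl, trt])
    log_diff[name] = log_trt - log_ctrl
    print(name, "raw difference:", raw_diff[name],
          "log2 difference:", round(log_diff[name], 2))
```

On the raw scale the highly expressed gene has by far the larger difference and would dominate the color range, although its fold change is small. On the log scale the 4-fold change of the lowly expressed gene stands out instead.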

2 - True, because CPM does not normalize for gene length, so at the same expression level longer genes inherently have higher counts than shorter genes.

2.1 - one probably could, but not without adjusting for gene length (use the search function on this, there are plenty of posts on that matter already out there).
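As a rough illustration of the gene length effect, the sketch below compares CPM with one common length-adjusted measure, TPM (transcripts per million), for a single sample. The counts and lengths are invented, and the function names are only for this example.

```python
def cpm(counts):
    """Counts per million for one sample (no length adjustment)."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def tpm(counts, lengths_kb):
    """Transcripts per million: divide by gene length first, then scale."""
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Hypothetical sample: two genes with the same number of transcripts,
# but the second gene is 4x longer and therefore collects 4x the reads.
counts = [100, 400]
lengths_kb = [1.0, 4.0]

cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths_kb)

print("CPM:", cpm_values)
print("TPM:", tpm_values)
```

By CPM the longer gene looks four times more expressed; after dividing by length the two genes come out equal. This only addresses within-sample comparison of genes and does not replace proper between-sample normalization.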

2.2 - it might give you an idea but you should use appropriate statistics to infer differentially expressed genes.

Thank you for the answer and the video on DESeq2 normalization
