I have edgeR output with log(CPM) values for a comparison, e.g. treatment A vs. control.
I would like to know:
- Can I use this value to select highly expressed genes (for qPCR validation)?
- Can I use this value as a measure of expression in treatment A?
Or should I get a separate normalized count table for treatment A?
From what I understand, the CPM values from edgeR are normalized for read depth and library composition via the TMM method, but not for gene length. Therefore, at the same expression level, longer genes will inherently have higher CPM values than shorter ones, which makes it questionable to select genes purely based on CPM.
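To make the length effect concrete, here is a minimal sketch in plain Python. It computes simple library-size CPM only (no TMM factors, so it is not what edgeR does internally), and the gene names, lengths, and counts are made up for illustration. Both hypothetical genes are assumed to have the same number of transcripts, so their read counts scale with length.

```python
def cpm(counts):
    """Plain counts per million for one sample: count / library size * 1e6.

    Simplification: no TMM normalization factors are applied here.
    """
    library_size = sum(counts.values())
    return {gene: c / library_size * 1e6 for gene, c in counts.items()}


# Hypothetical genes with equal transcript abundance: reads are assumed
# to scale with length (1 read per 100 bp in this toy example).
lengths_bp = {"short_gene": 1000, "long_gene": 4000}
counts = {gene: length // 100 for gene, length in lengths_bp.items()}

cpm_values = cpm(counts)

# The long gene gets a roughly 4x higher CPM although both genes are
# expressed at the same level, purely because it is 4x longer.
ratio = cpm_values["long_gene"] / cpm_values["short_gene"]
print(cpm_values, ratio)
```

So ranking genes by CPM within one sample partly ranks them by length. Comparing the same gene between treatment A and control is not affected, because the length cancels out.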
Instead, as this is not for statistical analysis, maybe use a measure such as TPM or RPKM/FPKM, as these correct for gene length and therefore somewhat compensate for the length/count dependency. I assume this is standard RNA-seq with fragmented RNA and not single-cell or another 3'-end-based RNA-seq method?
Still, why would you focus on highly expressed genes for validation? Wouldn't that introduce a kind of bias, as highly expressed genes should have greater statistical power and (thinking aloud) are more likely to be true positives? Maybe a combination of highly and moderately expressed genes, or randomly chosen but significant genes, would be more informative for assessing the false-positive rate?
Thanks for the reply! Yes, it is standard RNA-seq.
You are right, I guess a random selection is better.
But can I use log(CPM) as a scale? Or is it fine for identifying highly expressed genes in the comparison, and if I want highly expressed genes in treatment A specifically, is it better to have a separate normalized count table?
You can of course use log(CPM+1) (the +1 avoids taking log(0), which is -Inf), but as I said above, it will bias your result towards longer genes, which inherently have higher counts.
It is easy to convert raw counts to TPM given a count table and a list of gene lengths, see here.
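For reference, the conversion can be sketched in a few lines of plain Python. This is a minimal sketch, not a replacement for the linked method: the function name `counts_to_tpm`, the gene names, and the numbers are all hypothetical, and which gene length to use (for example, total exon length or an effective length) is left as an assumption.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts for one sample to TPM.

    Step 1: divide each count by the gene length in kilobases (reads per kb).
    Step 2: rescale so that the values of the sample sum to one million.
    """
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    per_million = sum(rpk.values()) / 1e6
    return {gene: value / per_million for gene, value in rpk.items()}


# Hypothetical treatment A sample (made-up counts and lengths).
counts = {"geneA": 10, "geneB": 40, "geneC": 50}
lengths_bp = {"geneA": 1000, "geneB": 4000, "geneC": 2000}

tpm = counts_to_tpm(counts, lengths_bp)
print(tpm)
```

In this toy sample, geneA and geneB end up with the same TPM even though geneB has four times the raw count, because geneB is also four times longer. If you want a log scale, log2(TPM+1) can be applied to the result in the same way as for CPM.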