Help in Log2 transform CPM count
2
0
Entering edit mode
3 months ago
G.S ▴ 60

Hello,

I have 4 samples and I have tried to convert the count in to log2 but I have some concerns.

I have tried the below code and it works ok

log_count <- log2(countdata + 1)

but I want to normalise the data and consider the difference in library size so I decided to use cpm() function in edgeR package

library(edgeR)
# Create a DGEList object
dge <- DGEList(counts = data)

# Normalize for library size
dge <- calcNormFactors(dge)

# Calculate CPM
cpm_data <- cpm(dge)

# Calculate CPM
cpm_data <- cpm(dge,log=TRUE,prior.count=1)

I have got a negative value in the first 2 groups(highlighted in yellow)

I am not sure how zero converts to -5?is this log2 transformed CPM value correct?

Any clue what the problem is?

Thanks in advance

enter image description here

enter image description here

R edgeR dge Deseq2 • 670 views
ADD COMMENT
1
Entering edit mode

Please use Job only when you're advertising a position. Otherwise, for things that have a specific solution and are not a discussion, use Question.

I've changed your post to a Question now, but please be mindful in the future.

ADD REPLY
1
Entering edit mode
3 months ago
Meisam ▴ 250

I don't have access to your data to reproduce your results, particularly the size factors used for normalization. However, regarding the reason for negative values in log(CPM) versus non-negative values in log(counts), it is important to note that CPM stands for counts 'per million.' This means your counts are divided by a factor of depth/million (independent of size factors), which naturally reduces the log(CPM) values compared to log(counts). In your case, if your log(cpm) ~ -5.1 and log(count) ~0 then I can guess your library depth after size factor normalisation might have been around 2^5.1 ~ 30-40 million.

For a similar case, see https://support.bioconductor.org/p/107719/

ADD COMMENT
0
Entering edit mode

Thanks. Here is my library size :

    group                   lib.size      norm.factors
A549_cells_1         1 35458278    1.0674510
A549_cells_2         1 29380922    1.0484024
A549_cells_3         1 33162651    0.9257543
A549_cells_RBV_1     1 32331807    1.0742032
A549_cells_RBV_2     1 32943146    1.0636958
A549_cells_RBV_3     1 42753921    0.9986421
HRSV_1               1 36926518    0.9647332
HRSV_2               1 37320632    0.9471932
HRSV_3               1 36157090    0.8565001
HRSV_RBV_1           1 28670218    1.0301833
HRSV_RBV_2           1 35463248    1.0435796
HRSV_RBV_3           1 37473681    1.0053129
ADD REPLY
0
Entering edit mode
3 months ago
ATpoint 85k

Negative values arise when taking the log of a value between 0 and 1. I personally dislike this because it (if used for plotting) creates unintuitive results and graphs, so I always set log=FALSE and manually do log2(cpm+1). That is for plotting. For something like PCA or clustering the log=TRUE is fine.

As for your particular data, from the ?cpm documentation details:

If log-values are computed, then a small count, given by prior.count but scaled to be proportional to the library size, is added to x to avoid taking the log of zero.

So in your case it is probably these zero counts that get a very small prior, hence a quite negative value in return after the cpm function.

ADD COMMENT
0
Entering edit mode

I want to use log2 cpm values to plot a heatmap. I just realised this effect the expression pattern of my genes as they seem dow regulated but the actual raw count is zero

ADD REPLY
0
Entering edit mode

When I saw an expression of -5, I cannot say that this gene is down regulated. I need a reference, ie the expression of that gene in another group of samples. Which means that I need to center the gene expression levels to their mean (ie remove the mean) or to the average expression level of the group of reference samples.

ADD REPLY
0
Entering edit mode

For heatmaps one usually does Z-transformation, being row mean subtraction and division by row standard deviation. Still, this is a situation where the log2(cpm+1) to me makes sense.

ADD REPLY
2
Entering edit mode

My point was and will be to counterbalance the classic opinions.

Dividing by standard deviation of the whole row is like the auto mode on a camera. It works fairly well, especially for exploration and an initial grasp of the data. If I know what I want to show in my photo, I will probably tune the exposure. The downside of dividing by standard deviation is that integrates signal and noise. That could be good, that could be not so good. When I have a scientific question and want to compare two groups, the signal is the difference between their average and the noise is estimated from the standard deviation within groups. Hum, it sounds like t-test :-) In that case, in addition to the classical processing, I like to view the data after subtracting the mean of the reference group.

When viewing my data (which is not from RNA-seq, I don't work in that field), I also use log2(x+x_prior). Such a transform, similarly to asinh(x/x_prior), aims to reduce the dispersion of x when x are small. I would like highlight one point about the choice of prior. If library sizes are around 1 million and are as homogeneous as in your case, log2(count+1) is very similar to log2(cpm+1). Of course, cpm is count per million (silly me). If library sizes are still homogenous but bigger (let's say 10 M), the prior of 1 has a stronger effect on cpm than on counts, 1 cpm being a count of 10 sequences. So, I think the cpm_prior of 1 was good choice when library size was about a few millions, especially if I want a reduction of the dispersion near 0 stronger than the resulting from log2(count+1). By the way, currently prior.count=2 in cpm functions. Nowadays library size are about 30-40 millions, and I would try to choose a value lower than 1 as cpm_prior in log2(cpm+cpm_prior), let's say 2/mean_library-size. Alternatively, when I use cpm() and want to increase the reduction of dispersion at low values, I increase the prior.count (currently 2 as default). That being said, when library sizes are not so homogeneous, I prefer to work with log2(cpm+cpm_prior), because I better understand the prior I use. And here, I join ATpoint's preference.

That's all I had to say, and I hope it will make you think a bit beyond the usual good approaches.

ADD REPLY

Login before adding your answer.

Traffic: 1307 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6