How to "guess" the transformation based on already-transformed, "normalized count data"?
21 months ago

Thanks for your attention,

TLDR:

  1. The minimum value in a transformed count matrix is -2.57. How can I guess what transformation was applied?
  2. Any good advice on performing differential gene analysis on such transformed data?

Details:

  1. I would like to analyze RNA-seq data, but the raw data is access-controlled, so I downloaded the processed data from the original paper.
  2. According to the authors: "The R-packages, tximport and edgeR, were used to respectively summarize the expression at gene-level and normalize the data."
  3. The maximum value is around 15, so I suspect the data was log-transformed.
  4. The minimum value is -2.57, which appears 310861 times in the 20453x96 matrix (15.8% of all entries).

FYI:

  1. Here is the paper: https://www.nature.com/articles/s41467-020-18640-0
  2. The cpm() function in edgeR uses log base 2 and a default prior.count of 2 when log=TRUE.
  3. A snapshot of the data: [snapshot image]
edger rnaseq rna rna-seq • 1.2k views

Email the corresponding author of the paper.


I have previously emailed the original author to request the raw data (which they cannot share due to EU regulations), but I would like to refrain from bothering them again unless absolutely necessary out of courtesy. Thank you for your attention and guidance.

21 months ago
Gordon Smyth ★ 7.7k

It appears likely that the values are log2-CPM values produced by edgeR::cpm() with log=TRUE. With the default prior.count of 2, the smallest value that would be returned by that function is approximately log2(2/L), where L is the average normalized library size in millions. It is entirely possible that the average normalized library size for this study would be around 11.9 million, so the smallest log2CPM value would be

> log2( 2 / 11.9 )
[1] -2.57

which is what you have. You would get this minimum value whenever the original count was zero.
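
The same relationship can be run in reverse to see what library size the observed minimum implies. This is a minimal sketch in base R, assuming the authors left prior.count at its default of 2 and used log base 2; the variable names are illustrative only.

```r
# Observed minimum of the transformed matrix (taken from the question)
min_logcpm <- -2.57

# Assumption: cpm(log = TRUE) was called with the default prior.count = 2
prior_count <- 2

# Invert min = log2(prior_count / L) to get L, the average
# normalized library size in millions
implied_lib_size_millions <- prior_count / 2^min_logcpm

round(implied_lib_size_millions, 1)
```

If the implied value were far outside the range of plausible sequencing depths, that would argue against the log2-CPM interpretation or against the assumed prior.count.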

The authors say they produced normalized counts using edgeR. The only functions provided by edgeR for exporting normalized counts are cpm() and rpkm(). The values you show are compatible with cpm but not with rpkm so the conclusion would have to be that they are log2CPM values.

You can perform differential expression analyses of log2CPM values using limma-trend. That won't be exactly the same as performing a differential analysis using the original counts, but still very good. If the library sizes are reasonably consistent, as they probably are for this study, then limma-trend has essentially the same power and FDR performance as a quasi-likelihood analysis in edgeR.
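
A limma-trend analysis of a log2CPM matrix might look roughly like the sketch below. The matrix `logcpm` and the factor `group` are simulated placeholders, not the study's data, and the two-group design is an assumption made for illustration; substitute the downloaded matrix and the real sample annotation.

```r
library(limma)

# Placeholder data: replace with the downloaded log2CPM matrix
# (genes in rows, samples in columns) and the real sample groups.
set.seed(1)
logcpm <- matrix(rnorm(1000 * 6, mean = 5, sd = 2), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))
group <- factor(rep(c("control", "treated"), each = 3))

# Linear model with an intercept and a treated-vs-control coefficient
design <- model.matrix(~ group)
fit <- lmFit(logcpm, design)

# trend = TRUE lets the prior variance depend on average log-expression,
# which is what makes this "limma-trend"
fit <- eBayes(fit, trend = TRUE)

# Top genes for the second coefficient (treated vs control)
top <- topTable(fit, coef = 2, number = 10)
```

For a real study the design matrix would need to reflect the actual experimental factors, such as batch or patient, rather than this simple two-group layout.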


Your answer perfectly solved my problem! I don't have much experience with edgeR, so I didn't consider L. I had previously thought that the authors used a different prior.count added directly to the count, i.e. log2(0 + prior.count) with prior.count = 2^-2.57, but a prior count of 0.168 doesn't seem to make much sense. Your explanation seems much more reasonable. Also, thank you for the life-saving solution!

21 months ago
LChart 4.7k

Assuming a transform of the form:

y = log(a + x/b)

then:

min(y) ~ log(a)

min(y[y>min(y)]) ~ log(a + 1/b)

Unfortunately I think you just have to make assumptions about the base of the log.
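
The two relations above can be solved for a and b once a log base is assumed. Below is a minimal sketch in base R; `guess_transform` is a hypothetical helper written for illustration, and it assumes that the data contain both zero counts and counts of one and that a single scale b applies to all values. If b differs per sample, as with library-size scaling, it would have to be applied to one column at a time.

```r
# Hypothetical helper: recover a and b in y = log_base(a + x / b)
# from the two smallest distinct values of y.
guess_transform <- function(y, base = 2) {
  vals <- sort(unique(as.vector(y)))
  a <- base^vals[1]             # min(y) ~ log(a), i.e. x = 0
  b <- 1 / (base^vals[2] - a)   # next value ~ log(a + 1/b), i.e. x = 1
  c(a = a, b = b)
}

# Toy check with known parameters a = 0.5 and b = 4
x <- c(0, 0, 1, 3, 10, 250)
y <- log2(0.5 + x / 4)
guess_transform(y)
```

On real data, floating-point rounding in the published matrix can make nearly identical values look distinct, so it may be worth rounding y to a few decimal places before taking unique values.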


Gordon Smyth's answer, which is based on edgeR's default log base and prior count, sounds reasonable. Thank you for your time and guidance!
