Question

From TPM to raw counts

1

16 months ago

Gama313 ▴ 130

I am deconvoluting a bulk RNA-seq experiment with CIBERSORTX, using scRNA-seq data to generate a signature of cell types. The program asks for normalized bulk data, so I used TPM. The 'high resolution' function returns (I presume) normalized expression per cell type. To perform differential expression, raw counts are required, so I need to back-transform the data. Given TPM and gene lengths, is it possible to transform TPM back to raw counts?

tpm Deconvolution rnaseq • 2.5k views

ADD COMMENT • link 14 months ago by Gama313 ▴ 130

1

I'm posting this as a comment instead of an answer specifically because it's just what I would do and I don't know if it's the best approach in your case. But, whenever I want to generate data that I can trust, I start from raw reads. It sounds to me like you are presuming a lot of things. The people who generated the data probably had their own goals, biases, and methods; do you really want those to influence your results? If you fully understand what they did, and what you are doing, then it's trivial to redo it yourself unless it's very computationally-intensive, which you haven't mentioned.

ADD REPLY • link 16 months ago by Brian Bushnell 20k

0

Thanks Brian for the suggestion. However, I did the whole process myself, from bulk count generation to data transformation and scRNA-seq deconvolution. The only assumption that I am not 100% sure about is that CIBERSORTX will generate deconvolved data in TPM if it starts from TPM data. As far as I know, CIBERSORTX performs a linear transformation of the normalized data, so in principle it is correct to assume that the cell-type-specific gene expression profiles (GEPs) are in TPM.

ADD REPLY • link 16 months ago by Gama313 ▴ 130

Istvan Albert · Answer 1 · 2023-12-01

3

16 months ago

Istvan Albert 102k

In principle, the TPM formula can be inverted; see the timeless post

What the FPKM? A review of RNA-Seq expression units

In practice, some tools may apply additional corrections and scaling to the data.

As Brian Bushnell mentions, undoing these kinds of transformations without sufficient background information can be sketchy.

From the post linked above, the formulas are:

countToTpm <- function(counts, effLen)
{
    rate <- log(counts) - log(effLen)
    denom <- log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
}

countToFpkm <- function(counts, effLen)
{
    N <- sum(counts)
    exp( log(counts) + log(1e9) - log(effLen) - log(N) )
}

fpkmToTpm <- function(fpkm)
{
    exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}

countToEffCounts <- function(counts, len, effLen)
{
    counts * (len / effLen)
}

################################################################################
# An example
################################################################################
cnts <- c(4250, 3300, 200, 1750, 50, 0)
lens <- c(900, 1020, 2000, 770, 3000, 1777)
countDf <- data.frame(count = cnts, length = lens)

# assume a mean(FLD) = 203.7
countDf$effLength <- countDf$length - 203.7 + 1
countDf$tpm <- with(countDf, countToTpm(count, effLength))
countDf$fpkm <- with(countDf, countToFpkm(count, effLength))
with(countDf, all.equal(tpm, fpkmToTpm(fpkm)))
countDf$effCounts <- with(countDf, countToEffCounts(count, length, effLength))

ADD COMMENT • link 16 months ago by Istvan Albert 102k

0

Thanks for the answer and the link. I used the bioinfokit TPM formula to calculate TPM from bulk counts, which is the same formula given in your link:

A = (counts / lengths) * 1e3
Tpm = (A * 1e6) / A.sum()

Where counts is the sample × gene raw count matrix and A.sum() is rowSums(counts / lengths * 1e3), the length-normalized total count per sample.
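For reference, the forward transformation quoted above can be sketched in NumPy as follows. This is a minimal sketch, not bioinfokit's own code: the function name `counts_to_tpm` is hypothetical, and it assumes the per-kilobase and per-million constants are 1e3 and 1e6 and that the matrix is laid out as samples × genes. The count and length values are borrowed from the R example earlier in the thread.

```python
import numpy as np

def counts_to_tpm(counts, lengths):
    """Convert a samples x genes count matrix to TPM (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Reads per kilobase: divide each gene's counts by its length in kb.
    rate = counts / lengths * 1e3
    # Scale each sample (row) so that its values sum to one million.
    return rate * 1e6 / rate.sum(axis=-1, keepdims=True)

# One sample, six genes (values taken from the R example in the answer).
counts = np.array([[4250, 3300, 200, 1750, 50, 0]])
lengths = np.array([900, 1020, 2000, 770, 3000, 1777])
tpm = counts_to_tpm(counts, lengths)
```

Each row of `tpm` sums to one million by construction, which is why the original sequencing depth cannot be read back from TPM alone.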

In principle, if I know tpm and gene lengths I should be able to reverse the transformation, however I am not good at math and I am struggling.

ADD REPLY • link updated 16 months ago by Istvan Albert 102k • written 16 months ago by Gama313 ▴ 130

0

Now that I thought about it more, you'll need the total number of reads mapped to produce the normalization factor.

You can undo the TPM only if you know how many total aligned reads were used to produce the TPM values. If you knew that, the undo formula would be something like

count = tpm * totalMappedReads / 1e6 * length/1000

I will say that most of the time the totalMappedReads (for each sample) is not reported.
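A round-trip check of this idea can be sketched in Python. The sketch assumes plain TPM with no extra corrections, gene lengths identical to those used in the forward step, and a known per-sample total of reads assigned to the genes in the table; the helper name `tpm_to_counts` is hypothetical. Since TPM times length is proportional to counts, the sketch rescales that product to the known total. The expression given above, by contrast, appears to recover counts only up to a per-sample constant, unless the total happens to equal 1000 × sum(counts / length).

```python
import numpy as np

def tpm_to_counts(tpm, lengths, total_counts):
    """Invert plain TPM for one sample, given its total assigned reads.

    Hypothetical helper: assumes no additional scaling was applied and
    that `lengths` are the same lengths used to compute the TPM values.
    """
    tpm = np.asarray(tpm, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # TPM * length is proportional to the original count of each gene.
    weight = tpm * lengths
    # Rescale so that the recovered counts sum to the known total.
    return total_counts * weight / weight.sum()

# Round trip on the toy values from the R example in the answer.
counts = np.array([4250, 3300, 200, 1750, 50, 0], dtype=float)
lengths = np.array([900, 1020, 2000, 770, 3000, 1777], dtype=float)
rate = counts / lengths
tpm = rate / rate.sum() * 1e6
recovered = tpm_to_counts(tpm, lengths, total_counts=counts.sum())
```

Without `total_counts`, only the relative proportions `weight / weight.sum()` are recoverable.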

ADD REPLY • link 16 months ago by Istvan Albert 102k

0

I will give it a try and I will report results! Thanks for your help!

ADD REPLY • link 16 months ago by Gama313 ▴ 130

0

Sorry for the late response. I've tried your formula on bulk RNA-seq data and it worked. However, it is not applicable to deconvoluted TPM, since the term "totalMappedReads" is not known. At the moment I don't have much time to try alternatives. If I get positive results, I'll update the question. Thanks

ADD REPLY • link 14 months ago by Gama313 ▴ 130