Question

Is there any way to cancel CPM normalization?

19 months ago

JACKY ▴ 160

I am performing a meta-analysis and I am looking for processed RNA-seq data. I would prefer counts data, but if that is not available, then TPM normalized data would be acceptable.

I found a dataset that only has logCPM normalized RNA-seq data published.

My question is, is there a way to convert this data to regular TPM data? Or even better, to counts data?

Thank you.

r TPM normalization • 1.3k views

ADD COMMENT • link updated 18 months ago by swbarnes2 14k • written 19 months ago by JACKY ▴ 160

score 0 · Answer 1 · 2023-04-17

19 months ago

swbarnes2 14k

You can undo the log step easily enough. But it will be pretty much impossible to determine what number all the counts were divided by in the CPM step.

And of course you can't just convert CPM to TPM, because you need the transcript length information as well.

ADD COMMENT • link 19 months ago by swbarnes2 14k
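To illustrate the point above about undoing the log step, here is a minimal sketch. It assumes the published values are `log2(CPM + 1)`; the base and the pseudocount are assumptions and must be checked against the dataset's methods (some tools add a library-size-scaled prior count, which this simple inversion does not undo exactly). The matrix `log_cpm` and the vector `lib_size` are hypothetical. The last step shows why counts cannot be recovered: it needs each sample's library size, which the logCPM table does not contain.

```r
# Hypothetical input: genes x samples matrix of log2(CPM + 1) values.
log_cpm <- matrix(c(0, 1, 3, 2, 0, 4), nrow = 3,
                  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

pseudocount <- 1                 # assumption: values were log2(CPM + 1)
cpm <- 2^log_cpm - pseudocount   # undo the log transform
cpm[cpm < 0] <- 0                # guard against small negative values from rounding

# Counts = CPM * library size / 1e6. The library sizes below are made up:
# they are NOT recoverable from logCPM values alone.
lib_size <- c(s1 = 2e7, s2 = 3e7)
est_counts <- sweep(cpm, 2, lib_size / 1e6, `*`)
```

Without the real library sizes, `est_counts` is only as good as the guess in `lib_size`.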

I have the transcript length of each gene; I can extract them directly from biomart. Given that, and assuming, say, that the number they divided by was 1000, what is the process to convert CPM to TPM?

ADD REPLY • link 19 months ago by JACKY ▴ 160
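For reference, the arithmetic being asked about can be sketched as follows. TPM divides each gene's value by its length and then rescales every sample to sum to one million, so any per-sample constant used in the CPM step cancels out. The function name `cpm_to_tpm` and the example numbers are hypothetical, and the sketch assumes one length per gene, which is exactly the simplification questioned elsewhere in this thread. If the published table omits genes (for example after filtering), the result is rescaled over the remaining genes only.

```r
# Convert a genes x samples CPM matrix to TPM, given one length (bp) per gene.
# Assumes gene_length_bp is in the same order as the rows of cpm.
cpm_to_tpm <- function(cpm, gene_length_bp) {
  stopifnot(nrow(cpm) == length(gene_length_bp))
  rate <- cpm / gene_length_bp              # each row divided by its gene's length
  sweep(rate, 2, colSums(rate), `/`) * 1e6  # rescale each sample to sum to 1e6
}

# Hypothetical example data
cpm <- matrix(c(100, 200, 50, 10, 40, 30), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
gene_length_bp <- c(1000, 2000, 500)

tpm <- cpm_to_tpm(cpm, gene_length_bp)

# Multiplying all values of a sample by a constant leaves TPM unchanged
tpm_scaled <- cpm_to_tpm(cpm * 5, gene_length_bp)
```

Here `tpm` and `tpm_scaled` are equal, which is why the unknown divisor from the CPM step does not matter for this particular conversion, whereas the choice of gene length does.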

Genes have multiple transcripts, and therefore multiple transcript lengths. Do you have the abundances of individual transcripts?

ADD REPLY • link 19 months ago by swbarnes2 14k

Hmm, I'm not sure what you mean, but this is my code:

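library(biomaRt)

# Return the canonical transcript length for each Ensembl gene ID in rownames(counts).
# Note: the original default `counts = counts` is self-referential and fails when
# the function is called without an argument, so it has been removed.
GetGeneLength <- function(counts){
  ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
  # Transcript and CDS lengths for every transcript of the requested genes
  genelength =  getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id', 'transcript_length','cds_length'), filters =  'ensembl_gene_id', values = rownames(counts), mart = ensembl, useCache = FALSE)
  # Canonical-transcript flag (NA for non-canonical transcripts)
  gene_canonical_transcript =  getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id','transcript_is_canonical'), filters =  'ensembl_gene_id', values = rownames(counts), mart = ensembl, useCache = FALSE)
  gene_canonical_transcript_subset = gene_canonical_transcript[!is.na(gene_canonical_transcript$transcript_is_canonical),]
  # Keep only the canonical transcript of each gene
  genelength = merge(gene_canonical_transcript_subset, genelength, by = c("ensembl_gene_id", "ensembl_transcript_id"))
  return(genelength)
}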

This code returns a single value (the length of the canonical transcript) for each gene found in the counts matrix.

ADD REPLY • link 18 months ago by JACKY ▴ 160

Yes, it will give you one length, the length of the canonical transcript. But is it right to use only that length when isoforms of other lengths are present?

I just don't see the point of doing a math manipulation when you know the number isn't real. It just distorts the true experimental values.

ADD REPLY • link 18 months ago by swbarnes2 14k
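To make the isoform concern concrete, here is a small sketch with made-up numbers. When transcript-level abundances are available, a gene's length can be taken as the abundance-weighted mean of its transcript lengths, and that value can differ substantially from the canonical transcript's length. The data frame `tx` and its column names are hypothetical; a logCPM-only dataset does not contain the transcript abundances this calculation needs.

```r
# Hypothetical transcript-level table: gene, transcript length (bp), abundance (TPM)
tx <- data.frame(
  gene      = c("g1", "g1", "g2"),
  length    = c(1000, 3000, 1500),
  tpm       = c(30, 10, 5),
  canonical = c(TRUE, FALSE, TRUE)
)

# Abundance-weighted mean transcript length per gene
weighted_len <- sapply(split(tx, tx$gene),
                       function(d) sum(d$length * d$tpm) / sum(d$tpm))

# Length of the canonical transcript per gene, for comparison
canonical_len <- with(tx[tx$canonical, ], setNames(length, gene))
```

For `g1`, the weighted length differs from the canonical length because a quarter of the gene's abundance comes from the longer isoform; using the canonical length alone would shift the length-normalized value for that gene.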