Is there any way to undo CPM normalization?
19 months ago
JACKY ▴ 160

I am performing a meta-analysis and I am looking for processed RNA-seq data. I would prefer counts data, but if that is not available, then TPM normalized data would be acceptable.

I found a dataset that only has logCPM normalized RNA-seq data published.

My question is, is there a way to convert this data to regular TPM data? Or even better, to counts data?

Thank you.

r TPM normalization • 1.3k views
19 months ago

You can undo the log step easily enough. But it will be pretty much impossible to determine the library size that each sample's counts were divided by in the CPM step, and without that number you cannot get back to counts.
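
A minimal sketch of that point, assuming the published values are plain log2(CPM + prior) with a known constant prior. That is an assumption: tools such as edgeR's `cpm(log = TRUE)` scale the prior by library size, so the reversal is only approximate there. The function names `undo_log_cpm` and `cpm_to_counts` and the toy numbers are hypothetical.

```r
# Reverse a log2(CPM + prior) transform. `prior` is an assumption; check
# the dataset's methods section for the value actually used.
undo_log_cpm <- function(log_cpm, prior = 1) {
  2^log_cpm - prior
}

# CPM = counts / library_size * 1e6, so counts can only be recovered
# when the library size of each sample (column) is known.
cpm_to_counts <- function(cpm, lib_sizes) {
  stopifnot(ncol(cpm) == length(lib_sizes))
  sweep(cpm, 2, lib_sizes / 1e6, `*`)
}

# Toy example: 3 genes, 2 samples with made-up counts.
counts <- matrix(c(10, 200, 790, 50, 50, 400), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
lib_sizes <- colSums(counts)
log_cpm <- log2(sweep(counts, 2, lib_sizes / 1e6, `/`) + 1)

cpm <- undo_log_cpm(log_cpm, prior = 1)        # log step undone
recovered <- cpm_to_counts(cpm, lib_sizes)     # needs lib_sizes
```

The second step is where a published logCPM matrix gets stuck: `lib_sizes` is usually not reported, and guessing it changes every recovered count by the same unknown factor per sample.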

And of course you can't just convert CPM to TPM, because you need the transcript length information as well.
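
For illustration only, here is a sketch of the arithmetic, assuming one length per gene is available and the CPM matrix still contains every gene that was quantified. TPM divides each value by the gene length and then rescales each sample to sum to one million, so any per-sample constant used in the CPM step cancels out. The function name `cpm_to_tpm` and the toy values are hypothetical.

```r
# Convert a genes x samples CPM matrix to TPM using one length per gene.
# `lengths_bp` must be in the same order as the rows of `cpm`.
cpm_to_tpm <- function(cpm, lengths_bp) {
  stopifnot(nrow(cpm) == length(lengths_bp), all(lengths_bp > 0))
  # Per-base rate: the length vector recycles down each column,
  # so row i is divided by lengths_bp[i].
  rate <- cpm / lengths_bp
  # Rescale every sample (column) to sum to one million.
  sweep(rate, 2, colSums(rate), `/`) * 1e6
}

# Toy example with made-up CPM values and lengths.
cpm <- matrix(c(1e5, 4e5, 5e5, 2e5, 2e5, 6e5), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
lengths_bp <- c(g1 = 1000, g2 = 2000, g3 = 5000)
tpm <- cpm_to_tpm(cpm, lengths_bp)
```

If the published matrix was filtered before release, the column sums run over the remaining genes only, so the result is a TPM-like value relative to that subset rather than a true TPM.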


I have the transcript length of each gene; I can extract them directly from BioMart. Given that, and assuming, let's say, that the number they divided by was 1000, what is the process for converting CPM to TPM?


Genes have multiple transcripts, and therefore multiple transcript lengths. Do you have the abundances of individual transcripts?
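
To make the question concrete: if per-transcript abundances were available, one common choice is an abundance-weighted mean length per gene, computed separately for each sample. The sketch below assumes a data frame with hypothetical columns `gene_id`, `length`, and `abundance`; the function name `effective_gene_length` and the numbers are made up for illustration.

```r
# Abundance-weighted mean transcript length per gene, for one sample.
# tx: data frame with columns gene_id, length (bp), abundance (e.g. TPM).
effective_gene_length <- function(tx) {
  parts <- split(tx, tx$gene_id)
  vapply(parts, function(g) {
    if (sum(g$abundance) == 0) {
      mean(g$length)  # unexpressed gene: fall back to the plain mean
    } else {
      sum(g$abundance * g$length) / sum(g$abundance)
    }
  }, numeric(1))
}

# Toy example: geneA has two isoforms of different lengths.
tx <- data.frame(
  gene_id   = c("geneA", "geneA", "geneB"),
  length    = c(1000, 3000, 2000),
  abundance = c(30, 10, 5)
)
eff_len <- effective_gene_length(tx)
```

Here geneA gets a length between its two isoforms, pulled toward the more abundant one. Without transcript-level abundances this weighting cannot be computed, and any single fixed length is an approximation.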


Hmm, I'm not sure what you mean, but this is my code:

library(biomaRt)

# Return the canonical transcript (with its transcript and CDS length) for each
# Ensembl gene ID found in rownames(counts).
GetGeneLength <- function(counts) {

  ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
  # Lengths of every transcript of every gene.
  genelength <- getBM(attributes = c('ensembl_gene_id', 'ensembl_transcript_id', 'transcript_length', 'cds_length'), filters = 'ensembl_gene_id', values = rownames(counts), mart = ensembl, useCache = FALSE)
  # Canonical flag per transcript; it is NA for non-canonical transcripts.
  gene_canonical_transcript <- getBM(attributes = c('ensembl_gene_id', 'ensembl_transcript_id', 'transcript_is_canonical'), filters = 'ensembl_gene_id', values = rownames(counts), mart = ensembl, useCache = FALSE)
  gene_canonical_transcript_subset <- gene_canonical_transcript[!is.na(gene_canonical_transcript$transcript_is_canonical), ]
  # Keep only the lengths of the canonical transcripts.
  genelength <- merge(gene_canonical_transcript_subset, genelength, by = c("ensembl_gene_id", "ensembl_transcript_id"))
  return(genelength)

}

This code returns a single value (the length of the canonical transcript) for each gene found in the counts matrix.


Yes, it will give you one length, the length of the canonical transcript. But is it right to use only that when you have isoforms present of other lengths?

I just don't see the point of doing a math manipulation when you know the number isn't real. It just distorts the true experimental values.
