For a given cancer type in the NIH Cancer Genome Atlas (TCGA), I visit the data portal and download the UNC RNASeqV2 level 3 expression data. Specifically, I grab the files that end with the extension *.rsem.genes.normalized_results.
Each file contains one line per gene, with the gene name and (I assume) its normalized FPKM expression value. I am assuming these data are normalized FPKM based on the filename and the UNC RNASeqV2 protocol description hosted on TCGA.
My questions are:
Are these expression data really measured in FPKM?
If they are, how should I convert from FPKM to TPM, for all the expression values for a given gene?
You can't recover TPMs from gene-level FPKMs. The data on transcripts has already been lost.
Updated 5.1 years ago by Ram • written 9.2 years ago by gc
I don't understand your comment. I quickly compared the FPKMs for a given gene and its transcripts and noticed (as one would expect) that the gene-level FPKM is the sum of the FPKMs of all its transcripts. So I conclude that it should not make a difference whether you calculate the TPM from gene-level or transcript-level FPKMs. Here is one example:
Quick question: if I have between-sample normalized FPKMs, do I sum the FPKMs of all the transcripts for a given gene within one sample, or do I sum those together with the same transcripts in the other samples? With, say, three transcripts and two samples, the two approaches give different results.
You need to post this as a new question and refer back to this thread if necessary. Each thread starts with a question followed by answers - new questions should not be posted in the answer section. That's what makes this site better than others. (Moderation: your answer will be moved to a comment)
At the end of this blog post, a simple formula is provided to compute TPM from FPKM:
$$\mathrm{TPM}_i = \frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j} \times 10^6$$
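As a minimal sketch of that formula in Python: the function name `fpkm_to_tpm` and the sample values below are hypothetical, chosen only for illustration. The sum runs over all features within a single sample, so each sample is converted on its own.

```python
def fpkm_to_tpm(fpkm):
    """Convert one sample's FPKM values to TPM.

    fpkm: dict mapping a feature ID (gene or transcript) to its FPKM
    in a single sample. The denominator is the sum over that sample only.
    """
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("FPKM values must sum to a positive number")
    # TPM_i = FPKM_i / sum_j(FPKM_j) * 10^6
    return {name: value / total * 1e6 for name, value in fpkm.items()}


# Hypothetical FPKM values for one sample (not real TCGA data).
sample = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
tpm = fpkm_to_tpm(sample)
print(tpm)
```

Because the conversion is just a rescaling by a per-sample constant, the resulting values sum to one million within the sample and the ratios between features are unchanged.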
edit: well, from the protocol you linked, and also from this wiki, the UNC V2 RNA-Seq Workflow uses MapSplice+RSEM, so I guess measures are already given as TPM - check here and here.
Maybe it's a little bit old, but just for future access...
@h.mon answered your second question.
For your first question ("Are these expression data really measured in FPKM?"):
Following the wiki cited by @h.mon, *.rsem.genes.normalized_results as well as *.rsem.isoforms.normalized_results have measures in normalized_count (upper quartile normalized RSEM count estimates) and not RPKM, FPKM or TPM.
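To make the difference from FPKM or TPM concrete, here is a rough sketch of what upper-quartile normalization of counts looks like. This is not the TCGA pipeline code: the function name, the exclusion of zero counts before taking the 75th percentile, and the rescaling to 1000 are all assumptions made for illustration, and the counts are invented.

```python
import statistics


def upper_quartile_normalize(raw_counts, scale=1000.0):
    """Scale one sample's counts so the upper quartile equals `scale`.

    Assumptions (not confirmed by the thread): zero counts are ignored
    when computing the 75th percentile, and the target value is 1000.
    """
    nonzero = [c for c in raw_counts.values() if c > 0]
    if len(nonzero) < 2:
        raise ValueError("need at least two non-zero counts")
    # Third of the three quartile cut points = 75th percentile.
    q75 = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return {gene: c / q75 * scale for gene, c in raw_counts.items()}


# Hypothetical estimated counts for one sample (not real TCGA data).
raw = {"geneA": 0.0, "geneB": 100.0, "geneC": 200.0,
       "geneD": 300.0, "geneE": 400.0, "geneF": 500.0}
normalized = upper_quartile_normalize(raw)
print(normalized)
```

Note that values normalized this way are not divided by gene length and do not sum to one million per sample, which is why they should not be read as FPKM or TPM.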
Are you sure you need the TPM (Transcripts Per Million) data? If you are fine with the data at the gene level, you should be OK using it as it is.
I'd like the TPM data, if possible.