Hi all,
I want to use the TCGA RNAseqV2 RSEM data to calculate the RPKM value for each gene.
I was planning to compute: RSEM / sum(all RSEM) * 1 million
Does anyone know the right method to calculate RPKM?
As far as I know, there is no way to go from RSEM to RPKM. Is there a specific reason you prefer RPKM over RSEM? RSEM should give expression estimates that are as good as or better than RPKM. From the RSEM paper:
The second measure of abundance is the estimated fraction of transcripts made up by a given isoform or gene. This measure can be used directly as a value between zero and one or can be multiplied by 10^6 to obtain a measure in terms of transcripts per million (TPM). The transcript fraction measure is preferred over the popular RPKM [18] and FPKM [6] measures because it is independent of the mean expressed transcript length and is thus more comparable across samples and species [7].
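To make the scaling in the quote concrete, here is a minimal sketch, assuming one sample's per-gene values are held in a dict. The function name, gene names, and numbers are made up for illustration. Note that if the input is estimated counts rather than transcript fractions, the same arithmetic gives a counts-per-million style value, not RPKM, because no length term is involved.

```python
def scale_to_per_million(values):
    """Rescale non-negative per-gene values so that they sum to one million."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return {gene: v / total * 1e6 for gene, v in values.items()}


# Hypothetical transcript fractions for three genes in one sample.
fractions = {"GENE_A": 0.5, "GENE_B": 0.3, "GENE_C": 0.2}
tpm = scale_to_per_million(fractions)
```

Dividing by the sample total also means the function behaves the same whether the fractions already sum to one or not.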
Thanks for sharing.
In practice, when we look for differentially expressed genes, whether or not we account for gene length will not influence the result.
If we calculate the Pearson correlation, however, gene length will introduce some bias into the results. Even though the Spearman correlation should be much better, accounting for gene length in the expression values should be more robust when we compare genes across samples or species.
RSEM already corrects for transcript length, so no additional correction should be necessary.
I don't think so. If you can offer a reference, that would be of great help.
http://www.biomedcentral.com/1471-2105/12/323
They normalize by the "effective length" which is somewhat different from transcript length - this is described in their methods section. It may be useful to plot the RSEM values against the transcript length to see if the two are independent as claimed.
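As a numeric companion to the plot, one could compute a rank correlation between the expression values and the transcript lengths. The sketch below is pure Python, and all names and numbers in it are hypothetical. It is a rough check under those assumptions, not the method used in the RSEM paper.

```python
def ranks(xs):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical transcript lengths (bp) and expression values for four genes.
lengths = [500, 1200, 3000, 8000]
expression = [10.0, 20.0, 30.0, 40.0]
rho = spearman(lengths, expression)
```

A value of rho near zero across many genes would be consistent with the independence claim; a value far from zero would suggest some length dependence remains.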
Do I need to convert the RSEM values to log2(RSEM) before calculating the differentially expressed genes with t.test?
I haven't tried using RSEM values when finding DE genes, so I can't comment on whether the values should be log transformed. (FPKM values typically look more Gaussian after log transformation, so my gut reaction would be to log transform, but this should be tested.)
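For anyone who wants to try the log-transformed comparison, here is a minimal sketch for a single gene. The pseudocount of 1 (to avoid taking the log of zero), the function names, and the sample values are all assumptions for illustration. It computes Welch's t statistic only.

```python
import math
from statistics import mean, variance


def log2_transform(values, pseudocount=1.0):
    """Return log2(value + pseudocount) for each value."""
    return [math.log2(v + pseudocount) for v in values]


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical expression values for one gene in two groups of samples.
tumor = [3.0, 7.0, 15.0]
normal = [0.0, 1.0, 3.0]

t_stat = welch_t(log2_transform(tumor), log2_transform(normal))
```

Turning the statistic into a p-value requires the t distribution and the Welch degrees of freedom, which a statistics package would normally handle. Running the same comparison with and without the transform is one way to test whether it matters for your data.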
The authors mention that their output includes two values: (1) an estimate of the number of fragments derived from a given isoform or gene (similar to read counts) and (2) the estimated fraction of transcripts made up by a given isoform or gene. They mention that the first value can be used as input to edgeR or DESeq to determine DE genes. If both of these values are available from TCGA, I would suggest using one of these methods over a simple t-test.