Hi,
I am currently analysing a set of ~700 genes for which I have RNAseq read counts. I have calculated the differentially expressed genes and their relative rpkm and tpm, but I am left with a question that is more theoretical.
When I calculate the RPKM for each gene in each sample, the values I get are somewhat in the range of what I'm used to see when I do RNASeq data analysis.
> max(rpkm[ , "F3_A_R1"])
[1] 681.6962
> min(rpkm[ , "F3_A_R1"])
[1] 0
When I convert to TPM, the values go overboard.
> max(tpm[ , "F3_A_R1"])
[1] 158783.9
> min(tpm[ , "F3_A_R1"])
[1] 0
This is how I'm calculating RPKM:
# sum counts for each sample, then divide each sum by a million
scalingFactors <- colSums(raw_cts) / 1e6
# divide raw counts by their scaling factor
rpm <- raw_cts / scalingFactors
# divide each rpm by the gene length to get rpkm
rpkm <- rpm / orf_lengths
And this is how I'm converting to TPM:
tpm <- rpkm
for (i in 1:length(colnames(tpm))) {
tpm[ , i] <- ( tpm[ , i] / sum(tpm[ , i]) ) * 1e6
}
Basically, I use the known formulas, as explained for example here.
Why do you think that there is such a difference between TPM and RPKM? My guess is that the limited number of genes makes both RPKM and TPM very unpredictable / unreliable as a measure.
Do you have any suggestion, on how to balance this out?
EDIT #1:
I have tried to add a "fake" gene where to put all the "unassigned" reads. This has reduced the RPKM / TPM values drastically, the max that I observe now is RPKM = 43.4. Of course this approach has the opposite issues, but I tend to consider it more reliable since I am using these RPKMs and TPMs to estimate relative abundance of transcripts within samples, later on.
EDIT #2:
I decided to not use RPKM and TPM. Although these are normalizeed measurements, with such few genes it is unreliable to use them. Instead of solving biases, they introduce some.
I don't think that RPKM or TPM are meant to be calculated for a handful of genes, you need to compute them genome-wide and keep the same values after subsetting genes.
In my workflow, the reads are mapped to certain known genes of a metagenome. While I cannot include other genes because these 700 are all I have, I still have the total number of reads. Would it help, to create a fake gene called "not_assigned" to which I assign all the reads that aren't assigned to any of the 700 genes in my gene set? And then keep it in for the RPKM and TPM calculation
It is not at all clear to me that those normalizations are sound when used on only a subset of genes.
Yep, that is exactly my doubt.