Question

Convert FPKM to TPM in R

2

2.6 years ago

AlexStar ▴ 170

I'm conducting a meta-analysis over several datasets. I want to combine those datasets and run some machine learning algorithms to predict a target response. Some of those datasets are raw counts, which I can easily convert to TPM with the following code:

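# counts_data: genes x samples matrix of raw counts
# genelength:  gene lengths in bp, in the same order as the rows of counts_data
rpkm <- apply(X = counts_data,
              MARGIN = 2,
              FUN = function(x) {
                10^9 * x / genelength / sum(as.numeric(x))
              })

# Rescale each sample so that its values sum to one million
TPM <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()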

And some datasets provide RPKM data, which I can also convert to TPM like this:

TPM <- apply(RPKM, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()

Some datasets, however, only provide FPKM data. This is problematic: I need all datasets to be TPM normalized, and I'm not familiar with converting FPKM to TPM.

Is it possible to convert FPKM values to TPM? I found this approach: TPM = FPKM * X, where X = 1e6 / [sum of all FPKM of a sample].

I'm not sure whether this is valid, and I don't want to use it and get misleading results. What do you guys think? If I can use it, what is the code in R?

Note: the datasets that provide RPKM or FPKM have no raw data or counts data.

r TPM normalization meta-analysis • 7.5k views

updated 14 months ago by Ram 45k • written 2.6 years ago by AlexStar ▴ 170

Ram · Answer 1 · 2023-11-16

4

17 months ago

DareDevil ★ 4.4k

TPM(i) = ( FPKM(i) / sum ( FPKM all transcripts ) ) * 10^6

The same relationship holds for RPKM: TPM(i) = ( RPKM(i) / sum ( RPKM all genes ) ) * 10^6

To convert FPKM to TPM, first generate dummy FPKM data:

num_genes <- 1000
num_samples <- 5

fpkm_matrix <- matrix(rexp(num_genes * num_samples, rate = 0.1), nrow = num_genes)
colnames(fpkm_matrix) <- paste0("Sample_", 1:num_samples)
rownames(fpkm_matrix) <- paste0("Gene_", 1:num_genes)

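Then compute TPM based on the above formula:

sum_fpkm_per_sample <- colSums(fpkm_matrix)          # total FPKM per sample
scaling_factors <- sum_fpkm_per_sample / 1e6         # per-sample "per million" factor
# Divide each sample (column) by its own factor; the factor already contains the 1e6
tpm_matrix <- t(t(fpkm_matrix) / scaling_factors)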
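
A quick sanity check, as a minimal sketch: after a correct conversion, every sample (column) should sum to one million. The helper name `fpkm_to_tpm_matrix` is hypothetical, the seed is an assumption added for reproducibility, and the dummy data is regenerated so the snippet runs on its own.

```r
set.seed(1)  # assumption: fixed seed so the dummy data is reproducible

# Dummy FPKM matrix: 1000 genes x 5 samples
fpkm_matrix <- matrix(rexp(1000 * 5, rate = 0.1), nrow = 1000,
                      dimnames = list(paste0("Gene_", 1:1000),
                                      paste0("Sample_", 1:5)))

# Hypothetical helper: divide each column by its own sum, then scale to 1e6
fpkm_to_tpm_matrix <- function(fpkm) {
  sweep(fpkm, 2, colSums(fpkm), "/") * 1e6
}

tpm_matrix <- fpkm_to_tpm_matrix(fpkm_matrix)

# Each sample should now sum to one million
colSums(tpm_matrix)
```

If the column sums come out as something other than 1e6 (for example 1e12), the scaling factor has most likely been applied twice.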

15 months ago by DareDevil ★ 4.4k

1

I apologize for my previous comment - the code looked really similar to the one generated by ChatGPT and your history of using ChatGPT triggered a suspicion. I'll delete my other comment.

17 months ago by Ram 45k

0

Is the function really working as expected?

The output for the gene_expression dataset is:

print(tpm_data)
Gene    Sample1  Sample2  Sample3
1 Gene1 166666.7 250000.0 133333.3
2 Gene2 266666.7 333333.3 240000.0
3 Gene3 555555.6 648148.1 518518.5

If we had sequenced only two samples, the output would be different:

print(tpm_data2)
Gene    Sample1  Sample2
1 Gene1 166666.7 200000.0
2 Gene2 266666.7 416666.7
3 Gene3 500000.0 466666.7

Here is a modification, based on your function, that yields the same result in both cases.

library(dplyr)
library(tidyr)

fpkm_to_tpm <- function(fpkm_dat) {
  gene_col <- names(fpkm_dat)[1]  # the first column holds the gene IDs

  fpkm_dat %>%
    pivot_longer(-all_of(gene_col), names_to = "sample", values_to = "fpkm") %>%
    group_by(sample) %>%
    mutate(total_fpkm_per_sample = sum(fpkm),             # sum of FPKM values per sample
           scaling_factor = total_fpkm_per_sample / 1e6,  # scaling factor per sample
           tpm_values = fpkm / scaling_factor) %>%        # calculate TPM values
    ungroup() %>%
    select(all_of(gene_col), sample, tpm_values) %>%
    pivot_wider(names_from = "sample", values_from = "tpm_values")
}

Function called on the gene_expression dataset:

fpkm_to_tpm(gene_expression)
# A tibble: 3 × 4
Gene   Sample1 Sample2 Sample3
<chr>    <dbl>   <dbl>   <dbl>
1 Gene1 166667. 200000  148148.
2 Gene2 333333. 333333. 333333.
3 Gene3 500000  466667. 518519.

Function called on a subset of the gene_expression dataset:

fpkm_to_tpm(gene_expression %>% select(1:3))
# A tibble: 3 × 3
 Gene  Sample1 Sample2
 <chr>   <dbl>   <dbl>
1 Gene1 166667. 200000 
2 Gene2 333333. 333333.
3 Gene3 500000  466667.

updated 14 months ago by Ram 45k • written 15 months ago by josev.die ▴ 70

score 1 · Answer 2 · 2022-10-03

1

2.6 years ago

Matthias Zepper 5.1k

I think your TPM from FPKM calculation is correct. See the section "Relationship between TPM and FPKM" in this helpful blog post by Harold Pimentel, which cites a manuscript by his PhD advisor Lior Pachter.
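
For readers who want the reasoning spelled out, here is a short derivation sketch. The symbols are my own notation, not taken from the blog post: $X_i$ is the number of fragments assigned to gene $i$, $\ell_i$ is its length in bp, and $N = \sum_j X_j$ is the total number of assigned fragments in the sample.

```latex
\begin{align}
\mathrm{FPKM}_i &= \frac{X_i}{(\ell_i / 10^3)\,(N / 10^6)}
                 = 10^9 \cdot \frac{X_i}{\ell_i\, N} \\
\mathrm{TPM}_i  &= 10^6 \cdot \frac{X_i / \ell_i}{\sum_j X_j / \ell_j} \\
\sum_j \mathrm{FPKM}_j &= \frac{10^9}{N} \sum_j \frac{X_j}{\ell_j} \\
\frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j}
                &= \frac{X_i / \ell_i}{\sum_j X_j / \ell_j}
\quad\Longrightarrow\quad
\mathrm{TPM}_i = 10^6 \cdot \frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j}
\end{align}
```

The factor $10^9 / N$ cancels in the ratio, which is why the library size is not needed for the conversion. Note that the sums run over all genes in the table, so if an FPKM table was filtered to a subset of genes before publication, the resulting values sum to one million over that subset only.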

2.6 years ago by Matthias Zepper 5.1k

3

Great! Thank you! If anyone in the future needs a solution for this question, here is the code to do this:

library(tidyverse)
# Scale every numeric (sample) column so that it sums to one million
fpkm_data %>% mutate(across(where(is.numeric), ~ .x / sum(.x) * 1e6))

2.6 years ago by AlexStar ▴ 170

Ram · Answer 3 · 2024-02-27

FPKM can be converted to TPM, and the approach you found is correct (it is also the same as the RPKM to TPM conversion). FPKM and RPKM are conceptually the same normalization, except that they are applied to paired-end and single-end RNA-seq data, respectively. Notice that your function to convert RPKM to TPM

TPM <- apply(RPKM, 2, function(x) x / sum(as.numeric(x)) * 10^6)

is doing the same thing as the approach you mention for FPKM

TPM = FPKM/[sum of all FPKM of a sample]*10^6

I have made a few modifications to your own code and formulas above to make the similarity clearer.
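
To illustrate the equivalence, here is a minimal sketch on simulated data (the counts, gene lengths, and all object names are made up for illustration). It compares TPM obtained via FPKM with TPM computed directly from counts.

```r
set.seed(42)  # assumption: simulated data, for illustration only

n_genes <- 200
counts <- matrix(rpois(n_genes * 3, lambda = 50), nrow = n_genes,
                 dimnames = list(paste0("Gene_", seq_len(n_genes)),
                                 paste0("Sample_", 1:3)))
gene_length <- sample(500:5000, n_genes, replace = TRUE)  # lengths in bp (made up)

# Route 1: counts -> FPKM (or RPKM) -> TPM
fpkm <- sweep(counts / gene_length, 2, colSums(counts), "/") * 1e9
tpm_from_fpkm <- sweep(fpkm, 2, colSums(fpkm), "/") * 1e6

# Route 2: counts -> TPM directly (length-normalise, then scale each sample to 1e6)
rate <- counts / gene_length
tpm_direct <- sweep(rate, 2, colSums(rate), "/") * 1e6

# Compare the two routes up to floating-point tolerance
all.equal(tpm_from_fpkm, tpm_direct)
```

Both routes give the same matrix because the per-sample library size cancels when each column is rescaled to sum to one million.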