Question

Obtaining TPM values from STAR alignment and counts with featurecounts using R's tidyverse syntax (dplyr and tidyr)

0

Entering edit mode

19 months ago

Josh ▴ 20

Hello!

I have a table of counts that I got by aligning rna seq samples with STAR and using featureCounts, and my goal is to get TPM values for each gene of the table.

As a first step, I imported my table into R and modified it a bit to make it more understandable for me.

# load required packages
library("pacman")

p_load("vroom", "tidyverse")


# load data
GSE_fc <- vroom("/data/GSE_fc-table.txt", 
                      skip = 1) %>%
  select(1, 6:10) %>%
  rename("SRR0" = 3,
         "SRR1" = 4,
         "SRR2" = 5,
         "SRR3" = 6)

The table of counts looks like this:

 > head(GSE_fc)
# A tibble: 6 × 6
  Geneid          Length      SRR0       SRR1       SRR2       SRR3
  <chr>            <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1 ENSG00000160072   7686       1790        537       1366        740
2 ENSG00000279928    570          0          0          0          0
3 ENSG00000228037    626          0          0          0          0
4 ENSG00000142611   9767          2          0          1          0
5 ENSG00000284616   1225          0          0          0          0
6 ENSG00000157911   3782        988        429       1028        444

I used as a reference the following links that indicate how to obtain TPM values using the Length column and the columns containing the count values from the count table:

http://ny-shao.name/2016/11/18/a-short-script-to-calculate-rpkm-and-tpm-from-featurecounts-output.html

https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

Raw counts to TPM in R

I adapted pieces of code to calculate TPM values and transformed it to dplyr and tidyr syntax (R packages). It looks like this:

# function to calculate tpm values 
tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate) * 1e6
}

# code to obtain tpm values from a dataframe
GSE128263_fc.tpm <- GSE128263_fc %>% 
  pivot_longer(starts_with("SRR"), 
               names_to = "sample", 
               values_to = "counts") %>% 
  group_by(sample) %>% 
  mutate(tpm=tpm(counts, Length)) %>% 
  pivot_wider(names_from = sample,
              values_from = c(4:5))

However, reading other similar posts that also calculate TPM,

Calculating TPM from featureCounts output

I see that they do the calculation in a different way and my specific question is to know if what I did, and wrote in this post, is right.

My table at the end of the calculation of the TPM values looks like this

> head(GSE128263_fc.tpm)
# A tibble: 6 × 10
  Geneid   Length count…¹ count…² count…³ count…⁴ tpm_S…⁵ tpm_S…⁶ tpm_S…⁷ tpm_S…⁸
  <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 ENSG000…   7686    1790     537    1366     740 35.6       33.0 23.9       25.0
2 ENSG000…    570       0       0       0       0  0          0    0          0  
3 ENSG000…    626       0       0       0       0  0          0    0          0  
4 ENSG000…   9767       2       0       1       0  0.0313     0    0.0138     0  
5 ENSG000…   1225       0       0       0       0  0          0    0          0  
6 ENSG000…   3782     988     429    1028     444 40.0       53.6 36.6       30.4

Thank you for your attention and support in advance

R rnaseq tidyverse tpm • 2.9k views

ADD COMMENT • link 19 months ago by Josh ▴ 20

score 2 · Answer 1 · 2023-05-02

2

Entering edit mode

19 months ago

swbarnes2 14k

Hang on. You want to use transcript lengths, not gene lengths. How can you think that the lengths of introns is relevant here? And since each gene has many transcripts of different lengths, you would need to know the relative abundance of all those transcripts, which STAR can't do for you.

Just start over with Kallisto, it will generate TPM calculated properly. Or, if you have the transcript alignment from STAR, RSEM can use that input to give you TPM.

ADD COMMENT • link 19 months ago by swbarnes2 14k

2

Entering edit mode

You can also give the transcript alignments from STAR directly to salmon which will give you properly computed TPMs (and will do so faster than RSEM).

ADD REPLY • link 19 months ago by Rob 6.9k

1

Entering edit mode

Gene lengths values from featureCounts do not include introns.

ADD REPLY • link 19 months ago by Gordon Smyth ★ 7.7k

0

Entering edit mode

But it's still not the right value. The overlap of all the exons doesn't represent the real length of anything, "correcting" with it doesn't solve anything.

ADD REPLY • link 19 months ago by swbarnes2 14k

0

Entering edit mode

I did not say it was the right value for TPMs. featureCounts is not designed to compute TPMs anyway. I am simply asking you to edit your answer to remove the misinformation about introns.

BTW, if you look back through past threads, you will see that OP has already run Salmon.

ADD REPLY • link 19 months ago by Gordon Smyth ★ 7.7k

1

Entering edit mode

Thank you all for your answers and suggestions. I will re-run the alignments with STAR and get the counts and TPM values with salmon. :)

ADD REPLY • link 19 months ago by Josh ▴ 20