Hello!
I have a table of counts that I got by aligning rna seq samples with STAR and using featureCounts, and my goal is to get TPM values for each gene of the table.
As a first step, I imported my table into R and modified it a bit to make it more understandable for me.
# load required packages
library("pacman")
p_load("vroom", "tidyverse")
# load data
GSE_fc <- vroom("/data/GSE_fc-table.txt",
skip = 1) %>%
select(1, 6:10) %>%
rename("SRR0" = 3,
"SRR1" = 4,
"SRR2" = 5,
"SRR3" = 6)
The table of counts looks like this:
> head(GSE_fc)
# A tibble: 6 × 6
Geneid Length SRR0 SRR1 SRR2 SRR3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000160072 7686 1790 537 1366 740
2 ENSG00000279928 570 0 0 0 0
3 ENSG00000228037 626 0 0 0 0
4 ENSG00000142611 9767 2 0 1 0
5 ENSG00000284616 1225 0 0 0 0
6 ENSG00000157911 3782 988 429 1028 444
I used as a reference the following links that indicate how to obtain TPM values using the Length column and the columns containing the count values from the count table:
https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
I adapted pieces of code to calculate TPM values and transformed it to dplyr and tidyr syntax (R packages). It looks like this:
# function to calculate tpm values
tpm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(rate) * 1e6
}
# code to obtain tpm values from a dataframe
GSE128263_fc.tpm <- GSE128263_fc %>%
pivot_longer(starts_with("SRR"),
names_to = "sample",
values_to = "counts") %>%
group_by(sample) %>%
mutate(tpm=tpm(counts, Length)) %>%
pivot_wider(names_from = sample,
values_from = c(4:5))
However, reading other similar posts that also calculate TPM,
Calculating TPM from featureCounts output
I see that they do the calculation in a different way and my specific question is to know if what I did, and wrote in this post, is right.
My table at the end of the calculation of the TPM values looks like this
> head(GSE128263_fc.tpm)
# A tibble: 6 × 10
Geneid Length count…¹ count…² count…³ count…⁴ tpm_S…⁵ tpm_S…⁶ tpm_S…⁷ tpm_S…⁸
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG000… 7686 1790 537 1366 740 35.6 33.0 23.9 25.0
2 ENSG000… 570 0 0 0 0 0 0 0 0
3 ENSG000… 626 0 0 0 0 0 0 0 0
4 ENSG000… 9767 2 0 1 0 0.0313 0 0.0138 0
5 ENSG000… 1225 0 0 0 0 0 0 0 0
6 ENSG000… 3782 988 429 1028 444 40.0 53.6 36.6 30.4
Thank you for your attention and support in advance
You can also give the transcript alignments from STAR directly to salmon which will give you properly computed TPMs (and will do so faster than RSEM).
Gene lengths values from featureCounts do not include introns.
But it's still not the right value. The overlap of all the exons doesn't represent the real length of anything, "correcting" with it doesn't solve anything.
I did not say it was the right value for TPMs. featureCounts is not designed to compute TPMs anyway. I am simply asking you to edit your answer to remove the misinformation about introns.
BTW, if you look back through past threads, you will see that OP has already run Salmon.
Thank you all for your answers and suggestions. I will re-run the alignments with STAR and get the counts and TPM values with salmon. :)