I have RNA-Seq data with raw counts extracted from bam files using featureCounts
. I wanted to convert counts to tpm, because I need TPM for some analysis.
Initially, I got the coding length for each Gene like below:
library(GenomicFeatures)
gtf_txdb <- makeTxDbFromGFF("gencode.v27.annotation.gtf")
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(gtf_txdb,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- sum(width(reduce(exons.list.per.gene)))
exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))
Now, I have two data frames like data
is with raw counts and other data2
with gene name and Coding_Length
tpm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(rate, na.rm = TRUE) * 1e6
}
final_tpm <- apply(data, 2, function(x) tpm(x, data2$Coding_Length))
final_tpm
using the function for some of the genes:
gene tpm
RPSAP8 3.79
AL645608.8 4.14
RNF223 1.38
And I also used other function like below on dataframe df
which have two columns raw counts
and effLength
calculated from Coding_Length
df$effLength <- df$Coding_Length - 203.7 + 1
countToTpm <- function(counts, effLen)
{
rate <- log(counts) - log(effLen)
denom <- log(sum(exp(rate)))
exp(rate - denom + log(1e6))
}
df$tpm <- with(df, countToTpm(counts, effLength))
With the above way the output of few genes is like below:
gene_name tpm
RPSAP8 1.57
AL645608.8 1.46
RNF223 0.496
Along with the above way, I also used kallisto
to get tpm
. And when I compared both results, they doesn't match and not even close.
target_id length eff_length est_counts tpm
RPSAP8 889 747.358 4.10538 0.0635304
AL645608.8 2086 1944.36 116 0.689984
RNF223 1902 1760.36 50.0024 0.328508
I actually have 300 samples for which I have raw counts
. I wanted to convert them to tpm
and using the above functions but they give different results compared to tpm from kallisto.
Which of these functions is to be used for conversion of counts to tpm? Why they don't match to kallisto tpm?
When compared to TPMs estimated from featureCounts output, the TPM from kallisto should be more correct. As for the reason why, see the answer from Michael Love for the question DESeq2: Is it possible to convert read counts to expression values via TPM and return these values?
I didn't check your functions that calculate TPM, but at least one of them is wrong: if they are using the same input data, and calculating the same value, they should yield the same result. Did you sanity check your TPMs? For example, to they sum to one million?
If you check the functions, for one of the function only
Coding_length
is used and for other functioneffLength
which is calculated fromCoding_Length
is used. And yes, I did the sanity check, the results from both functions givesum to one million
. Among the two functions I used one is suggested by Michael Love. Please check it.And Yes, I can understand that kallisto tpm is better, But I have the raw counts and instead of running kallisto again on all 300 samples, I wanted to convert raw counts to tpm.
kallisto (or Salmon) would run these 300 samples really quickly, so if consider important get better estimates of TPM, you should run them.
What is the reasoning behind this effective length? As you are not using the same gene length in both calculation, the TPMs will differ.
I saw the effective length thing here A review of RNA-Seq expression units. Yes I feel like running kallisto now. One last question. Once I get the transcripts with tpm, can I sum the tpm of transcripts from same gene to make it gene level transcripts?