Hello, I am working with an RNA-seq FeatureCounts output file that supplies the counts for a given ENSG gene ID, as well as the gene length(according to documentation this is in base pairs, not kilobases). Is there a way to obtain approximate TPMs/FPKMs or something similar to TPM's using these columns?
geneID geneLength read counts
ENSG00000121410 4006 3
ENSG00000268895 2793 3
ENSG00000148584 9603 11
ENSG00000175899 6384 1
I've read that it's possible to do the following to get TPMs:
- Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
- Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
- Divide the RPK values by the “per million” scaling factor. This gives you TPM.
Alternatively, it seems that some people use the rpkm function EdgeR to do this (however I need TPMs or FPKM's specifically). What would be the best way to calculate TPMs?
Best way to get TPMs is to use RSEM instead of featureCounts.
The notion of gene length is kinda iffy. Does it mean the length of the entire gene (introns+exons)? Does it mean the lengths of exons? What if one transcript of a gene is expressed very highly while a shorter one of the same gene is not expressed at all -- do we account for that?
I'm not sure how those lengths are determined. Ensembl doesn't think any transcripts of that first gene are 4kb long.
It's gene length, not transcript length. It is the total number of bases covered by any annotated exon for that gene. It merges all the annotated transcripts.
But would you want to use that value for TPM? Or anything?
It works well for RPKM or FPKM. It is the appropriate length measure here because it matches how the genewise read counts are computed.
What must be clear to everyone using RPKM is that simple summing annotated exons is a comparably naive way of getting the gene length as it does not respect in which proportions isoforms are being expressed. Hence, if you want to make statements on expression levels with RPKM this is not accurate. The concern is not theoretical, for example the transcription factor CEBPa (a known driver in hematopoietic malignency, and the isoforms having known and different roles in it) expresses two isoforms with ~25% difference in tx length. It then makes quite a difference whether conditionA expresses 95% isoform-1 and 5% isoform-B but conditionB expresses a 10/90% ratio. featureCounts (please correct if wrong) in my understanding will output the same gene length for all conditions. If this unequal ratio is nested by experimental conditions then RPKMs on naive exon sum is not accurate and one needs dedicated approaches like salmon or rsem to estimate effective length based on transcript expression composition. I would therefore treat featureCounts length output with care if @OP you really want to use that. Whether this affects the big picture is questionable and would need to be checked, it definitely can affect individual genes.
Yes, I am aware of the approximations made by FPKM. (I've been aware of them for 15 years, although people do love to explain it to me again.)
IMO FPKM will probably work fine for OP, just as it did for the original authors of the PECA method that OP is attempting to reproduce. The PECA method is from Wing Wong's group, which is a very strong statistical bioinformatics Lab at Stanford.
This is not the time for a long discussion, but I do wish that people who answer questions on Biostars would treat TPM with a similar level of scepticism. In my experience, FPKM seldom gives a misleading idea of expression changes for more than 1-2 genes per dataset. On the other hand, transcript TPM methods like RSEM or Salmon give TPMs that are wildly wrong for tens of thousands of transcripts for almost every dataset. That is not a criticism of those methods, just an observation on what can be reliably measured from available data.
Dear Professor Smyth,
Can you show me referenced benchmarks that FPKM is misleading for "1-2 genes per dataset" while TPMs "are wildly wrong for tens of thousands of transcripts for almost every dataset".
There are many papers (w/ benchmarks) arguing the limitations of FPKMs and the union-exon approach to gene length (and why TPM is preferable) -- Zhao et al., 2015, the tximport paper, Trapnell et. al 2012, and the blog post (w/ included CSHL talk) from the person who first formulated FPKM which, as you are aware, is a "statistical bioinformatics lab" at Berkeley (now Caltech).
Hello Ram and ATPoint, thank you for helping me repeatedly with my questions. I am now finalizing the paper, and if there is a way to acknowledge you, I would be happy to do so. Thanks again!