I have a question about how RPKM/FPKM and TPM are normalized for sequencing depth. I can think of at least three numbers that could be used as the denominator for this normalization: the total number of raw reads, the total number of raw reads that successfully aligned to the genome, or the total number of alignments that are counted toward one or more transcripts.
This blog post describes the depth-normalization step for RPKM as:
Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
This implies that the total number of raw reads is used, regardless of the alignment rate and regardless of alignments that were not counted toward any gene.
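To make the three candidate denominators concrete, here is a minimal Python sketch of the RPKM recipe quoted above. The gene names, counts, lengths, and read totals are invented for illustration, and the `rpkm` helper is hypothetical rather than taken from any particular tool.

```python
def rpkm(counts, lengths_bp, depth):
    """RPKM per gene, using `depth` as the sequencing-depth denominator."""
    per_million = depth / 1_000_000  # the "per million" scaling factor
    return {
        gene: counts[gene] / per_million / (lengths_bp[gene] / 1_000)
        for gene in counts
    }

# Hypothetical toy sample (all numbers are made up).
counts = {"geneA": 100_000, "geneB": 300_000, "geneC": 600_000}  # reads counted per gene
lengths_bp = {"geneA": 1_000, "geneB": 2_000, "geneC": 4_000}

raw_reads = 2_000_000                  # candidate 1: every read sequenced
aligned_reads = 1_600_000              # candidate 2: reads that aligned to the genome
assigned_reads = sum(counts.values())  # candidate 3: reads counted toward a gene

for label, depth in [
    ("raw", raw_reads),
    ("aligned", aligned_reads),
    ("assigned", assigned_reads),
]:
    print(label, rpkm(counts, lengths_bp, depth))
```

In this sketch the choice of denominator rescales every gene's RPKM by the same factor within a sample, so the ranking of genes inside the sample is unchanged, but values from samples with different alignment or assignment rates would not be directly comparable.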
But then the TPM normalization step is:
Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
Because the RPK values are calculated in the previous step, where counts are normalized for transcript length, this implies that TPM only considers reads that were counted toward a transcript when normalizing for depth.
My main question is whether this difference truly exists in how common bioinformatics software calculates these two metrics, or whether it depends on the software. For example, StringTie reports both FPKM and TPM in its output. Does anyone know whether it calculates these metrics with this difference in how unaligned reads and "stray" alignments are included in the normalization?
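As a sanity check on that reading, here is a small Python sketch of the TPM recipe quoted above, again with invented numbers and hypothetical helper names (`tpm`, `rpkm`, `tpm_from_rpkm`). It also rescales RPKM values so that they sum to one million, to see whether the depth denominator used for RPKM matters once that rescaling is done.

```python
def tpm(counts, lengths_bp):
    """TPM per gene: length-normalize first, then scale by the summed RPK."""
    rpk = {g: counts[g] / (lengths_bp[g] / 1_000) for g in counts}
    per_million = sum(rpk.values()) / 1_000_000
    return {g: rpk[g] / per_million for g in rpk}

def rpkm(counts, lengths_bp, depth):
    """RPKM per gene with an explicit depth denominator."""
    per_million = depth / 1_000_000
    return {g: counts[g] / per_million / (lengths_bp[g] / 1_000) for g in counts}

def tpm_from_rpkm(rpkm_values):
    """Rescale RPKM values so that they sum to one million."""
    total = sum(rpkm_values.values())
    return {g: v / total * 1_000_000 for g, v in rpkm_values.items()}

# Hypothetical toy sample (all numbers are made up).
counts = {"geneA": 100_000, "geneB": 300_000, "geneC": 600_000}
lengths_bp = {"geneA": 1_000, "geneB": 2_000, "geneC": 4_000}

direct = tpm(counts, lengths_bp)
print("TPM:", direct)

# Try two different depth denominators for RPKM, then rescale.
for depth in (2_000_000, 1_000_000):  # raw reads vs. assigned reads
    print(depth, tpm_from_rpkm(rpkm(counts, lengths_bp, depth)))
```

In this sketch the raw and aligned read totals never enter `tpm`, and the rescaled RPKM values agree with the direct TPM values (up to floating-point rounding) whichever depth is used, because the depth factor cancels in the ratio. Whether a given tool such as StringTie follows exactly this arithmetic is not something the sketch can confirm.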