Question

What exactly is the denominator when normalizing expression for sequencing depth?


7.5 years ago

colin.kern ★ 1.1k

I have a question about how RPKM/FPKM and TPM are normalized for sequencing depth. I can think of at least three numbers that could be used as the denominator in this normalization: the total number of raw reads, the total number of raw reads that successfully aligned to the genome, or the total number of alignments that were assigned to one or more transcripts.

This blog post describes the depth-normalization step for RPKM as:

Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.

This implies that the total number of raw reads is used, regardless of the alignment rate or of alignments that were not assigned to any gene.

But then the TPM normalization step is:

Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.

Since the RPK values are calculated in the previous step, from reads already assigned to a transcript and normalized for transcript length, this implies that TPM only considers reads that aligned to a transcript when normalizing for depth.
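To make the comparison concrete, here is a minimal Python sketch of both calculations as the blog post describes them. The counts, lengths, and read totals are made-up numbers, and the function names `rpkm` and `tpm` are illustrative only; this is not how any particular tool implements them.

```python
def rpkm(counts, lengths_bp, depth):
    """Reads per kilobase per million, for a caller-chosen depth denominator."""
    per_million = depth / 1_000_000
    return [c / (l / 1_000) / per_million for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: the scaling factor is the sum of RPK values."""
    rpk = [c / (l / 1_000) for c, l in zip(counts, lengths_bp)]
    per_million = sum(rpk) / 1_000_000
    return [r / per_million for r in rpk]


# Hypothetical sample: three transcripts with their assigned read counts.
counts = [100, 300, 600]
lengths_bp = [1_000, 2_000, 3_000]

# The three candidate denominators from the question (invented totals).
raw_reads = 2_000             # everything that came off the sequencer
aligned_reads = 1_500         # reads that aligned to the genome
assigned_reads = sum(counts)  # reads counted toward some transcript

rpkm_raw = rpkm(counts, lengths_bp, raw_reads)
rpkm_aligned = rpkm(counts, lengths_bp, aligned_reads)
rpkm_assigned = rpkm(counts, lengths_bp, assigned_reads)
tpm_values = tpm(counts, lengths_bp)

print("RPKM (raw):     ", rpkm_raw)
print("RPKM (aligned): ", rpkm_aligned)
print("RPKM (assigned):", rpkm_assigned)
print("TPM:            ", tpm_values)
```

In this sketch the RPKM values shift with the choice of denominator, while TPM never sees the raw or aligned totals at all: it depends only on the assigned counts and the transcript lengths.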

My main question is whether this difference truly exists in how common bioinformatics software calculates these two metrics, or whether it depends on the software. For example, StringTie reports both FPKM and TPM in its output. Does anyone know whether it calculates them with this difference in how unaligned reads and "stray" alignments are included in the normalization?

RNA-Seq • 4.2k views

updated 7.5 years ago by Devon Ryan 105k • written 7.5 years ago by colin.kern ★ 1.1k

score 0 · Answer 1 · 2017-10-17

Normally the number of aligned reads is used for this, but as you've noticed, this needn't always be the case, and every tool can do whatever it wants. For what it's worth, many tools that compute RPKM/FPKM don't know how many total reads you started out with, because you're just giving them a BAM file.

Like other tools, StringTie doesn't necessarily know about your unmapped reads (they're probably not in the BAM file you give it), so it can't then use them in its calculation.

As an aside, I would argue that using unmapped reads in the calculation is incorrect. It's not uncommon for different samples to have fairly different alignment rates, and including unmapped reads would make their RPKM/FPKM values drastically different for that reason alone.
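A small Python sketch of that effect, using invented numbers: two samples in which one gene takes the same share of the aligned reads, but the samples have different alignment rates. The helper `rpkm_one_gene` is hypothetical and only illustrates the arithmetic.

```python
def rpkm_one_gene(count, length_bp, depth):
    """RPKM for a single gene, given a caller-chosen depth denominator."""
    return count / (length_bp / 1_000) / (depth / 1_000_000)


gene_length_bp = 2_000

# Hypothetical samples: same raw depth, different alignment rates.
# In both, the gene accounts for 1 in 10,000 aligned reads.
sample_a = {"raw": 10_000_000, "aligned": 9_000_000, "gene_count": 900}
sample_b = {"raw": 10_000_000, "aligned": 6_000_000, "gene_count": 600}

for name, s in (("A", sample_a), ("B", sample_b)):
    by_aligned = rpkm_one_gene(s["gene_count"], gene_length_bp, s["aligned"])
    by_raw = rpkm_one_gene(s["gene_count"], gene_length_bp, s["raw"])
    print(f"sample {name}: RPKM by aligned reads = {by_aligned:.1f}, "
          f"by raw reads = {by_raw:.1f}")
```

With aligned reads as the denominator the two samples agree; with raw reads, sample B looks lower purely because more of its reads failed to align.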

As a further aside, don't use RPKM/FPKM for anything important; they're not at all robust.