Question

RPKM, how to normalize for library size

3.4 years ago

bs2 • 0

Hi everybody! I'm working on several metagenomes (so not transcriptomes), and I mapped high-quality reads to a database of nucleotide sequences with bowtie2. I want to convert the raw mapping counts of each gene to RPKM. I know this method might not be appropriate for this purpose, so please set that concern aside.

I obtained the formula to use from here. So, this is the question: in my case, should 'totalNumReads' refer to the total number of reads that successfully aligned to a gene (that is, the sample-wise sum of the counts), or to the library size, that is, the total number of reads in the metagenome (so also counting unmapped reads)?

I am curious to know your opinion about this. Thanks

rpkm dna-seq metagenomes • 1.7k views

updated 3.4 years ago by Istvan Albert 103k • written 3.4 years ago by bs2 • 0

It's the reads mapped to genes. If you have a matrix of counts with genes as rows and samples as columns, it is the sum of each column. Including unmapped reads makes no sense, because the expression of a gene would then change whenever you have more unmapped reads, e.g. in a noisier dataset, even though that has nothing to do with the reads assigned to genes.

3.4 years ago by ATpoint 89k

Ok, I got you, but in this case I'm not taking the size of the library into account, right? For example, if I have metagenome1 with 20M reads and metagenome2 with 10M reads, the count of reads mapping to a gene X might be higher in metagenome1 solely because it has more reads. And this might be independent of the sum of reads mapping to all the genes. This is the point that confuses me most.

3.4 years ago by bs2 • 0

The library size is the column sum, so the normalization already takes it into account. The column sum is the effective library size, and that is what matters.

3.4 years ago by ATpoint 89k
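
To make the exchange above concrete, here is a small sketch in Python. All numbers are hypothetical and chosen only for illustration: the 20M and 10M read totals come from the question, while the mapped totals and the counts for gene X are assumptions. It shows that a sample sequenced twice as deep also tends to have a column sum twice as large, so dividing by the column sum already removes the depth effect.

```python
# Hypothetical numbers, for illustration only.
# total_reads: all reads in the metagenome (mapped + unmapped)
# mapped_to_genes: column sum of the count matrix for that sample
# gene_x: raw count of reads mapped to gene X
samples = {
    "metagenome1": {"total_reads": 20_000_000, "mapped_to_genes": 10_000_000, "gene_x": 200},
    "metagenome2": {"total_reads": 10_000_000, "mapped_to_genes": 5_000_000, "gene_x": 100},
}


def reads_per_million(count, scaling_total):
    """Scale a raw count by a total expressed in millions of reads."""
    return count / (scaling_total / 1_000_000)


# Scale each gene X count by the sample's column sum (reads mapped to genes).
per_million = {
    name: reads_per_million(s["gene_x"], s["mapped_to_genes"])
    for name, s in samples.items()
}

for name, value in per_million.items():
    print(name, value)
```

In this made-up case the raw counts differ twofold (200 versus 100), but the scaled values are identical, because the deeper sample also contributes twice as many gene-mapped reads to its column sum.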

score 0 · Answer 1 · 2022-05-19

totalNumReads is the total number of mapped reads overall.

RPKM is a normalization that is meant to take two factors into account:

The amount of usable data
The length of the donor sequence

Hence it divides by both. Doing so "normalizes" (makes comparable) values obtained from more or less data, and from shorter or longer donor sequences.
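
As a rough sketch of the two divisions described above, the usual RPKM definition is count × 10^9 / (total mapped reads × gene length in bp). The Python function below follows that definition for a single sample; the function name `rpkm`, the gene names, the counts and the lengths are all hypothetical, and the total is taken as the sum of the sample's counts, as suggested in this thread.

```python
def rpkm(counts, gene_lengths_bp):
    """Compute RPKM for one sample.

    counts: dict mapping gene -> raw read count in this sample
    gene_lengths_bp: dict mapping gene -> gene length in base pairs
    The scaling total is the sum of the counts (reads mapped to genes).
    """
    total_mapped = sum(counts.values())
    if total_mapped == 0:
        raise ValueError("sample has no mapped reads")
    return {
        gene: count * 1e9 / (total_mapped * gene_lengths_bp[gene])
        for gene, count in counts.items()
    }


# Hypothetical sample: three genes, 10,000 mapped reads in total.
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

values = rpkm(counts, lengths)
for gene, value in values.items():
    print(gene, value)
```

Here geneB has three times the raw count of geneA but is also three times longer, so both end up with the same RPKM. That is the length correction at work.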