Question

Purpose of bamCoverage RPKM normalization method


3.8 years ago

Aspire ▴ 370

bamCoverage --normalizeUsing RPKM does the normalization using a constant bin size. Afaik, the standard normalization using RPKM uses the length of the specific gene (not constant).

So does --normalizeUsing RPKM really represent the standard RPKM normalization (done with the gene length)? If not, what is its purpose?

rpkm bamCoverage • 3.2k views

updated 3.8 years ago by ATpoint 87k • written 3.8 years ago by Aspire ▴ 370

score 4 · Accepted Answer · 2021-06-08

The problem is that "classic" RPKM starts from a count matrix, so you have a single value per gene (or transcript, region, whatever you measure). Bigwigs are interval-based, so you have many values per gene: one per base pair at the finest resolution, or one per bin if you bin the signal. In any case, you cannot meaningfully (I guess) apply a single constant (e.g. a gene length denominator) to the signal.
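To make the difference concrete, here is a minimal Python sketch with made-up numbers. It assumes the per-bin formula as I understand it from the deepTools documentation, i.e. reads in the bin divided by (mapped reads in millions × bin length in kb). The read counts, gene length and bin size are all hypothetical, and it ignores details such as reads spanning bin boundaries or read extension.

```python
# Toy numbers, all hypothetical.
total_mapped_reads = 20_000_000


def rpkm(reads, length_bp, total_reads):
    """Reads per kilobase per million mapped reads."""
    return reads / ((length_bp / 1_000) * (total_reads / 1_000_000))


# Classic RPKM: one count and one length per gene, so one value per gene.
gene_length_bp = 2_000
gene_reads = 400
gene_rpkm = rpkm(gene_reads, gene_length_bp, total_mapped_reads)

# Track-style RPKM: the same gene tiled into fixed 500 bp bins,
# so the length term is the constant bin size and you get one value per bin.
bin_size_bp = 500
reads_per_bin = [40, 160, 150, 50]  # sums to gene_reads
bin_rpkm = [rpkm(r, bin_size_bp, total_mapped_reads) for r in reads_per_bin]

print(gene_rpkm)
print(bin_rpkm)
```

In this toy case the bins tile the gene exactly, so the mean of the per-bin values equals the gene-level RPKM. The track is not "wrong", it just reports the same kind of quantity at bin resolution rather than as one number per gene.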

Tbh I never really got what the binning is good for in bigwigs other than saving disk space (and making the browser tracks clunky and ugly). Back in the day when I used deeptools I always used a bin size of 1, but with RPKM this divides the value by 0.001 (because a bin size of 1 is 0.001 kb), which gives unintuitively large values. CPM is therefore probably a more intuitive choice than RPKM. In any case, these per-million scalings are flawed by design as they do not correct for composition, see for example:

TMM-Normalization

and a potential alternative to scale your tracks:

ATAC-seq sample normalization
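As a rough illustration of the bin-size-1 point above, here is a small Python sketch. The numbers are hypothetical and the formulas are my reading of how the per-bin CPM and RPKM scalings are defined, not code taken from deepTools.

```python
def cpm(reads, total_reads):
    """Counts per million mapped reads."""
    return reads / (total_reads / 1_000_000)


def rpkm_per_bin(reads, bin_size_bp, total_reads):
    """CPM additionally divided by the bin length in kb."""
    return cpm(reads, total_reads) * 1_000 / bin_size_bp


total_mapped_reads = 20_000_000  # hypothetical library size
reads_in_bin = 30                # hypothetical coverage in one bin

cpm_value = cpm(reads_in_bin, total_mapped_reads)
rpkm_bin1 = rpkm_per_bin(reads_in_bin, 1, total_mapped_reads)
rpkm_bin1000 = rpkm_per_bin(reads_in_bin, 1_000, total_mapped_reads)

print(cpm_value, rpkm_bin1, rpkm_bin1000)
```

With a bin size of 1 the RPKM value is 1000 times the CPM value, and with a 1 kb bin the two coincide. For a fixed bin size the difference between the two is only a constant factor, so it changes the y-axis scale of the track but not its shape.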