Question

Normalize by contig size BEFORE RPKM normalization?

3.6 years ago

Arsenal ▴ 160

I am trying to normalize my virus-metagenomics raw counts following Roux et al., 2017:

The authors normalize raw counts by contig size. Afterwards, they transform them to RPKM (edgeR) to correct for different library sizes:

Before calculating any index, the read counts were first normalized by the contig length, since viral genome lengths can be highly variable (∼2 orders of magnitude, Angly et al., 2009).

Then, to account for potential differences in library sizes, we compared five different methods: (i) a simple normalization in which counts are divided by the library size, “Normalized” (ii) a method specifically designed to account for under-sampling of metagenomes, from the metagenomeSeq R package, “MGSeq” (iii and iv) two methods designed to minimize log-fold changes between samples for most of the populations, from the edgeR R package, “edgeR”, and the DESeq R package, “DESeq”, and (v) a rarefaction approach whereby all libraries get randomly down-sampled without replacement to the size of the smallest library, “Rarefied” (Fig. S2).

Problem: Rasmussen et al., 2019 follow Roux et al., but they state that RPKM normalization is done to account for contig size, not library size (they even cite Roux et al., 2017):

Prior any analysis the raw read counts in the vOTU-tables were normalized by reads per kilobase per million mapped reads (RPKM) [48], since the size of the viral contigs is highly variable [49]

Please, help me... Which one is correct? Am I missing any "between-the-lines" info?

Thanks!

virus normalization metagenomics edgeR RPKM • 1.8k views

ADD COMMENT • link updated 3.6 years ago by Carambakaracho ★ 3.3k • written 3.6 years ago by Arsenal ▴ 160

score 1 · Answer 1 · 2021-04-26

3.6 years ago

Carambakaracho ★ 3.3k

The RPKM formula accounts for both "gene" length and library size:

RPKM = numReads / ( geneLength/1000 * totalNumReads/1,000,000 )

"Gene", however, can be any feature you want, as long as you count the number of reads mapped across it.
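To make the formula concrete, here is a minimal Python sketch that treats each contig as the "gene". The contig names, read counts, lengths, and library size are made-up illustration values, and `rpkm` is a hypothetical helper, not a function from edgeR or any other package.

```python
def rpkm(num_reads, feature_length_bp, total_mapped_reads):
    """RPKM = numReads / (featureLength/1000 * totalNumReads/1,000,000)."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    length_kb = feature_length_bp / 1_000               # per kilobase of feature
    library_millions = total_mapped_reads / 1_000_000   # per million mapped reads
    return num_reads / (length_kb * library_millions)


# Hypothetical contigs: (reads mapped to the contig, contig length in bp)
contigs = {
    "contig_A": (500, 5_000),
    "contig_B": (500, 50_000),
}
library_size = 2_000_000  # hypothetical total mapped reads in this sample

values = {
    name: rpkm(reads, length, library_size)
    for name, (reads, length) in contigs.items()
}

for name, value in values.items():
    print(name, value)
```

Both contigs have the same raw count, but the contig that is ten times longer ends up with a tenfold lower RPKM, which is the length correction in action.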

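In a metagenome setting the features could well be the contigs themselves, each with its full length, provided you map against the entire metagenome. Just create a simple annotation file with one feature per contig that you can feed to, for example, featureCounts (it reads GTF/GFF-style or SAF annotation, so a BED file would need converting first).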
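As a rough sketch of such an annotation, the Python below writes one whole-contig feature per contig in SAF, the tab-delimited format featureCounts accepts alongside GTF. The contig names and lengths are hypothetical; in practice they would come from your assembly (for example a FASTA index), and `contigs_to_saf` is just an illustrative helper.

```python
def contigs_to_saf(contig_lengths):
    """Build SAF text with one feature spanning each whole contig.

    SAF columns: GeneID, Chr, Start, End, Strand (1-based, inclusive).
    """
    lines = ["GeneID\tChr\tStart\tEnd\tStrand"]
    for name, length in contig_lengths.items():
        # The feature ID and the "chromosome" are both the contig name;
        # the strand is set to "+" on the assumption of unstranded counting.
        lines.append(f"{name}\t{name}\t1\t{length}\t+")
    return "\n".join(lines) + "\n"


# Hypothetical contig lengths in bp
contig_lengths = {"contig_A": 5000, "contig_B": 50000}

saf_text = contigs_to_saf(contig_lengths)
print(saf_text, end="")
```

Saved to a file (say `contigs.saf`, a made-up name), this could then be passed along the lines of `featureCounts -F SAF -a contigs.saf -o counts.txt sample.bam`; check the options against your featureCounts version.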

The rarefaction approach was criticised by some strong statisticians (as far as I remember, Susan Holmes and Paul McMurdie brought up the DESeq2/edgeR idea), despite being widely used until a while ago.

ADD COMMENT • link 3.6 years ago by Carambakaracho ★ 3.3k

Thanks for your comment,

We have just received a reply from Simon Roux on his GitHub, and he himself explained that we simply need to do as Rasmussen et al. did: raw counts > RPKM. The "normalize by contig size 'THEN' RPKM to correct for library size" reading was perhaps just a misunderstanding.
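For anyone who wants to check the arithmetic, here is a small Python sketch comparing "divide by contig length, then by library size" with the one-step RPKM formula on a toy count table. All numbers are invented, the library sizes are assumed to be plain totals of mapped reads (no extra scaling factors such as those edgeR can apply), and the helper names are hypothetical.

```python
import math

# Toy vOTU table: raw read counts per contig per sample (hypothetical values)
counts = {
    "contig_A": {"s1": 500, "s2": 1500},
    "contig_B": {"s1": 500, "s2": 300},
}
lengths = {"contig_A": 5_000, "contig_B": 50_000}  # contig lengths in bp
lib_sizes = {"s1": 2_000_000, "s2": 6_000_000}     # total mapped reads per sample


def two_step(counts, lengths, lib_sizes):
    """Normalize by contig length (per kb) first, then by library size (per million)."""
    per_kb = {
        c: {s: n / (lengths[c] / 1_000) for s, n in row.items()}
        for c, row in counts.items()
    }
    return {
        c: {s: v / (lib_sizes[s] / 1_000_000) for s, v in row.items()}
        for c, row in per_kb.items()
    }


def one_step(counts, lengths, lib_sizes):
    """Apply the RPKM formula directly to the raw counts."""
    return {
        c: {
            s: n / ((lengths[c] / 1_000) * (lib_sizes[s] / 1_000_000))
            for s, n in row.items()
        }
        for c, row in counts.items()
    }


a = two_step(counts, lengths, lib_sizes)
b = one_step(counts, lengths, lib_sizes)

# True when both routes give the same value for every contig and sample
same = all(math.isclose(a[c][s], b[c][s]) for c in counts for s in counts[c])
print(same)
```

Under these assumptions the two routes are the same calculation, so length-normalizing first and then running RPKM on top would correct for contig length twice.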

I don't know why you mentioned rarefaction but, yes, it is indeed a problematic approach.

IMHO, there are still a lot of concepts that need to be sorted out in virus metagenomics. Good thing that nowadays there are channels like Biostars and GitHub where we can discuss and help everyone with the reproducibility of analyses.

Cheers

ADD REPLY • link 3.6 years ago by Arsenal ▴ 160

Ah well, he wrote what I meant, just so much more explicitly... Thanks for the feedback here!

And agreed, rarefaction was rather unrelated to your question. I admit I simply got hooked by the keyword in your citation.

ADD REPLY • link 3.6 years ago by Carambakaracho ★ 3.3k