I am trying to normalize my virus-metagenomics raw counts following Roux et al., 2017.
The authors first normalize raw counts by contig size. Afterwards, they transform them to RPKM (edgeR) to correct for different library sizes:
Before calculating any index, the read counts were first normalized by the contig length, since viral genome lengths can be highly variable (∼2 orders of magnitude, Angly et al., 2009).
Then, to account for potential differences in library sizes, we compared five different methods: (i) a simple normalization in which counts are divided by the library size, “Normalized” (ii) a method specifically designed to account for under-sampling of metagenomes, from the metagenomeSeq R package, “MGSeq” (iii and iv) two methods designed to minimize log-fold changes between samples for most of the populations, from the edgeR R package, “edgeR”, and the DESeq R package, “DESeq”, and (v) a rarefaction approach whereby all libraries get randomly down-sampled without replacement to the size of the smallest library, “Rarefied” (Fig. S2).
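To make method (v) concrete, here is a minimal sketch of rarefaction as the quote describes it: every library is randomly down-sampled without replacement to the size of the smallest one. The sample names, contig names, counts, and the `rarefy` / `rarefy_all` helpers are all hypothetical, and the sketch works on raw integer counts for simplicity; it is not the code used in the paper.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from one sample's counts."""
    # Expand the count table into one entry per read (fine for toy data).
    pool = [contig for contig, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    tally = Counter(rng.sample(pool, depth))
    # Keep every contig in the output, including those that dropped to zero.
    return {contig: tally.get(contig, 0) for contig in counts}


def rarefy_all(samples, seed=0):
    """Down-sample every library to the size of the smallest library."""
    rng = random.Random(seed)
    depth = min(sum(counts.values()) for counts in samples.values())
    return {name: rarefy(counts, depth, rng) for name, counts in samples.items()}


# Hypothetical toy data: reads mapped per contig in two libraries.
samples = {
    "sample_small": {"contig_A": 100, "contig_B": 200, "contig_C": 300},
    "sample_large": {"contig_A": 1000, "contig_B": 500, "contig_C": 1500},
}

rarefied = rarefy_all(samples)
for name, counts in rarefied.items():
    print(name, counts, "total:", sum(counts.values()))
```

After this step both libraries have the same total, and the smallest library is left unchanged because all of its reads are drawn.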
Problem: Elsewhere, Rasmussen et al., 2019 follow Roux et al., but they state that RPKM normalization is done to account for contig size, not library size (they even cite Roux et al., 2017):
Prior any analysis the raw read counts in the vOTU-tables were normalized by reads per kilobase per million mapped reads (RPKM) [48], since the size of the viral contigs is highly variable [49]
Please, help me... Which one is correct? Am I missing any "between-the-lines" info?
Thanks!
Thanks for your comment,
We have just received a reply from Simon Roux on his GitHub, and he explained that we simply need to do as Rasmussen did:
raw counts > RPKM
and perhaps the "normalize by contig size, THEN RPKM to correct for library size" reading is just a misunderstanding.
I don't know why you mentioned rarefaction, but yes, rarefaction is indeed a problematic approach.
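On the raw counts > RPKM point, a small sketch may help show why the two descriptions are not really in conflict: RPKM divides by contig length (per kilobase) and by library size (per million mapped reads) in a single step. The contig names, counts, and lengths below are made up, and library size is assumed to be the sum of mapped reads in the table. This is plain arithmetic, not edgeR's implementation, which may additionally rescale library sizes with its normalization factors.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of contig per million mapped reads, for one sample."""
    library_size = sum(counts.values())  # assumption: total mapped reads in the table
    return {
        contig: n * 1e9 / (lengths_bp[contig] * library_size)
        for contig, n in counts.items()
    }


def length_then_library(counts, lengths_bp):
    """Two-step route: divide by contig length (kb), then by library size (millions)."""
    library_size = sum(counts.values())
    per_kb = {c: n / (lengths_bp[c] / 1e3) for c, n in counts.items()}
    return {c: v / (library_size / 1e6) for c, v in per_kb.items()}


# Hypothetical toy data: raw mapped reads and contig lengths in base pairs.
counts = {"contig_A": 500, "contig_B": 500, "contig_C": 1000}
lengths_bp = {"contig_A": 5_000, "contig_B": 50_000, "contig_C": 10_000}

one_step = rpkm(counts, lengths_bp)
two_step = length_then_library(counts, lengths_bp)

for contig in counts:
    print(contig, one_step[contig], two_step[contig])
```

Both routes give the same values up to floating-point rounding, as long as the library size is taken from the raw counts. In this toy table contig_A and contig_B have the same raw count, but contig_A ends up with ten times the RPKM because it is ten times shorter.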
IMHO, there are still many concepts that need to be sorted out in virus metagenomics. It is a good thing that nowadays there are channels like Biostars and GitHub where we can discuss and help everyone with reproducible analyses.
Cheers
Ah well, he wrote what I meant, just much more explicitly... Thanks for the feedback here!
And agreed, rarefaction was rather unrelated to your question. I admit I simply got hooked by the keyword in your citation.