Hi everyone,
My colleagues and I have been thinking about normalization for quite some time, and we are having a hard time reaching a consensus, both among ourselves and in the literature; for this reason, this is going to be a long and detailed post. Talking with other researchers, we feel that they are experiencing the same problems and questions, so we really hope we can get detailed answers or, at least, start a dialogue on how to proceed with the normalization of metagenomic data. Our doubts fall into two broad settings: normalization on metagenomes and normalization on MAGs.
CASE 1) Starting with the case of metagenomes (shotgun metagenomics):
Let us assume we start with metagenomes of interest and we want to know the abundance of multiple genes of interest (as HMM models). Our current workflow is outlined below:
- Trimming (fastp)
- Assembly (metaSPAdes)
- ORF prediction (Prodigal)
- Read mapping (BBMap) and pileup.sh (to get the average coverage per contig)
- hmmsearch (with the HMM models)
- Normalization?
Below we describe our procedure and our major assumptions, and we would like to know whether they are correct and, if not, why and how we could best proceed.
To simplify, we explain our workflow with just two genes (gene X and gene Y) and multiple metagenomes (sample A, sample B, and sample C).
After running Prodigal on our assembled metagenomes, we map the reads back to the contigs with BBMap and obtain the average coverage of each contig.
First assumption: we take the average coverage of each predicted ORF to be the average coverage of the contig on which the ORF was predicted, in order to reduce the bias introduced by differential gene length (see the sketch after this example). For example:
orf1 contig_1 17.5
orf2 contig_1 17.5
orf3 contig_2 5
orf4 contig_3 26.8
...
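For concreteness, here is a minimal Python sketch of what we mean by this assumption. It assumes the covstats table produced by pileup.sh (with an Avg_fold column) and the default Prodigal ORF naming (<contig>_<n>); the file name and ORF IDs are hypothetical:

```python
# Minimal sketch: assign each Prodigal ORF the average coverage of its parent
# contig, taken from the covstats table written by pileup.sh.
import csv

def load_contig_coverage(covstats_path):
    """Read the pileup.sh covstats table (tab-separated, '#ID' and 'Avg_fold' columns)."""
    cov = {}
    with open(covstats_path) as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        id_col = reader.fieldnames[0]  # usually '#ID'
        for row in reader:
            cov[row[id_col]] = float(row["Avg_fold"])
    return cov

def orf_to_contig(orf_id):
    """Prodigal names ORFs '<contig>_<n>'; strip the trailing ORF index."""
    return orf_id.rsplit("_", 1)[0]

# hypothetical usage:
contig_cov = load_contig_coverage("sampleA_covstats.tsv")
orf_ids = ["contig_1_1", "contig_1_2", "contig_2_1"]  # e.g. from the Prodigal .faa headers
orf_cov = {orf: contig_cov.get(orf_to_contig(orf), 0.0) for orf in orf_ids}
```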
Then we run hmmsearch with the HMM for our gene of interest (gene X) against the predicted ORF protein sequences.
After we get the output from hmmsearch, we make our second assumption:
Second assumption: we define the total coverage of gene X as the sum of the average coverages of all the hits obtained from hmmsearch (a sketch follows the totals below).
gene X hits:
orf1 contig_1 31.6
orf2 contig_1 45.6
orf3 contig_2 25.9
gene Y hits:
orf6 contig_8 6.9
orf9 contig_11 3.4
orf11 contig_11 3.4
total coverage of **gene X** = 31.6 + 45.6 + 25.9 = 103.1
total coverage of **gene Y** = 6.9 + 3.4 + 3.4 = 13.7
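In code, the summation we mean is roughly the following. It assumes the --tblout output of hmmsearch (ORF ID in column 1, HMM/gene name in column 3, full-sequence E-value in column 5) and the per-ORF coverages from the previous step; the E-value cutoff is just a placeholder:

```python
# Minimal sketch: sum the per-ORF coverage over all hmmsearch hits of each HMM,
# i.e. the "total coverage of gene X / gene Y" described above.
from collections import defaultdict

def gene_total_coverage(tblout_path, orf_cov, evalue_cutoff=1e-5):
    """Parse hmmsearch --tblout and sum ORF coverages per HMM (gene) name."""
    totals = defaultdict(float)
    with open(tblout_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            orf_id, hmm_name, evalue = fields[0], fields[2], float(fields[4])
            if evalue <= evalue_cutoff:
                totals[hmm_name] += orf_cov.get(orf_id, 0.0)
    return dict(totals)

# hypothetical usage, with orf_cov from the previous sketch:
# gene_total_coverage("sampleA.tblout", orf_cov)  -> e.g. {'geneX': 103.1, 'geneY': 13.7}
```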
Now, imagine that we have computed the abundance of gene X and gene Y in multiple metagenomes (all the samples above)
|       | sampleA | sampleB | sampleC | ... |
|-------|---------|---------|---------|-----|
| geneX | 103.1   | 53.2    | 99.2    |     |
| geneY | 13.7    | 23.3    | 28.2    |     |
| ...   |         |         |         |     |
We now need to perform some normalization to account for differences between the metagenomes (for example, differences in metagenome size / sequencing depth).
Our provisional normalization, applied to each gene, goes as follows: normalized abundance = (total coverage of the gene / total size of the metagenome) * 1,000,000 (the factor of one million is only for readability).
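On the toy table above, this normalization would look roughly like the sketch below (the metagenome sizes in bp are made-up numbers):

```python
# Minimal sketch: the provisional "coverage per million bp of metagenome" normalization
# applied to the gene x sample table; metagenome sizes are hypothetical.
coverage = {
    "geneX": {"sampleA": 103.1, "sampleB": 53.2, "sampleC": 99.2},
    "geneY": {"sampleA": 13.7,  "sampleB": 23.3, "sampleC": 28.2},
}
metagenome_bp = {"sampleA": 3.2e9, "sampleB": 2.7e9, "sampleC": 4.1e9}  # made up

normalized = {
    gene: {s: cov / metagenome_bp[s] * 1e6 for s, cov in per_sample.items()}
    for gene, per_sample in coverage.items()
}
print(normalized)
```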
Checking the literature (https://doi.org/10.1186/s12864-018-4637-6) we saw that there are other types of normalization such as TMM, RLE, etc.
Which type of normalization would you recommend, and why? We think we are following a reasonable approach, but we are not sure whether TMM or RLE can also be applied to our procedure. We also read about normalization based on housekeeping genes. In that case, how would you decide which housekeeping genes to use? And what if the assembler does not assemble the housekeeping genes well because of their conserved nature? Do you have any suggestions for alternative normalization methods?
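For comparison, here is a rough sketch of median-of-ratios scaling (the idea behind RLE / DESeq2-style size factors) applied to the same toy table, only to illustrate what such a method does. We are aware that these methods were designed for count matrices with many features, so whether this is sensible on a handful of coverage values is exactly part of our question:

```python
# Rough sketch of RLE-style (median-of-ratios) size factors on the toy table.
# Reference = per-gene geometric mean across samples; size factor = median ratio.
import math
from statistics import median

coverage = {
    "geneX": {"sampleA": 103.1, "sampleB": 53.2, "sampleC": 99.2},
    "geneY": {"sampleA": 13.7,  "sampleB": 23.3, "sampleC": 28.2},
}
samples = ["sampleA", "sampleB", "sampleC"]
genes = list(coverage)

geo_mean = {
    g: math.exp(sum(math.log(coverage[g][s]) for s in samples) / len(samples))
    for g in genes
}
size_factor = {s: median(coverage[g][s] / geo_mean[g] for g in genes) for s in samples}
rle_normalized = {g: {s: coverage[g][s] / size_factor[s] for s in samples} for g in genes}
print(size_factor)
```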
CASE 2) The second case we want to highlight concerns the same procedure but without the average coverage: after the hmmsearch, we count the hits instead of summing the average coverage reported. This way, instead of having (as shown above):
total coverage of **gene X** = 31.6 + 45.6 + 25.9 = 103.1
total coverage of **gene Y** = 6.9 + 3.4 + 3.4 = 13.7
we would have:
total hit counts of **gene X** = 3
total hit counts of **gene Y** = 3
This, in our opinion, illustrates how the two approaches can diverge and where each may be biased: the two genes have the same number of hits, but very different coverage totals.
Can this approach be used as well? What are the pros and the drawbacks of using it? (A minimal counting sketch follows below.)
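In code, this counting alternative would be roughly the following (same hmmsearch --tblout assumptions and placeholder E-value cutoff as in the sketch above):

```python
# Minimal sketch: count hmmsearch hits per HMM (gene) instead of summing coverage.
from collections import Counter

def gene_hit_counts(tblout_path, evalue_cutoff=1e-5):
    counts = Counter()
    with open(tblout_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            if float(fields[4]) <= evalue_cutoff:
                counts[fields[2]] += 1  # column 3 = HMM (gene) name
    return dict(counts)

# hypothetical usage:
# gene_hit_counts("sampleA.tblout")  -> e.g. {'geneX': 3, 'geneY': 3}
```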
CASE 3) Now, regarding the MAGs approach (which we have never tested):
Starting from the high-confidence MAGs we obtain (after three binners, DAS Tool, and CheckM): do we need to consider a different type of quantification and normalization specific to MAGs, such as the one employed in https://doi.org/10.1038/s41467-021-22736-6, where the assumption is that the gene abundance can be taken to be the abundance of the MAG itself, computed as follows?
For each MAG, the total length of mapped reads for the individual scaffolds (reads mapped with the BWA algorithm) is summed, and the total is then divided by the MAG size in bp. This number is then divided by the total number of reads to obtain the relative abundance.
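As far as we understand that formula, it amounts to something like the sketch below (all numbers and the per-scaffold mapped-base counts are hypothetical; in practice they would come from the BWA alignments, e.g. by summing the aligned read lengths per scaffold):

```python
# Minimal sketch of the MAG relative-abundance formula as we read it:
# (sum of mapped-read bases over the MAG's scaffolds / MAG size in bp) / total reads.

def mag_relative_abundance(mapped_bases_per_scaffold, mag_size_bp, total_reads):
    mag_coverage = sum(mapped_bases_per_scaffold.values()) / mag_size_bp
    return mag_coverage / total_reads

mapped_bases = {"scaffold_1": 1_200_000, "scaffold_2": 850_000}  # hypothetical
print(mag_relative_abundance(mapped_bases, mag_size_bp=2_500_000, total_reads=40_000_000))
```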
Additionally, similarly to case 2, when one does not have access to the raw reads, is it accurate to compute the abundance as:
gene abundance = total number of hits / total number of ORFs predicted from the MAG?
Can we also use our approach (cases 1 and 2) for the MAGs?
Do you know of another package (or integrated method) that can compute what we want?
We look forward to a detailed answer, and we hope we can start a productive dialogue about these issues.
Thank you very much.