Hi, when working with highly uneven metagenomic datasets (e.g. soil), where coverage is extremely high for a few dominant organisms (which creates problems during assembly because sequencing errors add erroneous edges to the graph) and relatively low for everything else, do you use bbnorm to normalize read depth or bbcms for depth filtering? Or both?
This is a question that would really benefit from Brian's input if he were around to answer it. He says the following in the BBNorm guide:
Normalizes read depth based on kmer counts. Can also error-correct,
bin reads by kmer depth, and generate a kmer depth histogram.
Normalization is often useful if you have too much data (for example,
600x average coverage when you only want 100x) or uneven coverage
(amplified single-cell, RNA-seq, viruses, metagenomes, etc).
The bbcms.sh documentation, on the other hand, says:

Error corrects reads and/or filters by depth, storing kmer counts in a
count-min sketch (a Bloom filter variant).

I would say use bbnorm.sh before bbcms.sh, because the latter filters based on depth.
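For concrete commands, here is a rough sketch going by the usage example in the BBNorm guide (target=100 and min=5 are the guide's example values, not soil-specific recommendations); the bbcms.sh parameter is quoted from memory, so verify it against the script's built-in help:

# Normalize to ~100x target depth; reads with apparent depth under 5x
# are presumed to be errors and discarded (example from the BBNorm guide)
bbnorm.sh in=reads.fq out=normalized.fq target=100 min=5

# Alternative/optional: depth filtering using a count-min sketch
# (mincount discards reads below this kmer depth; parameter name assumed, check bbcms.sh help)
bbcms.sh in=reads.fq out=filtered.fq mincount=2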
I have a follow-up question to this. I am assembling contigs from my metagenome reads and then annotating the assembled contigs to determine the functional pool of my sample, rather than going into genome binning. I normalized my reads before error correction and assembly, which I know is the right choice for building MAGs... but if I am just interested in the functional pool, should I wait to normalize until after annotation, when I want to determine the depth of coverage for my functions of interest? Thanks!
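In case it helps make the question concrete, the coverage calculation I have in mind is mapping the original (un-normalized) reads back to the assembled contigs, so the per-contig depths reflect real abundance. A minimal sketch, assuming BBMap's bbmap.sh and its covstats output:

# Map the raw, pre-normalization reads to the assembly and write per-contig coverage stats
bbmap.sh in=raw_reads.fq ref=contigs.fa covstats=covstats.txt

# covstats.txt reports average fold coverage per contig, which can then be joined
# to the contig annotations to get depth per function of interest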