Hello everyone. I would like to start a discussion here.
Recently, I responded to a question on Seqanswers regarding the binning of rare microbes from shotgun metagenomes (https://www.seqanswers.com/forum/applications-forums/metagenomics/324515-software-recommendation-for-metagenome-assembly-for-low-abundance-bacteria). I suggested an idea for an appropriate pipeline for this task, but unfortunately, it did not gain much attention. Nevertheless, I tried implementing my idea and I will share it here for further discussion.
Let's assume that we have several metagenomes representing a specific area. We expect that all our low-abundance taxa of interest are distributed more or less uniformly across the area, so their DNA, and therefore their reads, should be present in every sample.
The first step is to assemble contigs, for which we could use a tool like metaSPAdes. From there, we can construct MAGs (Metagenome-Assembled Genomes) from each sample individually. It is preferable to use three or four binners, such as CONCOCT, MetaBAT2, and MaxBin2. We can then use DAS Tool to obtain consensus bins from the results of all the binners.
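To make this concrete, the per-sample commands I have in mind look roughly like the sketch below. Paths, thread counts and sample names are placeholders, only MetaBAT2 is spelled out (CONCOCT and MaxBin2 are run analogously on the same contigs and coverage), and the DAS Tool helper script may be called Fasta_to_Scaffolds2Bin.sh in older versions:

```bash
# Per-sample assembly and binning (sketch; file names are placeholders)

# 1) Assemble one sample with metaSPAdes
metaspades.py -1 sampleX_R1.fastq.gz -2 sampleX_R2.fastq.gz -t 16 -o sampleX_asm

# 2) Map the sample's reads back to its contigs to get per-contig coverage
bowtie2-build sampleX_asm/contigs.fasta sampleX_idx
bowtie2 -x sampleX_idx -1 sampleX_R1.fastq.gz -2 sampleX_R2.fastq.gz -p 16 \
  | samtools sort -@ 8 -o sampleX.sorted.bam -
samtools index sampleX.sorted.bam

# 3) Bin with MetaBAT2 (CONCOCT and MaxBin2 would use the same contigs/coverage)
jgi_summarize_bam_contig_depths --outputDepth sampleX_depth.txt sampleX.sorted.bam
metabat2 -i sampleX_asm/contigs.fasta -a sampleX_depth.txt -o metabat_bins/bin

# 4) Consensus bins with DAS Tool
Fasta_to_Contig2Bin.sh -i metabat_bins -e fa > metabat.tsv
# ...same for concoct.tsv and maxbin.tsv...
DAS_Tool -i metabat.tsv,concoct.tsv,maxbin.tsv -l metabat,concoct,maxbin \
  -c sampleX_asm/contigs.fasta -o sampleX_dastool
```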
Now that we have MAGs built from the most abundant (and least diverse) DNA in the samples, the next step is to pool all the raw metagenomic reads into one file and merge all the constructed MAGs into another. The combined-MAGs file can be used to build an index, for example with Bowtie2. We can then map the pooled raw reads onto this index and keep only the reads that do not map, i.e., those that were not used for contig assembly and binning. Since this set of leftover reads contains DNA from all the initial samples except that of the most abundant taxa, we expect the number of reads belonging to low-abundance taxa to be sufficient for assembling contigs and clustering them into bins.
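In commands, this step could look something like the following (again, a sketch with placeholder file names):

```bash
# Pool raw reads from all samples and merge all MAGs into one FASTA
cat sample*_R1.fastq.gz > pooled_R1.fastq.gz
cat sample*_R2.fastq.gz > pooled_R2.fastq.gz
cat all_MAGs/*.fa > combined_MAGs.fasta

# Build a Bowtie2 index from the combined MAGs and map the pooled reads,
# writing read pairs that fail to align (concordantly) to separate files;
# an alternative is to extract fully unmapped pairs from the BAM with
# "samtools view -f 12"
bowtie2-build combined_MAGs.fasta mags_idx
bowtie2 -x mags_idx -1 pooled_R1.fastq.gz -2 pooled_R2.fastq.gz -p 16 \
  --un-conc-gz unmapped_R%.fastq.gz -S /dev/null

# unmapped_R1.fastq.gz / unmapped_R2.fastq.gz now hold the reads that were
# not recruited by any of the abundant MAGs
```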
Therefore, the next step is to use an assembler and a binner on this dataset enriched with less abundant DNA. The hypothesis is that this will result in more or less complete MAGs of rare microbes.
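In practice, this is just the same assembly and binning commands applied to the leftover reads, for example:

```bash
# Assemble and bin the leftover (unmapped) reads; file names are placeholders
metaspades.py -1 unmapped_R1.fastq.gz -2 unmapped_R2.fastq.gz -t 16 -o leftover_asm
# ...followed by the same mapping, MetaBAT2/CONCOCT/MaxBin2 and DAS Tool steps
# as above, but on leftover_asm/contigs.fasta
```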
It is not strictly necessary to map the raw reads onto the abundant MAGs; we could instead map them onto the initial contigs and keep the reads that were not assembled at all. However, in that case more data would be lost, because the binning program had already discarded a portion of the contigs (and the reads they contain) before the MAGs were constructed. Using contigs instead of MAGs would therefore leave us only with the reads used by neither the assembler nor the binner, whereas mapping onto the MAGs also recovers the reads from contigs that were assembled but left unbinned.
I have performed exactly what was described above.
A brief background: my study focuses on a soil in which I suspect a high diversity and abundance of a specific microbial group, sulfate-reducing bacteria. Sulfate-reducers are known for their low absolute abundance in soils. However, I have 1) culture-based evidence, 2) 16S amplicon profiles, 3) shotgun taxonomic profiles, and 4) soil metadata that all strongly indicate the presence of sulfate-reducers. Despite this, the results of MAG assembly are contradictory: none of the typical sulfate-reducers seen in the profiles were assembled using the usual pipeline. Interestingly, classification of the dropped reads (those that were not incorporated into contigs during assembly) shows the presence of many sulfate-reducers, sometimes in numbers even exceeding those found among the assembled reads.
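(For reference, classifying the dropped reads can be done with a read classifier such as Kraken2; the sketch below is only an illustration with placeholder paths, not necessarily the exact command I used.)

```bash
# Classify the dropped (unassembled) reads, e.g. with Kraken2
# (database path and file names are placeholders)
kraken2 --db /path/to/kraken2_db --paired --threads 16 \
  --report dropped_reads.kreport --output dropped_reads.kraken \
  dropped_R1.fastq.gz dropped_R2.fastq.gz
```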
Unfortunately, assembly and binning of the dropped reads also failed to recover the sulfate-reducers. It should be noted that this approach did produce new bins of moderate quality, and some microbes assembled this way represent phyla that were absent from the first set of bins obtained with the usual approach. I take the latter as an indication that the pipeline is generally sound, since it does yield at least a few new bins. I should also note that several shotgun samples were used and quality-controlled during my attempts, and binning through the usual pipeline yielded many excellent-quality MAGs. So, again, I believe it is not the data that poses the problem, but rather the way they are used.
My questions are:
- Is there a possibility that any of the suggested stages were performed incorrectly or are fundamentally flawed?
- What are your opinions and experiences regarding binning rare microbes?
- How does the depth of sequencing relate to the ability to bin rare taxa? And what depth is considered optimal?
- Will the core idea (getting rid of the most abundant and least diverse DNA in each sample, and then combining what is left) be helpful for binning rare species from samples with modest sequencing depth?
Please note that I am a beginner in environmental metagenomics and lack extensive experience. I would appreciate any suggestions or thoughts.
Respectfully, Konstantin.
Do you suggest doing such a normalization so that 1) only the reads present at < Nx coverage will be used for assembly and the subsequent steps, or 2) do you mean that these reads will be preserved for later binning, while the others (more abundant) will be discarded (or used for constructing the high-abundance bins)?
Regardless of the answer, I want to elaborate on the following. A genome is represented by a number of reads. Even if the genome is not abundant (the corresponding microbial species is rare), isn't it natural to expect that some of its reads are more abundant than others? In that case, picking a specific depth threshold will discard some of the genome's reads. Of course, we could try a number of threshold values and choose the one that yields more rare bins of good quality. But that also means this threshold search would have to be repeated for every new dataset, wouldn't it? Because we do not know how the taxa are distributed along the list of reads sorted from most to least abundant. Or should we expect this distribution to be the same for the same kind of sample and the same DNA extraction and sequencing protocol?
Sorry if my formulations are a bit cumbersome. Thank you very much for the suggestion, I will definitely try it.
Digital normalization will not "delete" any bin. If you do it at 60x, any bin with average sequencing depth < 60x should not be affected at all. Those with depth > 60x will be down-sampled to 60x, but they will still assemble. As to "some parts" of the genome being sequenced at a greater or smaller depth, that shouldn't be an issue because they will still be preserved at their original depth or at 60x, whichever is smaller.
I am not suggesting this normalization should be done for all datasets, but it might help your current case. I have seen a CPR-like MAG that doesn't show up when assembling a full dataset, but pops out when the data is digitally down-sampled.
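If you want to try it, a BBNorm run along these lines should do it; the 60x target, thread count and file names below are just placeholders, so check the BBTools documentation for your version:

```bash
# Digital normalization to ~60x with BBNorm (bbnorm.sh from BBTools):
# target=60 down-samples reads whose apparent depth exceeds ~60x,
# min=2 drops reads with apparent depth below 2 (likely sequencing errors),
# everything in between passes through untouched
bbnorm.sh in=pooled_R1.fastq.gz in2=pooled_R2.fastq.gz \
  out=norm_R1.fastq.gz out2=norm_R2.fastq.gz \
  target=60 min=2 threads=16
```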
Beware that abundance can't be reliably estimated from a down-sampled dataset.