Hi everyone,
This may be a silly question, but I am wondering whether metagenome assembly and binning is a valid method of determining if a sample contains a mixture of species. Similarly, can metagenomics be used to identify and remove contamination from a single genome?
For some background, I was recently asked to assemble bacterial genomes from five samples sequenced on Illumina MiSeq. I was told that these samples were pure, so I ran the raw reads through a quick de novo draft genome assembly pipeline: BBDuk for adapter removal and quality filtering > SPAdes in “isolate” mode > Quast and lineage-specific CheckM. It seemed that the draft genomes were of okay quality.
However, I was a bit concerned by the level of contamination. I was later told that the five samples were actually enrichments of a specific genus of bacteria from an environmental sample; there was therefore no guarantee that each sample contained only a single species. I also learned that additional Illumina sequencing had been conducted on other passages from the original five samples, and in some of these runs less than 50% of the reads mapped to the draft genomes I had generated.
This all led me to suspect that the original samples could contain a mixture of species, but I didn’t have access to reference genomes and therefore couldn’t use something like BBSplit to parse the raw reads. I thought that if the samples were mixed species then metagenome assembly and binning could generate MAGs for each species in the sample. So I ran the following pipeline on the reads from the original 5 samples: BBDuk with the same parameters as before > MetaSPAdes > MetaQuast > MetaBAT2, MaxBin2, and CONCOCT for binning > DAS Tool for optimizing bins > Quast and lineage-specific CheckM. For each sample, MetaQuast BLASTn identified two references: a species belonging to the enriched genus and a genome of Streptococcus pneumoniae. However, none of the contigs in any sample aligned to S. pneumoniae. DAS Tool identified only a single bin for each sample, and these bins had >98% ANI to the corresponding original draft genomes. Quast indicated that the bins generally had fewer contigs than the corresponding original draft genomes, and I assume that this explains the majority of the differences in total length, GC content, etc. CheckM found that for most of the samples, the bins had slightly lower completeness and slightly lower contamination.
Based on the MetaQuast and DAS Tool output, I think that luckily the original enrichments were mostly pure. This probably means that, in the passages with low mapping to the draft genomes, the other species had grown to higher proportions.
So is using a metagenomics pipeline a valid means of determining if the original samples contained a mixture of species? This is again assuming that using something like BBSplit is not possible due to lack of reference genomes. If so, can I be reasonably confident from my analysis that the original samples contain primarily a single species? Additionally, do metagenomic assembly and binning work to remove contaminant contigs? If so, should I consider the bins/MAGs I generated to be “better” than the original draft genomes because they tend to have fewer short contigs and less contamination (though the bins/MAGs tend to have lower completeness)?
Thanks in advance!
It's not a silly question! And yes, it can be done. JGI is currently testing various binning tools to try to find the best protocol for this kind of decontamination. We get a lot of situations where, in particular, a fungus is mixed with some associated bacteria, or an "enrichment" ended up with 2 or 3 species. But I don't have any recommendations yet.
I generally use SendSketch to determine if a library appears to contain multiple species since it's fast and can run before any other processing. It also gives completeness and contamination estimates, though their accuracy depends on how closely related the sample is to something already in the reference set (RefSeq).
One of the other things we do at JGI to identify likely contamination in the assembly is a GC/coverage plot; basically, map the reads to the contigs and make a 2D scatter plot with contig GC on one axis and average coverage on the other. We also color the dots by the taxon of the contig's best BLAST (or Sketch) hit. So most of the time, when there is contamination (or an organelle), you will see two little clouds of dots at different GC and coverage, with many of the dots within a cloud having the same color. Not only can this be used to identify contaminated assemblies, but in many cases it can separate the organisms - particularly when you have low-level contamination and can simply use a coverage cutoff, or the organisms have very different GC. For more complex cases a dedicated binning tool could be better.
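To make the coverage-cutoff idea concrete, here is a minimal Python sketch. It is not JGI's actual code; the function names, the toy contigs, and the "25% of the median coverage" threshold are all assumptions made up for illustration. It assumes you already have each contig's sequence and its average coverage from read mapping. It computes GC per contig (the two values you would put on the scatter plot) and flags contigs whose coverage falls far below the median.

```python
from statistics import median

def gc_fraction(seq):
    """Fraction of G/C among the unambiguous A/C/G/T bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0

def split_by_coverage(contigs, min_fraction_of_median=0.25):
    """Split contigs into kept and flagged sets using a coverage cutoff.

    contigs: dict of name -> (sequence, mean_coverage).
    The cutoff (a fraction of the median contig coverage) is an
    illustrative assumption, not a recommended value.
    Returns (keep, flagged, cutoff); each dict maps name -> (gc, coverage).
    """
    cutoff = median(cov for _, cov in contigs.values()) * min_fraction_of_median
    keep, flagged = {}, {}
    for name, (seq, cov) in contigs.items():
        point = (gc_fraction(seq), cov)  # one dot on the GC/coverage plot
        (keep if cov >= cutoff else flagged)[name] = point
    return keep, flagged, cutoff

# Toy, made-up contigs: three high-GC, high-coverage contigs and one
# low-GC, low-coverage contig standing in for a low-level contaminant.
contigs = {
    "contig_1": ("ATGCGCGCATGCGGCC" * 10, 120.0),
    "contig_2": ("GCGCGGCCATGCGCGA" * 10, 110.0),
    "contig_3": ("GGCCGCATGCGCGCTA" * 10, 130.0),
    "contig_4": ("ATATTTAAATATGCAT" * 10, 6.0),
}

keep, flagged, cutoff = split_by_coverage(contigs)
for name, (gc, cov) in flagged.items():
    print(f"{name}\tGC={gc:.3f}\tcov={cov:.1f}\t(below cutoff {cutoff:.2f})")
```

A plain median over contigs only makes sense when the contaminant contributes a minority of the contigs; with many short contaminant contigs, a length-weighted statistic or simply eyeballing the plot would probably be safer. The `(gc, coverage)` pairs are what you would hand to a plotting library to draw the scatter plot itself.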