Question

Why do we allow contamination during binning?

2.6 years ago

arshad1292 ▴ 110

Hello,

This question may be trivial for many of you, but I am kinda new to metagenomics and confused about binning. I have seen metagenomic papers where the authors used thresholds of, let's say, 70% completeness and 5% contamination when selecting metagenomic bins.

I am confused as to why we allow contamination in a bin.

I understand that the completeness criterion is kept relaxed so that bins covering 70% or so of a genome are retained, but why is contamination allowed?

Thanks in advance!

completion MAGs contamination metagenomic metagenome • 906 views

score 2 · Answer 1 · 2022-04-16

Not sure if this is just imprecise wording, or if you really don't understand how difficult metagenomic binning is. Either way, the answer is that contamination in most cases can't be helped. It is not something we can always control.

One can ask why people allow a cake in the oven to get burned. That can be because of negligence if one forgets to check the doneness or completely ignores the timer. But there can be objective reasons: it is easier to burn a chocolate cake because it is dark already, so it is more difficult to tell when it burns a little. It is also easier to burn a cake in a dish that doesn't heat up uniformly.

For metagenomic binning we often start with a DNA sample from an unknown population. Not only do we not know the composition of that population (it could have 3 members or 300), but we also have no idea about their abundances (some can make up 50% of the total and others 0.01%). Some members are closely related, and others are completely different. When that mixture of DNA is sequenced, many members are at such low abundance that they will not be sequenced at adequate depth for proper assembly. If we force the issue and do really deep sequencing, then sequencing errors will complicate the assembly. On top of that, many members are closely related, and therefore have near-identical DNA sequences, and we have to separate them as well.

This is to say that contamination in metagenomic bins is rarely a result of human negligence (when we "allow" it). Most of the time it comes from either the community's complexity or the imperfections of sample collection, sequencing methods, and assembly, and we can't do much to fix that. Finally, the subsequent evaluation of bins is also not perfect, and it often deems them "contaminated" or "incomplete" even though those are relative terms.
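To put rough numbers on the low-abundance problem, here is a back-of-the-envelope sketch. It assumes reads are drawn in proportion to each member's share of the DNA in the sample, and the run size (10 Gbp) and genome size (5 Mbp) are made-up values for illustration, not figures from any particular study.

```python
def expected_coverage(total_bases, relative_abundance, genome_size):
    """Mean read depth expected for one genome.

    Assumes reads are sampled in proportion to each member's share of
    the DNA in the sample (a simplification of real sequencing runs).
    """
    return total_bases * relative_abundance / genome_size


TOTAL_BASES = 10e9   # hypothetical run: 10 Gbp of reads in total
GENOME_SIZE = 5e6    # hypothetical genome: 5 Mbp

# Members making up 50%, 1% and 0.01% of the DNA in the sample.
for abundance in (0.5, 0.01, 0.0001):
    depth = expected_coverage(TOTAL_BASES, abundance, GENOME_SIZE)
    print(f"{abundance:.2%} of the sample -> ~{depth:g}x coverage")
```

Under these assumptions the dominant member gets about 1000x coverage, while the 0.01% member gets about 0.2x, meaning most of its genome is never read at all. Getting that rare member to a usable depth would mean sequencing roughly a hundred times more, which is where the "force the issue" trade-off comes from.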

Rather than telling you about all the objective reasons why metagenomic binning is not easy and why contamination is often inevitable, I will show below a real-life example of a medium-complexity microbial community (this one has fewer than 100 members, and there are communities with more than 300). Tetranucleotide frequencies of DNA sequences were used to embed them in 2D space, and they were subsequently grouped by density-based clustering. Hopefully you will be able to see below that there are many groups on the periphery that are well clear of everything else, and most of them have little or no contamination. However, there are many groups in the middle (for example, 87-89) where it is very difficult to make out individual species.

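[Image: 2D embedding of the community's DNA sequences by tetranucleotide frequency, with numbered groups from density-based clustering]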
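If you want to play with the idea yourself, here is a toy sketch of the same kind of analysis. It is not the pipeline used for the figure: the contigs are synthetic (random sequences with made-up base compositions), there is no 2D embedding step (clustering is done directly on the 256-dimensional frequency vectors), and the `eps` and `min_pts` values are arbitrary choices that happen to suit this toy data. The DBSCAN here is a minimal hand-written version so the example only needs the standard library.

```python
import random
from itertools import product
from math import dist

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {k: i for i, k in enumerate(KMERS)}


def tnf(seq):
    """Normalized tetranucleotide frequency vector (256 values)."""
    counts = [0] * len(KMERS)
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])
        if j is not None:  # skip windows containing N or other symbols
            counts[j] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one label per point, -1 means noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # core point: keep expanding
                queue.extend(nb)
    return labels


def random_contig(rng, weights, length=5000):
    """Synthetic contig with the given A, C, G, T probabilities."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))


rng = random.Random(0)
# Three hypothetical "genomes": two with nearly identical composition
# (standing in for close relatives) and one that is clearly different.
at_rich = [random_contig(rng, (0.35, 0.15, 0.15, 0.35)) for _ in range(5)]
relative = [random_contig(rng, (0.34, 0.16, 0.16, 0.34)) for _ in range(5)]
gc_rich = [random_contig(rng, (0.15, 0.35, 0.35, 0.15)) for _ in range(5)]

vectors = [tnf(s) for s in at_rich + relative + gc_rich]
labels = dbscan(vectors, eps=0.05, min_pts=3)
print(labels)
```

With these settings the GC-rich contigs should end up in a cluster of their own, like the clean groups on the periphery of the figure, while the two near-identical "genomes" should be lumped into a single cluster. That merged cluster is a contaminated bin, and nobody "allowed" it: the signal simply does not separate them. Real binners typically also merge reverse-complement k-mers and add coverage information, which helps but does not remove the problem.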