Confused about bin completion and contamination
2.5 years ago
arshad1292 ▴ 110

Hello,

I ran a binning tool and assessed completeness and contamination with CheckM. Now I am confused about what these two numbers mean. For example, I have a bin that is 70% complete and 5% contaminated. If these add up to 75%, then what is the remaining 25% of my bin?

Please help me understand this.

Many thanks in advance!

metagenomics bins metagenome • 1.8k views
2.5 years ago
Mensur Dlakic ★ 28k

Bin completeness and contamination in CheckM are estimated from the presence of single-copy marker genes that are almost universally shared among prokaryotes. If I remember correctly, there are 122 archaeal and 120 bacterial markers.

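In your case, 70% completeness means that CheckM found that fraction of the universal markers (approximately 85 out of about 120) in the bin. Similarly, 5% contamination means that ~6 markers were found in multiple copies. The two numbers are measured separately against the same marker set, so they are not parts of a whole and are not expected to add up to 100%.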
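To make the arithmetic concrete, here is a minimal sketch of how the two percentages can come out of marker counts. It is a simplification and not CheckM's actual implementation: every marker is weighted equally, and the function name `marker_stats`, the marker IDs, and the total of 120 markers are assumptions chosen only for illustration.

```python
from collections import Counter

def marker_stats(hits, total_markers):
    """Return (completeness %, contamination %) for a bin.

    hits: list of marker IDs detected in the bin, one entry per copy found.
    total_markers: number of single-copy markers expected for the lineage.
    Simplified model: all markers are weighted equally (an assumption).
    """
    counts = Counter(hits)
    present = len(counts)                        # markers seen at least once
    extra = sum(n - 1 for n in counts.values())  # copies beyond the first
    completeness = 100.0 * present / total_markers
    contamination = 100.0 * extra / total_markers
    return completeness, contamination

# Hypothetical bin: 84 of 120 markers found, 6 of those found twice.
hits = [f"m{i}" for i in range(84)] + [f"m{i}" for i in range(6)]
comp, cont = marker_stats(hits, total_markers=120)
print(f"completeness={comp:.1f}% contamination={cont:.1f}%")
# prints: completeness=70.0% contamination=5.0%
```

Both values use the same denominator but count different things (markers present versus extra copies), which is why 70% and 5% do not leave an unexplained 25%.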

The simplest reason a bin is only 70% complete is that the rest is literally missing. That could be because of sample preparation, assembly, or binning. Or it could mean that you are dealing with an organism that has a reduced genome, and that 70% is all it is supposed to have. It is impossible to know for sure until you perform a taxonomic classification. For Nanoarchaea and Aenigmarchaea, 70% completeness could actually mean that the genomes are fully complete. Similarly, for CPR members in Bacteria, 75-80% completeness is typically all they have.

The contamination you have is not bad if this is a metagenomic sample. Contamination usually means that the bin picked up DNA that doesn't belong to it, which results in multiple marker copies. However, some prokaryotes do carry multiple copies of what are normally single-copy marker genes. Finally, sequencing errors at extremely high depth can sometimes make a single genome look contaminated.

Thank you for your detailed answer. This really clarifies things for me. Just one last question: you said that "It is impossible to know for sure until you perform a taxonomic classification", so I performed taxonomic classification with Kraken 2 at the family level. It assigned several different families to the same bin. What else can we learn from this classification, particularly about the completeness and contamination that you mentioned? Many thanks!

I recommend GTDB-Tk for bin classification (from the same group that made CheckM). It will automatically find the most confident taxonomic level at which your bins can be classified, which can sometimes be a species. If you get a taxonomic group that is normally expected to have 100% of the single-copy marker genes, then you will know that your bins with 70% completeness are truly incomplete. If your bins get classified into one of the reduced-genome groups (I outlined only some of them above), that could mean that they are complete. CheckM has a different set of markers for assessing the completeness of CPR genomes.

https://github.com/Ecogenomics/CheckM#estimating-quality-of-cpr-genomes
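As a rough illustration of the reasoning above, the sketch below reads a GTDB-style lineage string (ranks separated by `;`, with prefixes such as `p__` for phylum) and suggests how to read a completeness value. The function name, the returned messages, and especially the phylum labels in `REDUCED_GENOME_PHYLA` are assumptions for illustration; phylum names differ between GTDB releases and should be checked against the release you actually use.

```python
# Assumed phylum labels for reduced-genome lineages; verify against your GTDB release.
REDUCED_GENOME_PHYLA = {
    "p__Patescibacteria",  # label commonly used for CPR (assumption)
    "p__Nanoarchaeota",
    "p__Aenigmarchaeota",
}

def interpret_completeness(classification, completeness,
                           reduced_phyla=REDUCED_GENOME_PHYLA):
    """Suggest how to read a completeness estimate given a lineage string."""
    ranks = classification.split(";")
    # Pick the phylum rank; an empty "p__" means no phylum was assigned.
    phylum = next((r for r in ranks if r.startswith("p__")), "p__")
    if phylum == "p__":
        return "no phylum assigned: cannot interpret completeness"
    if phylum in reduced_phyla:
        return (f"{phylum}: reduced-genome lineage, "
                f"{completeness:.0f}% may already be a complete genome")
    return (f"{phylum}: full marker set expected, "
            f"{completeness:.0f}% suggests a truly incomplete bin")

# Hypothetical lineage strings for two bins with 70% completeness.
print(interpret_completeness("d__Bacteria;p__Patescibacteria;c__;o__;f__;g__;s__", 70))
print(interpret_completeness("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria", 70))
```

This only encodes the rule of thumb from the answer; for CPR bins, the CPR-specific marker set described at the link above is the more direct check.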
