Question

Genome completeness and contamination: CheckM vs CheckM2

0

7 months ago

beacamara • 0

Dear all,

I am working with single-cell genomes (SAGs) from archaea and would like to determine their completeness and contamination. I ran CheckM, and for most of them the estimated completeness is below 90% and the contamination is above 5%. It is highly likely that some of the SAGs are contaminated with fragments of other genomes. In some SAGs, the predicted completeness is very low because CheckM is using general marker sets (domain markers) and probably because these genomes belong to taxa that are not well represented in the CheckM database.

For these reasons, I was considering CheckM2, which uses machine learning models but relies on KEGG annotations. At the moment, using KEGG has several disadvantages: it is not a public database, and the KEGG version that CheckM2 uses is out of date (KEGG 2018).

Could anyone help me decide which option would be better?

Thanks in advance
Bea

genome-completeness contamination • 1.2k views

score 2 · Accepted Answer · 2024-04-03

7 months ago

Mensur Dlakic ★ 28k

"In some SAGs, the predicted completeness is very low because CheckM is using general marker sets (domain markers) and probably because these genomes belong to taxa that are not well represented in the CheckM database."

I don't think so. Not too long ago I analyzed ~250 SAGs, and <10% of them were >50% complete. It is well known that most SAGs end up being incomplete. If you have some DPANN members (those with reduced genome sizes) their completeness might be underestimated by ~30% with CheckM, yet I doubt all of the SAGs in your dataset are in that category.

To answer your final question: if you are asking whether to use CheckM or CheckM2 because you are hoping that one of them will give you higher completeness than the other, I don't think that will be the case except that CheckM2 may give higher completeness for DPANN SAGs. Other than that I think you can't go wrong with CheckM as it has been around for a long time. I still prefer the format of its output to that of CheckM2. Still, the best option may be to use both approaches and decide from the results which one works better for you.
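If you do run both tools, a side-by-side table makes the decision easier. Below is a minimal Python sketch of that comparison. The genome names and percentages are made-up placeholders, not real results, and the dictionaries stand in for whatever you parse from each tool's report.

```python
# Hypothetical (completeness %, contamination %) per genome; placeholder values only.
checkm_estimates = {
    "SAG_01": (40.0, 1.5),
    "SAG_02": (62.5, 6.0),
}
checkm2_estimates = {
    "SAG_01": (55.5, 1.0),
    "SAG_02": (64.0, 5.5),
}


def compare_estimates(first, second):
    """Return (name, completeness difference, contamination difference)
    for every genome present in both result sets (second minus first)."""
    rows = []
    for name in sorted(first.keys() & second.keys()):
        comp_a, cont_a = first[name]
        comp_b, cont_b = second[name]
        rows.append((name, round(comp_b - comp_a, 2), round(cont_b - cont_a, 2)))
    return rows


for name, d_comp, d_cont in compare_estimates(checkm_estimates, checkm2_estimates):
    print(f"{name}\tcompleteness {d_comp:+.1f}\tcontamination {d_cont:+.1f}")
```

Including the published reference genomes from the same group in both runs gives you a baseline for judging whether a difference between the tools is meaningful.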

0

Thank you so much for your quick answer. You are completely right. These are DPANN members, probably with reduced genome sizes. In my CheckM analysis I also included published genomes from the same taxonomic group as a reference for completeness. I will compare the results from CheckM and CheckM2. I would also like to know which marker genes are present, which are absent, and which are present in multiple copies. Is there a way to obtain this information from the CheckM internal files and CheckM results?

Thank you so much

Reply 7 months ago by beacamara • 0

1

Both programs create directories where they store intermediate files during the run. Some of those files will have the information about the presence of individual markers, at least for CheckM. I don't remember exactly for CheckM2.
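Once the marker hits are extracted from those files, sorting them into present, absent, and multi-copy is straightforward. The Python sketch below is only an illustration: the marker IDs, the grouping into sets, and the hit list are hypothetical, and the scoring follows the general marker-set idea (average fraction of markers found per set, and average extra copies per set) rather than reproducing CheckM's own code.

```python
from collections import Counter

# Hypothetical marker sets and hits; real IDs would come from the tool's files.
marker_sets = [{"M1", "M2"}, {"M3", "M4", "M5", "M6"}]
hits = ["M1", "M2", "M3", "M3", "M4"]  # one entry per marker gene copy found


def classify_markers(marker_sets, hits):
    """Split expected markers into single-copy, absent, and multi-copy lists."""
    counts = Counter(hits)
    expected = set().union(*marker_sets)
    single = sorted(m for m in expected if counts[m] == 1)
    absent = sorted(m for m in expected if counts[m] == 0)
    multi = sorted(m for m in expected if counts[m] > 1)
    return single, absent, multi


def set_based_estimates(marker_sets, hits):
    """Return (completeness %, contamination %) averaged over marker sets."""
    counts = Counter(hits)
    completeness = 0.0
    contamination = 0.0
    for markers in marker_sets:
        # Fraction of this set's markers seen at least once.
        completeness += sum(1 for m in markers if counts[m] >= 1) / len(markers)
        # Extra copies beyond the first, relative to the set size.
        contamination += sum(counts[m] - 1 for m in markers if counts[m] > 1) / len(markers)
    n_sets = len(marker_sets)
    return 100 * completeness / n_sets, 100 * contamination / n_sets


single, absent, multi = classify_markers(marker_sets, hits)
print("single-copy:", single)
print("absent:", absent)
print("multi-copy:", multi)
print("estimates:", set_based_estimates(marker_sets, hits))
```

For SAGs, the absent list is usually the interesting part: markers missing from every genome in a lineage, including the published references, point to a marker set that does not fit the group rather than to incomplete assemblies.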

Reply 7 months ago by Mensur Dlakic ★ 28k

0

Thank you so much. Looking into these files, I have found a lot of information. Thank you again for your precious help. It was very important for me. I am now trying to figure out how to take advantage of all this information.

Reply 7 months ago by beacamara • 0