Dear all,
I am working with single-cell amplified genomes (SAGs) from archaea. I would like to determine the completeness and contamination of these genomes. For that, I have used CheckM, and for most of them the estimated completeness is below 90% and the contamination is above 5%. It is highly likely that some of the SAGs are contaminated with fragments of other genomes. In some SAGs, the predicted completeness is very low because CheckM is using general marker sets (domain-level markers), and probably because these genomes belong to taxa that are not well represented in the CheckM database.
For these reasons, I was considering using CheckM2 because it uses machine learning models, but it is based on KEGG annotations. At the moment, there are several disadvantages to using the KEGG database, including that it is not a public database and that the KEGG version CheckM2 uses is out of date (KEGG 2018).
Could anyone help me determine which option would be better?
Thanks in advance
Bea
Thank you so much for your quick answer. You are completely right. These are DPANN members, probably with reduced genome sizes. In my CheckM analysis I have also included published genomes from the same taxonomic group in order to have a reference for % completeness. I will compare the results from both CheckM and CheckM2. I would also like to know which marker genes are present, which are absent, and which are present in multiple copies. Is there a way to obtain this information from the CheckM internal files and CheckM results?
Thank you so much
Both programs create directories where they store intermediate files during the run. Some of those files will have information about the presence of individual markers, at least for CheckM. I don't remember exactly for CheckM2.
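Once you have pulled the marker hits out of those files, sorting them into absent, single-copy and multi-copy is simple. Here is a minimal Python sketch. It assumes you have first converted the hits into a two-column tab-separated table (genome ID, marker ID, one row per hit). That layout, the function name `classify_markers` and the marker names are all made up for illustration; this is not the actual format of the CheckM or CheckM2 intermediate files, so you would need to adapt the parsing step.

```python
import csv
import io
from collections import Counter, defaultdict


def classify_markers(hits_tsv, expected_markers):
    """Classify each expected marker per genome as absent, single-copy or multi-copy.

    hits_tsv: file-like object with rows "genome_id<TAB>marker_id", one row per hit
              (hypothetical layout, not a native CheckM/CheckM2 file).
    expected_markers: iterable of marker IDs in the marker set being evaluated.
    """
    counts = defaultdict(Counter)
    for row in csv.reader(hits_tsv, delimiter="\t"):
        # Skip blank lines and comment/header lines starting with '#'.
        if not row or row[0].startswith("#"):
            continue
        genome_id, marker_id = row[0], row[1]
        counts[genome_id][marker_id] += 1

    report = {}
    for genome_id, marker_counts in counts.items():
        report[genome_id] = {
            "absent": sorted(m for m in expected_markers if marker_counts[m] == 0),
            "single_copy": sorted(m for m in expected_markers if marker_counts[m] == 1),
            "multi_copy": sorted(m for m in expected_markers if marker_counts[m] > 1),
        }
    return report


# Toy input with invented genome and marker names.
example = (
    "SAG_01\tmarkerA\n"
    "SAG_01\tmarkerB\n"
    "SAG_01\tmarkerB\n"
    "SAG_02\tmarkerC\n"
)
report = classify_markers(io.StringIO(example), ["markerA", "markerB", "markerC"])
for genome_id, classes in sorted(report.items()):
    print(genome_id, classes)
```

Note that a genome with no hits at all will not appear in the report, because the table has no row for it. Markers found in multiple copies are the ones to inspect first when you suspect contamination, and comparing the absent list against the published reference genomes from the same group can help separate true lineage-specific losses from incompleteness.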
Thank you so much. Looking into these files, I have found a lot of information. Thank you again for your precious help. It was very important for me. I am trying to figure out how to take advantage of all the information.