I want to analyse H. pylori genomes that were assembled from isolates; the strains are unknown. There are 100+ reference genomes from different strains available. From pan-genomics we know that H. pylori has an open pangenome and that many strains share only a small number of genes with each other. How could one evaluate the quality of an assembly without knowing the closest reference genome? Are there metrics to evaluate the completeness of the assembly without taking a reference into account, or would mapping coverage suffice? Also, how can SNPs be compared when the assemblies potentially have different reference genomes? Only compare the core genes? I could not find suitable literature that explains the implications of multiple reference genomes for SNP calling and assembly evaluation. Any hints welcome!
As in, you have assembled multiple unknown strains? Or do you want to look through the available strains to find matches for your sequenced samples of unknown strain?
BUSCO is a useful metric that assemblies typically report: it estimates the completeness of the gene content based on near-universal single-copy orthologs. Often you can also get physical estimates of genome size that can be compared to the golden path length.
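For the size comparison, a minimal sketch of reference-free summary statistics might look like the following. The function name `assembly_stats` and the `expected_size` parameter are made up for illustration, and N50 is an additional contiguity metric that was not mentioned above. The expected size is something you supply yourself, for example from the size range of the published H. pylori genomes.

```python
def assembly_stats(contig_lengths, expected_size=None):
    """Summarise an assembly from its contig lengths alone (no reference needed).

    contig_lengths: iterable of contig lengths in bp.
    expected_size:  optional genome size estimate in bp (an assumption you supply).
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the cumulative sum of the
    # length-sorted contigs first reaches half of the total assembly length.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    stats = {"contigs": len(lengths), "total_length": total, "n50": n50}
    if expected_size:
        # A ratio far from 1 hints at a missing or duplicated/contaminated sequence.
        stats["size_ratio"] = total / expected_size
    return stats


if __name__ == "__main__":
    print(assembly_stats([400, 300, 200, 100], expected_size=2000))
```

These numbers say nothing about gene content, so they complement BUSCO rather than replace it.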
Regarding metrics: if you want to do some metagenomic analyses, you should see a typically small but significant improvement in mapping efficiency, and a decrease in missing genotypes, when using a more closely related reference genome. Here is a paper I wrote on the topic that fleshes out the implications of reference genome choice. It deals with vertebrates, but some of the implications should hold true for prokaryotes too.
Exactly, we have assembled multiple unknown strains.
Thank you very much, I will use BUSCO metrics to compare assembly quality. However, it is still not clear what the best way would be to call SNPs in order to compare those strains. Use the phylogenetically closest reference for all strains? But how to find the closest one for all of them? Map each strain to all known references and compare the mapping stats, or build a phylogenetic tree, e.g. using the core genes? One major problem seems to be that these strains share only some genes. Can I assume that only the genes that are present in all strains can be used for SNP calling?
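One alignment-free way to shortlist the closest reference, instead of mapping every strain to every reference, is to compare k-mer content. The sketch below is only an illustration of the idea: the function names are hypothetical, it holds full k-mer sets in memory, and it is not the method of any particular tool (Mash, for example, approximates a similar comparison at scale with MinHash sketches).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k=21):
    """Return the set of canonical k-mers (the lexicographically smaller of a
    k-mer and its reverse complement), so contig orientation does not matter."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
            rc = kmer.translate(_COMPLEMENT)[::-1]
            kmers.add(min(kmer, rc))
    return kmers


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_reference(assembly_seq, references, k=21):
    """Score an assembly against every reference and return (best_name, scores).

    references: dict mapping a reference name to its sequence (hypothetical input;
    in practice you would read these from FASTA files).
    """
    query = canonical_kmers(assembly_seq, k)
    scores = {name: jaccard(query, canonical_kmers(seq, k))
              for name, seq in references.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

With an open pangenome, the scores will be dragged down by accessory genes, so it may be worth treating the ranking rather than the absolute values as the useful output.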
I'm not familiar with the state of tools for whole-genome phylogenies, but they do exist. If you've annotated the genomes, you could use something like OrthoFinder, which infers gene trees and then an overall phylogeny from those gene trees. I'm not sure how robust or accepted this methodology is in the literature, so you may need to look into it, but it sounds reasonable to me. If H. pylori has known loci that are used to distinguish between strains, that could also be an easy solution, though such loci are usually used to delineate species in a complex rather than strains.
If you identify SNPs in syntenic regions, you can make direct comparisons of variants. I used to use a tool called Satsuma2 for synteny. I liked it because it was k-mer based, whereas most synteny tools are anchored around genes.
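Whether the shared regions come from synteny or from a core gene set, the comparison itself reduces to counting differences at positions that every strain has. A minimal sketch, assuming you already have a gene presence list per strain and an alignment of the shared regions (all names and inputs here are hypothetical):

```python
from itertools import combinations


def core_genes(presence):
    """Genes present in every strain.

    presence: dict mapping strain name to an iterable of gene identifiers.
    """
    gene_sets = [set(genes) for genes in presence.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


def snp_distance(seq_a, seq_b):
    """Count mismatches between two aligned sequences of equal length,
    ignoring columns where either sequence has a gap or an ambiguous base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid and a != b
    )


def pairwise_snp_distances(alignment):
    """SNP distance for every pair of strains.

    alignment: dict mapping strain name to its aligned shared-region sequence.
    """
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }
```

Because only columns covered in both strains are counted, strains that share little sequence will be compared over few positions, which is worth reporting alongside the distances.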
You're also certainly not the first to publish multiple unknown strains in a single paper. I'm sure with some searching you can find similar papers and find out what approaches they used.