There are two main ways to generate the multiple sequence alignment needed to build a bacterial phylogenetic tree: the core SNP method and the core genome alignment method. A core SNP alignment is typically produced with a variant-calling pipeline such as Snippy, which requires a reference genome. The resulting file contains "core SNPs" from both genic and intergenic regions, provided that each of these positions is present in all strains under consideration. My first question is whether such a core SNP alignment is suitable for building maximum likelihood trees with tools like RAxML or IQ-TREE, given that these tools rely on explicit nucleotide or amino acid substitution models. An alignment made up only of SNP positions, with all invariant sites removed, might not satisfy the assumptions underlying these models.
The second approach is to concatenate core gene alignments into a single core genome alignment, which can be done with software such as Roary or Panaroo. Because the resulting file can be very large, I wonder whether run time and memory use could be reduced by keeping only the polymorphic (variable) sites, for example with the tool snp-sites. Would stripping the invariant sites in this way change the topology or branch lengths of the resulting phylogenetic tree?
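To make the question concrete, here is a minimal Python sketch of what "keeping only the polymorphic sites" does to an alignment. It is not the snp-sites implementation: the function name `split_variable_sites`, the handling of gaps and ambiguous bases, and the toy sequences are all assumptions made for illustration. It returns the reduced alignment together with a per-base count of the constant columns that were dropped, since that is the information lost by the reduction and, as far as I understand, the kind of information an ascertainment bias correction would need.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")


def split_variable_sites(alignment):
    """Split an alignment into variable columns and constant-site counts.

    `alignment` maps a strain name to its aligned sequence; all sequences
    must have the same length. This is a simplified, hypothetical helper:
    a column is "variable" if it holds more than one of A/C/G/T, and it is
    counted as "constant" only if every strain has the same unambiguous
    base. Columns with a single base plus gaps or Ns are neither kept nor
    counted.
    """
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same length")

    variable_columns = []
    constant_counts = Counter()
    for i in range(length):
        column = {seq[i] for seq in seqs}
        bases = column & UNAMBIGUOUS
        if len(bases) > 1:
            variable_columns.append(i)
        elif len(bases) == 1 and column == bases:
            constant_counts[next(iter(bases))] += 1

    reduced = {
        name: "".join(seq[i] for i in variable_columns)
        for name, seq in zip(names, seqs)
    }
    return reduced, variable_columns, constant_counts


# Toy alignment invented for this example (not real data).
toy = {"s1": "ACGTAC", "s2": "ACGTAC", "s3": "ATGTCC"}
reduced, columns, constants = split_variable_sites(toy)
print(reduced)          # SNP-only alignment
print(columns)          # 0-based positions of the retained columns
print(dict(constants))  # constant sites discarded, counted per base
```

In this toy case six columns shrink to two, and the four constant columns survive only as counts. Whether a tree built from the reduced alignment matches one built from the full alignment is exactly what I am unsure about.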