To add to Joe's answer, I'd recommend CompareSketch in BBTools, which does both whole-genome AND 16S comparisons.
For example, I put 4 bacteria in a directory (after renaming them with the taxid for convenience):
ls
tid_1121402_Desulfobulbus_elongatus.fna.gz tid_869814_Desulfobulbus_alkaliphilus.fna.gz
tid_1391911_Micrococcus_aloeverae.fna.gz tid_993416_Micrococcus_cohnii.fna.gz
Then I ran this:
comparesketch.sh alltoall *.fna.gz format=3
Indexed 38716 unique and 38921 total hashcodes.
Loaded 4 sketches in 0.840 seconds.
#Query Ref ANI QSize RefSize QBases RBases QTaxID RTaxID KID WKID SSU
tid_1121402_Desulfobulbus_elongatus.fna.gz tid_869814_Desulfobulbus_alkaliphilus.fna.gz 80.488 3899064 4095408 3961953 4201268 1121402 869814 0.243 0.255 95.269
tid_1391911_Micrococcus_aloeverae.fna.gz tid_993416_Micrococcus_cohnii.fna.gz 86.888 2416945 2275564 2419301 2275595 1391911 993416 1.979 2.103 98.106
tid_869814_Desulfobulbus_alkaliphilus.fna.gz tid_1121402_Desulfobulbus_elongatus.fna.gz 80.492 4095408 3899064 4201268 3961953 869814 1121402 0.243 0.255 95.269
tid_993416_Micrococcus_cohnii.fna.gz tid_1391911_Micrococcus_aloeverae.fna.gz 86.885 2275564 2416945 2275595 2419301 993416 1391911 1.979 2.103 98.106
Ran 8 comparisons in 0.055 seconds.
Total Time: 0.895 seconds.
The output lists the query and reference first, followed by columns for size, taxonomy ID, and k-mer identity. The most notable columns are ANI, which is the approximate ANI estimated from k-mer intersection, and SSU, which is the exact identity of the 16S via alignment. WKID is the percent identity in k-mer space.
Unfortunately, I can't guarantee that it will always find the 16S, especially if it is only present in the assembly as a partial fragment, but it does a pretty good job.
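If the goal is a tree, the table above can be turned into pairwise distances. The following is only a minimal sketch, not part of BBTools: it assumes the format=3 rows are whitespace-separated as shown, and it treats 100 - ANI as a distance, which is my own simplification rather than anything CompareSketch prescribes. The function names `parse_rows` and `ani_distances` are hypothetical.

```python
# Header and rows copied from the format=3 output shown above.
HEADER = "#Query Ref ANI QSize RefSize QBases RBases QTaxID RTaxID KID WKID SSU"
ROWS = """\
tid_1121402_Desulfobulbus_elongatus.fna.gz tid_869814_Desulfobulbus_alkaliphilus.fna.gz 80.488 3899064 4095408 3961953 4201268 1121402 869814 0.243 0.255 95.269
tid_1391911_Micrococcus_aloeverae.fna.gz tid_993416_Micrococcus_cohnii.fna.gz 86.888 2416945 2275564 2419301 2275595 1391911 993416 1.979 2.103 98.106
tid_869814_Desulfobulbus_alkaliphilus.fna.gz tid_1121402_Desulfobulbus_elongatus.fna.gz 80.492 4095408 3899064 4201268 3961953 869814 1121402 0.243 0.255 95.269
tid_993416_Micrococcus_cohnii.fna.gz tid_1391911_Micrococcus_aloeverae.fna.gz 86.885 2275564 2416945 2275595 2419301 993416 1391911 1.979 2.103 98.106
"""


def parse_rows(header, text):
    """Turn whitespace-separated result rows into a list of dicts keyed by column name."""
    cols = header.lstrip("#").split()
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(cols):
            continue  # skip anything that is not a full result row
        records.append(dict(zip(cols, fields)))
    return records


def ani_distances(records):
    """Average both directions of each pair and return {pair: 100 - mean ANI}.

    Assumption: 100 - ANI is used as a simple distance. Pairs that do not
    appear in the output are simply absent from the result.
    """
    values = {}
    for rec in records:
        pair = frozenset((rec["Query"], rec["Ref"]))
        values.setdefault(pair, []).append(float(rec["ANI"]))
    return {pair: 100.0 - sum(v) / len(v) for pair, v in values.items()}


records = parse_rows(HEADER, ROWS)
for pair, dist in ani_distances(records).items():
    print(" vs ".join(sorted(pair)), round(dist, 3))
```

The resulting distances could be passed to any distance-based tree method. The same parsing works for the SSU column if you would rather build the tree from 16S identity.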
Hi,
I'd go for the 16S sequences. Even if the 16S sequences are not in the database per se, you should be able to extract them from the genome in many cases (see, for example, https://bioinformatics.stackexchange.com/questions/11489/looking-for-a-tool-to-find-16s-rrna-in-hundreds-of-genomes). A second possible approach is to identify 1:1 orthologues among the annotated amino acid sequences of the species, using e.g. Prokka and OrthoFinder, then align them and concatenate the alignments to build a phylogeny. I wouldn't recommend aligning whole genomes of unrelated species at the DNA level using standard MSA tools. Even if they could handle the amount of data, the sequences are too divergent, and these tools do not deal with inversions.
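The concatenation step of the orthologue approach can be sketched roughly as follows. This is an illustrative example only: it assumes each orthologue has already been aligned and that the FASTA headers are species identifiers shared across genes. The toy alignments and the helper names `read_fasta` and `concatenate` are hypothetical, and padding a missing species with gaps is my own choice (with strict 1:1 orthologues every species should be present).

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, using the first word of each header."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def concatenate(alignments, taxa):
    """Join per-gene alignments end to end; pad a missing taxon with gaps."""
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in an alignment must have the same length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Hypothetical toy alignments for two orthologous genes.
GENE_A = ">sp1\nMKT-L\n>sp2\nMKTAL\n>sp3\nMRTAL\n"
GENE_B = ">sp1\nGGS\n>sp2\nGAS\n"  # sp3 is missing from this gene

alignments = [read_fasta(GENE_A), read_fasta(GENE_B)]
taxa = sorted({name for aln in alignments for name in aln})
supermatrix = concatenate(alignments, taxa)

for taxon in taxa:
    print(f">{taxon}\n{supermatrix[taxon]}")
```

The concatenated FASTA can then be given to a tree-building program of your choice.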
Thank you! I will report back once I have done this.
It's important to know why you want to generate a species tree. What question are you trying to answer?
It is to showcase the relatedness of those species and strains in a project paper. It is not necessarily meant to answer a question, but to make the relationships easier and faster to understand. My supervisors also want it.