I originally posted this question on Bioinformatics Stack Exchange, but have yet to receive an answer, so I'm crossposting here.
Using the new NCBI Datasets platform, you can browse the collection of genomes associated with one or more taxa. For example, searching Pseudomonas aeruginosa returns 19,878 genomes as of 29 March 2023.
In the search filtering tab, they give the option to "Exclude atypical genomes". Applying this option to the previous search returns 19,737 genomes, implying that 141 genomes were discarded as atypical.
What criteria are they using to label genomes "atypical"? I've read the documentation for the NCBI Prokaryotic Genome Annotation Process, and my hunch is that atypical genomes are those that are annotated with atypical genes, though I'm not sure.
You can find more information about what constitutes an atypical genome in this FAQ. Here's the list of reasons for a genome to be marked as atypical:
Datasets flags genomes with a warning icon and message for the following genome problems:
chimeric - sequences from two different organisms are joined together.
contaminated - sequences from another organism, cloning vectors, linkers, adapters or primers are present in the assembly.
genome length too large - total non-gapped sequence length of the assembly is more than 1.5 times that of the average for the genomes in the Assembly resource from the same species, more than 15 Mbp, or is otherwise suspiciously long.
genome length too small - total non-gapped sequence length of the assembly is less than half that of the average for the genomes in the Assembly resource from the same species, less than 300 Kbp, or is otherwise suspiciously short.
hybrid - sequences from a hybrid between different species, strains or isolates.
low quality sequence - long stretches of the sequence have a high proportion of ambiguous bases, are low complexity, or have some other indication that the sequence quality is low.
mixed culture - sequences come from two or more organisms that were not cultured separately.
misassembled - alignment to related genome assemblies or other evidence indicates the assembly is likely to have errors.
partial - the assembly has only partial genome representation.
sequence duplications - assembly has one or more large duplications.
unverified source organism - the origin of the assembly is misidentified.
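Of the reasons above, only the two genome length rules come with explicit numbers, so they are the only ones that can be checked directly. Here is a rough Python sketch of those two rules as worded in the list. The function name `length_flag` and its parameters are made up for illustration and are not part of any NCBI tool, and the "otherwise suspiciously long/short" clause is not modeled because the FAQ does not define it.

```python
from typing import Optional

# Thresholds taken from the wording of the list above.
MAX_RATIO = 1.5                # "more than 1.5 times ... the average"
MIN_RATIO = 0.5                # "less than half ... the average"
MAX_ABSOLUTE_BP = 15_000_000   # "more than 15 Mbp"
MIN_ABSOLUTE_BP = 300_000      # "less than 300 Kbp"


def length_flag(ungapped_length_bp: int, species_mean_bp: float) -> Optional[str]:
    """Return the length flag for an assembly, or None if neither numeric rule fires.

    Hypothetical helper: it ignores the "otherwise suspiciously long/short"
    clause, which has no stated threshold.
    """
    if (ungapped_length_bp > MAX_RATIO * species_mean_bp
            or ungapped_length_bp > MAX_ABSOLUTE_BP):
        return "genome length too large"
    if (ungapped_length_bp < MIN_RATIO * species_mean_bp
            or ungapped_length_bp < MIN_ABSOLUTE_BP):
        return "genome length too small"
    return None


if __name__ == "__main__":
    # Illustrative species mean only; not a value reported by NCBI.
    mean_bp = 6_600_000
    for length in (6_500_000, 10_500_000, 3_000_000):
        print(length, length_flag(length, mean_bp))
```

The species mean has to come from the assemblies currently in the Assembly resource, so the same assembly could in principle be flagged or unflagged as that average shifts over time.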
Please let us know if you have any additional questions.
I think this is explained here: Assembly Anomalies and Other Reasons a Genome Assembly may be Excluded from RefSeq
If you get a different answer, please post their response here.