Hi all,
I was wondering if there's a streamlined way of getting the "representative" genome for a particular species. What I'm looking for is an automated way of retrieving a genome assembly and annotation for every bacterial species in, e.g., Kraken2 output. The most popular databases for Kraken2 are built from RefSeq, so I'd imagine there should be a relatively easy way to match a Taxonomy ID/species name to a RefSeq ID.
Any advice would be appreciated, as always!
All the best
-- Alex
I think the idea is to pull "the representative" genome. Other than maybe PAO1 for Pseudomonas, I can't think of "the" representative strain for bacterial species...
Every bacterial species will likely have one strain that is used more often than the rest, e.g. Escherichia coli str. K-12 substr. MG1655, along with the others in the collection above.
I think this question is asking for both (1) a list of accessions giving the "one strain that is used more often" for each species and (2) a method of downloading the associated genomes. So far I have only seen answers for (2), but I would be interested in the answer to (1) for my own edification.
If you visit the list I linked above, you can download the accession numbers of the assemblies that NCBI has put together in the RefSeq collection. Use the drop-down to change the display from Summary to ID table, then download the list by sending it to a file. NCBI's selection of a strain/genome for each organism is likely human-curated, but may not be perfect.
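If you'd rather script the download step, here is a rough sketch of turning an assembly's directory path into file URLs. It assumes you have the `ftp_path` value that NCBI's assembly_summary.txt tables list for each assembly, and that the files follow the usual `<directory name>_genomic.fna.gz` / `_genomic.gff.gz` naming. The function name `assembly_file_urls` is just something I made up for illustration, and the path in the usage lines is the one I'd expect for E. coli K-12 MG1655, so double-check it against the real table.

```python
def assembly_file_urls(ftp_path, suffixes=("_genomic.fna.gz", "_genomic.gff.gz")):
    """Build download URLs for one assembly directory.

    Assumption: files inside the directory are named
    <last path component><suffix>, e.g. <dir>/<dir>_genomic.fna.gz.
    """
    base = ftp_path.rstrip("/")
    # Older tables list ftp:// paths; the same tree is served over https.
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    name = base.rsplit("/", 1)[-1]
    return [f"{base}/{name}{suffix}" for suffix in suffixes]


# Usage: sequence (FASTA) and annotation (GFF) URLs for one assembly.
urls = assembly_file_urls(
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/"
)
for url in urls:
    print(url)
```

The URLs can then be handed to wget/curl or `urllib.request.urlretrieve`. I left the actual download out so the snippet runs without network access.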
Thanks for clarifying. I interpreted "There are 15507 assemblies that represent 236000 prokaryotic RefSeq genome collection as of early 2022" as meaning you could pull out 15507 assemblies for unique species, not that there was already an ontology term "representative_genome" (and "reference_genome") indicating that manual/automated curation had been performed. That was the missing link.
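For anyone landing here later, this is roughly how I'd use that term to get one accession per species taxid. It's a minimal sketch, assuming a tab-separated assembly summary table with `refseq_category`, `species_taxid`, `assembly_accession` and `ftp_path` columns, and assuming the category values are spelled "reference genome" and "representative genome" (check what the current file actually uses). The function name `pick_representatives` is mine, and the rows in `SAMPLE` are entirely made up so the snippet runs offline.

```python
# Lower rank wins: a "reference genome" is preferred over a "representative genome".
PREFERRED = {"reference genome": 0, "representative genome": 1}


def pick_representatives(summary_text):
    """Return {species_taxid: row_dict} for reference/representative assemblies."""
    header = None
    best = {}
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # The column header is a comment line starting with the first column name.
            stripped = line.lstrip("#").strip()
            if stripped.startswith("assembly_accession"):
                header = stripped.split("\t")
            continue
        if not line.strip():
            continue
        if header is None:
            raise ValueError("no header line found before the data rows")
        row = dict(zip(header, line.split("\t")))
        rank = PREFERRED.get(row["refseq_category"])
        if rank is None:
            continue  # e.g. "na": not flagged as reference/representative
        key = row["species_taxid"]
        if key not in best or rank < best[key][0]:
            best[key] = (rank, row)
    return {taxid: row for taxid, (rank, row) in best.items()}


# Made-up, trimmed rows for illustration only (not real NCBI records).
SAMPLE = "\n".join([
    "## made-up excerpt with only the columns used here",
    "#" + "\t".join(["assembly_accession", "refseq_category", "species_taxid", "ftp_path"]),
    "\t".join(["GCF_000000001.1", "na", "1111", "https://example.org/GCF_000000001.1_asmA"]),
    "\t".join(["GCF_000000002.1", "reference genome", "1111", "https://example.org/GCF_000000002.1_asmB"]),
    "\t".join(["GCF_000000003.1", "representative genome", "1111", "https://example.org/GCF_000000003.1_asmC"]),
    "\t".join(["GCF_000000004.1", "representative genome", "2222", "https://example.org/GCF_000000004.1_asmD"]),
])

reps = pick_representatives(SAMPLE)
for taxid, row in sorted(reps.items()):
    print(taxid, row["assembly_accession"], row["refseq_category"])
```

Keying on `species_taxid` rather than the strain-level taxid is deliberate: classifier output at species rank reports the species-level ID, so that is the value to look up in the resulting dictionary.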