Hi,
For a large-scale genome analysis, I am looking for a set of bacterial reference genomes. An ideal set would contain around 1k-10k whole genomes that cover a broad range of bacteria, have a certain dissimilarity to each other and do not cover the same organism multiple times (i.e. not multiple strains per organism).
Sequence databases offer sequences in large quantities of course, but I am unsure how to select a sensible subset. However, I feel like a lot of people must have had similar problems in the past. Is anyone aware of a) any data collection that might fit my needs, or b) a piece of work dealing with how to choose reference sequences?
You can find the current list of bacterial genomes available at NCBI here. Since the choice of genomes is somewhat subjective, you will need to decide which combination works for you.
Isn't NCBI RefSeq what you are looking for?
Or maybe you can look into previous consortium efforts to build a non-redundant database, such as HMP.
NB: In case you want to download bacterial RefSeq, here is the recipe:
PS: Be aware that downloading 100+ GB of data is going to take some time.
Not quite, since the following was listed as a requirement: "do not cover the same organisms multiple times (i.e. not multiple strains per organism)".
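For what it's worth, here is a minimal sketch of how one could thin a genome listing down to one assembly per species. It assumes a tab-separated table in the style of NCBI's assembly_summary.txt, with columns named assembly_accession, refseq_category, species_taxid and assembly_level; please check those names against the README of the file you actually download. The preference order (reference before representative, complete before draft) is my own assumption, and the accessions in the example are made up.

```python
import io

# Assumed preference order, for illustration only (not an NCBI rule).
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def read_summary(handle):
    """Parse an assembly_summary-style TSV into a list of dicts.

    The last '#' line before the data is taken as the column header.
    """
    header = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no '#' header line found before the data")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def rank(row):
    """Lower tuples are preferred; unknown values sort last."""
    return (
        CATEGORY_RANK.get(row["refseq_category"], 2),
        LEVEL_RANK.get(row["assembly_level"], 4),
    )


def one_per_species(rows):
    """Keep the best-ranked assembly for each species_taxid."""
    best = {}
    for row in rows:
        key = row["species_taxid"]
        if key not in best or rank(row) < rank(best[key]):
            best[key] = row
    return list(best.values())


# Tiny made-up table; the accessions are placeholders, not real records.
EXAMPLE = "\n".join([
    "# assembly_accession\trefseq_category\tspecies_taxid\torganism_name\tassembly_level",
    "GCF_A\tna\t562\tEscherichia coli strain X\tContig",
    "GCF_B\treference genome\t562\tEscherichia coli K-12\tComplete Genome",
    "GCF_C\tna\t1423\tBacillus subtilis strain Y\tScaffold",
    "GCF_D\tna\t1423\tBacillus subtilis strain Z\tComplete Genome",
])

if __name__ == "__main__":
    picked = one_per_species(read_summary(io.StringIO(EXAMPLE)))
    for row in picked:
        print(row["assembly_accession"], row["organism_name"])
```

Note that this only handles the "one strain per organism" part. It does nothing about the dissimilarity requirement, which would need an actual sequence-level comparison between the selected genomes.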
Hi, same issue here. Did you end up finding a way to get this set? Thanks
My comment above has a link to the list of current bacterial genomes.