First consider the naive non-automatic approach to this question.
Lets use Escherichia coli str. K-12 substr. MG1655 as an example it has the taxonomic ID 511145. In the gnome DB of NCBI I can find this entry:
https://www.ncbi.nlm.nih.gov/genome/?term=txid511145[Organism:exp]
The first line says -> Reference genome: Escherichia coli str. K-12 substr. MG1655. Which links to:
https://www.ncbi.nlm.nih.gov/genome/167?genome_assembly_id=161521
Which is what I want. I can no use the RefSeq Accession to get the genome.
Now lets try this automatically:
esearch -db genome -query "txid511145[orgn]" | esummary
The result is the following:
<DocumentSummary><Id>167</Id>
<Organism_Name>Escherichia coli</Organism_Name>
<Organism_Kingdom>Bacteria</Organism_Kingdom>
<Organism_Group></Organism_Group>
<Organism_Subgroup>Gammaproteobacteria</Organism_Subgroup>
<Defline>A well-studied enteric bacterium</Defline>
<ProjectID>12319</ProjectID>
<Project_Accession>PRJNA12319</Project_Accession>
<Status>Complete</Status>
<Number_of_Chromosomes>3</Number_of_Chromosomes>
<Number_of_Plasmids>10</Number_of_Plasmids>
<Number_of_Organelles>0</Number_of_Organelles>
<Assembly_Name>ASM2632v1</Assembly_Name>
<Assembly_Accession>GCA_000026325.2</Assembly_Accession>
<AssemblyID>1181661</AssemblyID>
<Create_Date>1998/10/13 00:00</Create_Date>
<Options></Options>
<Weight>900</Weight>
<Chromosome_assemblies>484</Chromosome_assemblies>
<Scaffold_assemblies>9772</Scaffold_assemblies>
<SRA_genomes>0</SRA_genomes>
<TaxId>562</TaxId>
</DocumentSummary>
The first problem is that this querry brings me to species level (see TaxId 562) and not to the far more specific level of my query (strain).
The specific question in this case would be how can you link the TaxId 511145 to the genome_assembly_id 161521 using an esearch querry. Obvious the connection is there as I can make the link on the website.
Are you trying to link E. coli taxonomy ID
txid511145
to assembly ID161521
which is a Mycobacterium tuberculosis assembly?Hm not sure whats wrong here but consider my second url
https://www.ncbi.nlm.nih.gov/genome/167?genome_assembly_id=161521
This links to Ecoli. Maybe NCBI has different assembly ids for different DBs?
EDIT: You mean the ID in the NCBI database assembly. I mean the assembly id in the genome database.
While
esearch -db genome -query "167" | esummary
is linking to the right genome as you say the first line on the Ecoli page is linking togenome_assembly_id
you provide a link for. Since there isn't a database for assemblies I wonder if you can't replicate this search via eutils (https://www.ncbi.nlm.nih.gov/genome/167?genome_assembly_id=161521 ) .It looks like there is an assembly database that can be queried. With this query you can still get multiple genomes that equate to E coli of your choice. You probably need to pick one. I don't think there is a best.
As of today it generates:
Uh NCBI is a mess. "Best" is the wrong word, but there is an "official" one. At least one genome is linked with the taxid by the way I described. But this is also probably a little arbitrary and nobody is really checking which assembly should be the representative for the specie. With the way you showed I could simply always take the first entry (which is the newest).
But testing this with other bacteria this query gives not really satisfying results. Example 525338 aka Lactobacillus plantarum subsp. plantarum ATCC. Your query returns three entries for which non links to the entry I would find in the genome db by hand. Which is a far better assembly. The one I get from the assembly db with your query is still in multiple scaffolds.
https://www.ncbi.nlm.nih.gov/nuccore/AZEJ00000000.1/
vs
https://www.ncbi.nlm.nih.gov/nuccore/NC_004567.2
So it seems like I would prefer the genome db instead of the assembly db.
I wouldn't say that NCBI is a mess, but their reasons and criteria for selecting a "reference genome" for each "species" may not be in line with your reasons for wanting a reference genome, and the concept of "species" varies a lot between different types of organisms. Bacteria and viruses don't usually have "species boundaries" that are as easy to define as they are for some eukaryotes such as vertebrates or plants.
Even within the bacteria, the idea of what we name as one "species" in the enterobacteria such as Escherichia coli, is far different than in other types of bacteria such as the Clostridia. There are many Clostridia that got the name "Clostridium botulinum" despite the fact that they are far more diverse than the distance between Escherichia, Salmonella and Shigella.
The NCBI "reference genomes" are duplicates of author-provided genomes. My understanding is that they created these duplicates with NC_nnn accession numbers, so that they (NCBI) could control the annotation whereas only the authors of every other record are in charge of the annotations.
Probably there is a reason that NCBI is what it is and much like in biology just makes sense in the light of history/ evolution.
But still NCBI made the effort to provide the public for every txid of a species or strain (and I don't want to go into the discussion what these two terms mean. For me it is just important that these things simply exist, hence can be grown in a petri dish and are not just a node in a tree) with a reference genome. And thats the only thing I want. I know how to do tha by hand but cant figure out a way how to do it automatically.
If there would be a workaround with the assembly db I would use that but it lacks what the genome db offers, a preselection for a "good" genome. In the case of Lacotbacillus ATCC there is now "good" assembly hence they linked a "better" one from close strain/ specie.
I don't think we are going to solve this dilemma here. There will always be more than one genome for a TaxID. Much as we want things to work automatically NCBI has finite resources and may not have the expertise to decide the "best" representative.
It is easy to gripe about NCBI but remember much of this information may be locked behind paywalls if NCBI did not exist.
But they do it! They provide a reference genome per taxid! I showed that in the question!
You simply take the URL
https://www.ncbi.nlm.nih.gov/genome/?term=txidFOOBAR[Organism:exp]
Than go one the right side to genome. Hit the link in the first line:
Reference genome: FOOBAR
And voila you are there. And the question is how do I do that automatically? I mean I can write a scrip that greps the right link form the HTMLs. But that is ridiculous, there should be a better way.
See if you find @5heikki's solution usable. You can always send a ticket in to NCBI's help desk asking the question above. There is always knowledge that may not have been documented anywhere that the owners know of.