I'm currently trying to automate a search that works as follows:
1) Get all genomes on NCBI for a given organism, restricted to RefSeq and so on. I'm doing that with Biopython and Entrez:
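from Bio import Entrez
Entrez.email = "you@example.org"  # placeholder: NCBI asks for a contact address
query = "Microbacterium[Organism] AND latest_refseq[filter] NOT partial[filter]"
handle = Entrez.esearch(term=query, db="Assembly", retmax=600)
ids = Entrez.read(handle)["IdList"]
handle.close()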
This returns 513 genomes.
2) The second part would be to add another filter to keep only the assemblies linked to a published paper, but I have no idea how to do that. Unfortunately, I can't find any useful publication-related tags for the Assembly database in Biopython, but maybe other ways exist? I'm open to any approach, not just Python/Biopython.
Thanks!
You can do this with EntrezDirect. It may not work for every accession. In your case, only 41 results from the original search seem to be linked to a paper.
$ esearch -db assembly -query "Microbacterium[Organism] AND latest_refseq[filter] NOT partial[filter]" | elink -target pubmed | esummary | xtract -pattern DocumentSummary -element Id,Title,Value
35331789 Genome sequencing of a novel Microbacterium camelliasinensis CIAB417 identified potential mannan hydrolysing enzymes. 35331789 10.1016/j.ijbiomac.2022.03.093 S0141-8130(22)00563-3
34371613 Identification of Plant Growth Promoting Rhizobacteria That Improve the Performance of Greenhouse-Grown Petunias under Low Fertility Conditions. 34371613 PMC8309264 pmc-id: PMC8309264; 10.3390/plants10071410 plants10071410
34225488 Poor Competitiveness of Bradyrhizobium in Pigeon Pea Root Colonization in Indian Soils. 34225488 PMC8406239 pmc-id: PMC8406239; 10.1128/mBio.00423-21
34022615 Bacteria of eleven different species isolated from biofilms in a meat processing environment have diverse biofilm forming abilities. 34022615 10.1016/j.ijfoodmicro.2021.109232 S0168-1605(21)00191-4
33578887 Comparative Metabologenomics Analysis of Polar Actinomycetes. 33578887 PMC7916644 pmc-id: PMC7916644; 10.3390/md19020103 md19020103
Note: While I show the example as a one-step search, you may need to do this in two steps if you want to keep track of the accession numbers and of which ones actually produce a result. Piping in EntrezDirect does not keep track of the query across the pipes, so you will need to do this yourself.
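If you would rather stay in Biopython, the two-step idea from the note above could be sketched roughly as follows. This is only a sketch under some assumptions: the function names `pubmed_links_by_assembly` and `fetch_linksets` are made up for illustration, the e-mail address is a placeholder, and it assumes that `Entrez.elink` sends a list of ids as separate `id` parameters so that NCBI returns one LinkSet per assembly (which, as far as I know, is how recent Biopython versions behave). The helper refuses any LinkSet that covers more than one id, so a merged reply cannot silently attribute papers to the wrong assembly.

```python
from collections import defaultdict


def pubmed_links_by_assembly(linksets):
    """Map each assembly UID to the list of PubMed IDs linked to it.

    `linksets` is a parsed ELink result: a list of LinkSet records, each with
    an "IdList" (the source UID) and zero or more "LinkSetDb" entries.
    Assemblies without any linked paper are kept, with an empty list.
    """
    links = defaultdict(list)
    for linkset in linksets:
        source_ids = [str(uid) for uid in linkset["IdList"]]
        if len(source_ids) != 1:
            # A merged LinkSet cannot tell us which assembly owns which paper.
            raise ValueError("expected exactly one source id per LinkSet")
        uid = source_ids[0]
        links[uid]  # create the entry even if there are no links
        for linksetdb in linkset.get("LinkSetDb", []):
            links[uid].extend(str(link["Id"]) for link in linksetdb["Link"])
    return dict(links)


def fetch_linksets(assembly_uids, email):
    """Hypothetical helper: ask ELink for assembly -> pubmed links."""
    # Imported here so the parsing helper above works without Biopython.
    from Bio import Entrez

    Entrez.email = email  # placeholder address supplied by the caller
    handle = Entrez.elink(dbfrom="assembly", db="pubmed", id=assembly_uids)
    try:
        return Entrez.read(handle)
    finally:
        handle.close()
```

With the `ids` list from the esearch in the question, something like `pubmed_links_by_assembly(fetch_linksets(ids, "you@example.org"))` should give a dict in which assemblies with no linked paper map to an empty list. Note that the keys are Assembly UIDs rather than GCF accessions, so you would still need an esummary call if you want the accession numbers.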
How to download all Pseudomonas aeruginosa Genomes from NCBI Genomes database?
How to download specific genomes
How to download genome assemblies from NCBI with a list of GCA identifiers?
downloading genomes in fasta format from accession ids
Thanks, but I don't need help with the first part. I know how to download my genomes, but not how to add information about whether each genome is linked to a published paper or not.