Question

How to retrieve metadata for multiple genbank records

25 days ago

SushiRoll ▴ 140

Hey everyone!

I'm working with a set of assembled genomes retrieved from GenBank (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=287&annotated_only=true&refseq_annotation=true). I've easily downloaded the sequences, but now I would like to get a file with their metadata, especially collection date and isolation source. The "Select columns" button lets me add some additional metadata, but it is mostly related to sequencing and assembly parameters. Is there a simple way to retrieve the values I need?

Thanks!

metadata Genbank • 435 views

score 1 · Accepted Answer · 2025-08-06

You can do this with EntrezDirect. The output for each record is a comma-separated list of key,value pairs.

$ esearch -db assembly -query GCF_023093935 | elink -target biosample | efetch -format xml | xtract -pattern BioSample -element accession -block Attributes -group Attribute -element Attribute@attribute_name Attribute |  tr '\t' ',' 
strain,34Pae36,collected_by,LGMB,collection_date,2017-04,geo_loc_name,Colombia: Bogota,host,Homo sapiens,host_disease,Bacterial infectious disease,isolation_source,collection,lat_lon,4.70 N 74.10 W

$ esearch -db assembly -query GCF_026727755 | elink -target biosample | efetch -format xml | xtract -pattern BioSample -element accession -block Attributes -group Attribute -element Attribute@attribute_name Attribute |  tr '\t' ',' 
strain,C4.2,isolation_source,rubber door sealing of a washing machine,collection_date,2019-10-22,geo_loc_name,Germany: Bielefeld,sample_type,pure culture,biomaterial_provider,Kaltschmidt Lab, Bielefeld University, Germany,collected_by,Ehsan Asghari, Christian Kaltschmidt,identified_by,Ehsan Asghari, Annika Kiel
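Note that some values contain commas themselves (the biomaterial_provider and collected_by values in the second record above), so splitting the comma-joined line can be ambiguous. A minimal sketch of one way around this is to skip the `tr` step, keep the tab-separated output of xtract, and pick out only the keys you need with awk. `pick_fields` is a hypothetical helper name, not part of EntrezDirect, and it assumes each line consists only of key/value pairs, as in the output shown above.

```shell
# pick_fields: hypothetical helper, not part of EntrezDirect.
# Reads tab-separated "key<TAB>value<TAB>key<TAB>value..." lines on stdin
# (assumed layout: pairs only, no leading column) and prints the values of
# the keys given as arguments, tab-separated, with "NA" for missing keys.
pick_fields() {
  awk -F '\t' -v keys="$*" '
    BEGIN { n = split(keys, want, " ") }
    {
      split("", val)                       # portable way to clear the array
      for (i = 1; i + 1 <= NF; i += 2)     # walk the line pair by pair
        val[$i] = $(i + 1)
      out = ""
      for (k = 1; k <= n; k++) {
        v = (want[k] in val) ? val[want[k]] : "NA"
        out = out (k > 1 ? "\t" : "") v
      }
      print out
    }'
}

# Usage (same pipeline as above, without the final tr):
#   esearch -db assembly -query GCF_023093935 | elink -target biosample |
#     efetch -format xml |
#     xtract -pattern BioSample -element accession -block Attributes \
#       -group Attribute -element Attribute@attribute_name Attribute |
#     pick_fields collection_date isolation_source
```

Because the separator stays a tab, values such as "Kaltschmidt Lab, Bielefeld University, Germany" come through intact.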

For a list of multiple IDs (one per line in a file), you can use epost instead, but this may produce error lines for samples that have no information and/or if the queries are sent too quickly.

$ epost -db assembly -input id_file | elink -target biosample | efetch -format xml | xtract -pattern BioSample -element accession -block Attributes -group Attribute -element Attribute@attribute_name Attribute |  tr '\t' ','
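If the epost route gives errors, one alternative is to run the single-accession pipeline from above once per ID, with a short pause between queries, and to report the IDs that return nothing. This is only a sketch: `fetch_attrs`, `fetch_all` and the `PAUSE` variable are hypothetical names, `id_file` is assumed to hold one assembly accession (e.g. GCF_023093935) per line, and the one-second default pause is a conservative guess rather than a documented requirement.

```shell
# fetch_attrs: hypothetical wrapper around the pipeline from the answer,
# run for a single assembly accession passed as $1 (tabs are kept).
fetch_attrs() {
  esearch -db assembly -query "$1" |
    elink -target biosample |
    efetch -format xml |
    xtract -pattern BioSample -element accession -block Attributes \
      -group Attribute -element Attribute@attribute_name Attribute
}

# fetch_all: reads accessions (one per line) from the file given as $1,
# prints "accession<TAB>key<TAB>value..." for each record found, and
# reports accessions without any output on stderr.
fetch_all() {
  while IFS= read -r acc; do
    [ -n "$acc" ] || continue              # skip blank lines
    # </dev/null keeps the pipeline from swallowing the rest of the ID list
    attrs=$(fetch_attrs "$acc" </dev/null)
    if [ -n "$attrs" ]; then
      printf '%s\t%s\n' "$acc" "$attrs"
    else
      printf 'no BioSample attributes for %s\n' "$acc" >&2
    fi
    sleep "${PAUSE:-1}"                    # pause between queries
  done < "$1"
}

# Usage:
#   fetch_all id_file > metadata.tsv 2> failed_ids.log
```

Prefixing each line with the assembly accession makes it easy to match the metadata back to the downloaded genomes, and the stderr log shows which samples had no information.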