How to retrieve metadata for multiple GenBank records
25 days ago
SushiRoll ▴ 140

Hey everyone!

I'm working with a set of assembled genomes retrieved from GenBank (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=287&annotated_only=true&refseq_annotation=true). I downloaded the sequences easily, but now I would like a file with their metadata, especially collection date and isolation source. The "Select columns" button lets me add some additional metadata, but it is mostly related to sequencing and assembly parameters. Is there a simple way to retrieve the values I need?

Thanks!

metadata Genbank • 435 views
25 days ago
GenoMax 153k

You can use EntrezDirect. The output is a comma-separated list of key,value pairs.

$ esearch -db assembly -query GCF_023093935 | elink -target biosample | efetch -format xml | xtract -pattern BioSample -element accession -block Attributes -group Attribute -element Attribute@attribute_name Attribute |  tr '\t' ',' 
strain,34Pae36,collected_by,LGMB,collection_date,2017-04,geo_loc_name,Colombia: Bogota,host,Homo sapiens,host_disease,Bacterial infectious disease,isolation_source,collection,lat_lon,4.70 N 74.10 W

$ esearch -db assembly -query GCF_026727755 | elink -target biosample | efetch -format xml | xtract -pattern BioSample -element accession -block Attributes -group Attribute -element Attribute@attribute_name Attribute |  tr '\t' ',' 
strain,C4.2,isolation_source,rubber door sealing of a washing machine,collection_date,2019-10-22,geo_loc_name,Germany: Bielefeld,sample_type,pure culture,biomaterial_provider,Kaltschmidt Lab, Bielefeld University, Germany,collected_by,Ehsan Asghari, Christian Kaltschmidt,identified_by,Ehsan Asghari, Annika Kiel
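
Note that some values in the second example contain commas themselves (e.g. "Kaltschmidt Lab, Bielefeld University, Germany"), so the comma-joined line is hard to split reliably afterwards. One possible workaround, sketched here under assumptions, is to drop the final tr step and pick the wanted attributes out of the tab-separated xtract output with awk. pick_fields is a hypothetical helper name (not part of EDirect), and the sketch assumes every attribute name is immediately followed by its value on the same line:

```shell
# pick_fields: hypothetical helper, not part of EDirect.
# Reads tab-separated key/value lines on stdin (the xtract output WITHOUT
# the final "tr '\t' ','" step) and prints collection_date and
# isolation_source, tab-separated, one line per input record.
pick_fields() {
  awk -F '\t' -v OFS='\t' '{
    date = "NA"; src = "NA"          # placeholders for missing attributes
    # Scan every field so that an extra leading column does not shift
    # the key/value pairing; the value is the field right after the key.
    for (i = 1; i < NF; i++) {
      if ($i == "collection_date")  date = $(i + 1)
      if ($i == "isolation_source") src  = $(i + 1)
    }
    print date, src
  }'
}
```

Usage would be to append `| pick_fields` to the pipeline in place of `| tr '\t' ','`. Records lacking one of the two attributes get NA in that column.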

You can use the epost solution for a list of multiple IDs (one per line in a file), but it may generate error lines for samples that have no information and/or if the queries happen too quickly.

$ epost -db assembly -input id_file | elink -target biosample | efetch -format xml | xtract -pattern BioSample -element accession -block Attributes -group Attribute -element Attribute@attribute_name Attribute |  tr '\t' ',' 
Excellent, I went for the esearch solution since I expect information to be missing for most of the samples and don't want to deal with that mess afterwards. I just implemented a for loop over the query IDs; maybe not very elegant, but it let me retrieve the metadata for multiple samples.
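
For anyone landing here later, a loop of that kind might look roughly like the sketch below. It is not the exact script used. fetch_one, fetch_all, the PAUSE variable, and the file name ids.txt are all illustrative assumptions, and the one-second default pause is a guess at a polite request rate rather than a documented limit. The esearch pipeline itself is the one from the answer above, minus the tr step.

```shell
# fetch_one / fetch_all are hypothetical helper names, not EDirect commands.

# Run the esearch pipeline from the answer for a single assembly accession.
# stdin is redirected from /dev/null so the pipeline cannot consume the ID
# list that the surrounding while-read loop is reading.
fetch_one() {
  esearch -db assembly -query "$1" </dev/null |
    elink -target biosample |
    efetch -format xml |
    xtract -pattern BioSample -element accession -block Attributes \
      -group Attribute -element Attribute@attribute_name Attribute
}

# Loop over a file with one accession per line and prefix each result with
# the accession, so samples without metadata still produce a row (NA).
fetch_all() {
  while IFS= read -r id || [ -n "$id" ]; do
    [ -z "$id" ] && continue           # skip blank lines
    meta=$(fetch_one "$id")
    printf '%s\t%s\n' "$id" "${meta:-NA}"
    sleep "${PAUSE:-1}"                # assumed pause between queries
  done < "$1"
}
```

With an ID file named ids.txt (an assumed name), `fetch_all ids.txt > metadata.tsv` would write one tab-separated row per accession.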

Thanks a lot!
