Question

Fetch relevant metadata from accession numbers

2

5.7 years ago

lucslapping ▴ 20

I was wondering what ways are available for getting metadata from accession numbers. I have seen other tools, such as Nextstrain, use a so-called "metadata" file to describe the sequences used. The file looks something like this (image: metadata for sequences):

https://imgur.com/a/uL3m7T5

It shows various data from NCBI for the accession numbers, such as virus strain, country, date, URL, etc. For me the most important ones are strain, country, and date. Are there ways to download such data automatically when you have a list of accession numbers?

Any help is appreciated.

R ncbi metadata accession number nextstrain • 2.6k views

ADD COMMENT • link updated 5.7 years ago by GenoMax 152k • written 5.7 years ago by lucslapping ▴ 20

1

5.7 years ago

JC 13k

Use Entrez Direct tools

ADD COMMENT • link 5.7 years ago by JC 13k

score 3 · Accepted Answer · 2019-11-21

3

5.7 years ago

GenoMax 152k

Using Entrez Direct:

$ esearch -db nuccore -query "KY317939" | esummary | xtract -pattern DocumentSummary -element SubName
ZIKV/Homo_sapiens/Colombia/2016/ZC204Se|Homo sapiens|Colombia|serum|06-Jan-2016|Antibody Systems Inc

The fields returned above, separated by |, are:

isolate|host|country|isolation_source|collection_date|collected_by
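To make the mapping between names and values explicit, here is a minimal sketch that pairs the two pipe-separated lines shown above. The helper name `pair_fields` is made up for illustration, and the strings are hard-coded from the example so no network access is needed.

```shell
# Field names and values copied from the KY317939 example above.
names='isolate|host|country|isolation_source|collection_date|collected_by'
values='ZIKV/Homo_sapiens/Colombia/2016/ZC204Se|Homo sapiens|Colombia|serum|06-Jan-2016|Antibody Systems Inc'

# pair_fields (hypothetical helper): $1 = pipe-separated names,
# $2 = pipe-separated values. Prints one "name<TAB>value" line per field.
pair_fields() {
    printf '%s\n%s\n' "$1" "$2" | awk -F'|' '
        NR == 1 { n = split($0, name, "|"); next }   # remember the names
        { for (i = 1; i <= n; i++) printf "%s\t%s\n", name[i], $i }'
}

pair_fields "$names" "$values"
```

This only works when the names and values are in the same order, which holds within a single record.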

ADD COMMENT • link 5.7 years ago by GenoMax 152k

0

Thank you, this brings up some of the fields I mentioned. However, is there a way I can submit a list of accession numbers and save the output to a CSV, TSV, or TXT file?

ADD REPLY • link 5.7 years ago by lucslapping ▴ 20

0

Use epost with your accession numbers of interest in a file (one per line).

$ epost -db nuccore -format acc -input acc | esummary | xtract -pattern DocumentSummary -element Caption,SubName | sed 's/|/,/g'
MF574578        ZIKV/Homo sapiens/COL/PRV_00028/2015,Homo sapiens,C6/36 cell-derived; 5 passages in Vero followed by one passage in C6/36; passage history: Vero (5), C6/36 (1),Asian,Colombia: Barranquilla,Dec-2016
MF574562        ZIKV/Homo sapiens/COL/FLR_00008/2015,Homo sapiens,Vero cell-derived; 3 passages in C6/36 followed by 4 pasages in Vero; passage history: C6/36 (3), Vero (4),Asian,Colombia: Barranquilla,Dec-2015
KY558989        ZIKV/Homo_sapiens/Brazil/2015/ZBRA105,Homo sapiens,Asian,Brazil: Joao Camara, Rio Grande do Norte,23-Feb-2015,ZiBRA team
KY317939        ZIKV/Homo_sapiens/Colombia/2016/ZC204Se,Homo sapiens,Colombia,serum,06-Jan-2016,Antibody Systems Inc

ADD REPLY • link 5.7 years ago by GenoMax 152k

0

Thanks again, this worked for me. However, some records appear to be in the wrong order in my case. Could this be due to mistakes in the database?

ADD REPLY • link 5.7 years ago by lucslapping ▴ 20

0

What do you mean by wrong order? Can you provide an example? We are doing a direct database query, so the information should be what is in the db.

ADD REPLY • link 5.7 years ago by GenoMax 152k

0

I have an input text file with accession numbers, and here is what the first few lines look like:

MK419834.1
MK230890.1
MK230891.1
MK230892.1
MK230893.1

In the output CSV file I see that some entries don't have all 6 fields that you specified:

isolate|host|country|isolation_source|collection_date|collected_by

For example, some entries have only 4 of those 6 fields. For certain entries the country is in the second column and the host is in the third column, which is a different order from most entries in the output file. I would like each value to end up in the right column.

ADD REPLY • link 5.7 years ago by lucslapping ▴ 20

0

Unfortunately, it is possible that blank fields in some of those records are messing up the output. You could leave the output as is, bring the data into Excel (splitting records on |), and then check whether the fields stay aligned.
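An alternative to checking by hand is to place values by name instead of by position. The sketch below assumes each input line holds three tab-separated columns: the accession, the pipe-separated field names, and the pipe-separated values. As far as I know the names are in the `SubType` element of the nuccore summary, so something like `xtract -pattern DocumentSummary -element Caption SubType SubName` should produce that layout, but check it against your own output. The function name `align_fields` and the second sample record (`EXAMPLE1`) are made up for illustration.

```shell
# Sample input: accession <TAB> field names <TAB> field values.
# The first record is the KY317939 example from this thread; the second
# is a hypothetical record with fewer fields in a different order.
sample_records() {
    printf '%s\t%s\t%s\n' 'KY317939' \
        'isolate|host|country|isolation_source|collection_date|collected_by' \
        'ZIKV/Homo_sapiens/Colombia/2016/ZC204Se|Homo sapiens|Colombia|serum|06-Jan-2016|Antibody Systems Inc'
    printf '%s\t%s\t%s\n' 'EXAMPLE1' \
        'country|host|collection_date' \
        'Brazil|Homo sapiens|2015'
}

# align_fields (hypothetical helper): writes a TSV with a fixed set of
# columns, leaving a cell empty when a record lacks that field.
align_fields() {
    awk -F'\t' -v cols='isolate|host|country|isolation_source|collection_date|collected_by' '
        BEGIN {
            OFS = "\t"
            n = split(cols, want, "|")
            header = "accession"
            for (i = 1; i <= n; i++) header = header OFS want[i]
            print header
        }
        {
            split("", val)                  # clear values from the previous record
            k = split($2, key, "|")         # field names for this record
            split($3, v, "|")               # field values for this record
            for (i = 1; i <= k; i++) val[key[i]] = v[i]
            line = $1
            for (i = 1; i <= n; i++)
                line = line OFS ((want[i] in val) ? val[want[i]] : "")
            print line
        }'
}

sample_records | align_fields
```

Tabs are used as the output separator because some values in the earlier output contain commas. Edit the `cols` list to choose which fields to keep and in what order.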

ADD REPLY • link 5.7 years ago by GenoMax 152k