How can you *consistently* download BioProject IDs from NCBI's BioSample database using Entrez Direct?
4.4 years ago
millere • 0

I am trying to download records from NCBI's BioSample database using Entrez Direct, and I'm having trouble getting the BioProject ID(s) for some, but not all, BioSample records. Sometimes the BioProject ID is in the "Links" block of the XML record, which prompted me to write the following:

esearch -db biosample -query SAMN04362913 | efetch -format docsum | xtract -pattern BioSample \
-SRA "(NA)" \
-block Id -if Id@db -equals "SRA" -SRA Id \
-block Ids -first Id -element "&SRA" \
-DATE "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "collection_date" -DATE Attribute \
-block Attributes -element "&DATE" \
-LOC "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "geo_loc_name" -LOC Attribute \
-block Attributes -element "&LOC" \
-HOST "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "host" -HOST Attribute \
-block Attributes -element "&HOST" \
-block Link -if Link@target -equals "bioproject" -tab "/" -element Link@label

This gives the following output:

SAMN04362913 SRS1219238 None United Kingdom: None Homo sapiens PRJNA248792

However, I've discovered that this doesn't work for all BioSamples. Specifically, all BioSamples starting with "SAME" (from ENA/EBI) and some BioSamples starting with "SAMD" (from DDBJ) do not output the BioProject ID(s). For example, on NCBI's BioSample SAMEA5548256 webpage, the BioProject ID is listed as PRJEB30317, but when I run the above code, I get the following:

SAMEA5548256 ERS3350306 NA NA NA

Upon closer inspection, it appears that the "Links" block of the XML object is missing entirely despite a BioProject ID being present on the website.
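
One way to confirm which records lack the block is to parse the BioSample XML directly and list the BioProject labels per accession. This is a minimal sketch, not a fix: the two records below are hand-written and heavily trimmed, modelled only on the fields the xtract command above refers to (Ids, Links, Link@target, Link@label), and the `accession` attribute on `BioSample` is an assumption about the record layout. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed records for illustration only; real efetch
# output has many more fields. Accessions are the ones quoted in the question.
SAMPLE_XML = """
<BioSampleSet>
  <BioSample accession="SAMN04362913">
    <Ids><Id db="SRA">SRS1219238</Id></Ids>
    <Links><Link target="bioproject" label="PRJNA248792"/></Links>
  </BioSample>
  <BioSample accession="SAMEA5548256">
    <Ids><Id db="SRA">ERS3350306</Id></Ids>
  </BioSample>
</BioSampleSet>
"""


def bioproject_labels(biosample):
    """Return BioProject labels from the Links block, or [] if there is none."""
    return [
        link.get("label")
        for link in biosample.findall("./Links/Link")
        if link.get("target") == "bioproject" and link.get("label")
    ]


def report_missing_links(xml_text):
    """Map each accession to its BioProject labels (empty list if absent)."""
    root = ET.fromstring(xml_text)
    return {
        bs.get("accession"): bioproject_labels(bs)
        for bs in root.iter("BioSample")
    }


if __name__ == "__main__":
    for acc, projects in report_missing_links(SAMPLE_XML).items():
        # Join multiple projects with "/" as the xtract command does.
        print(acc, "/".join(projects) or "NA")
```

Records that come back with an empty list are the ones that need a BioProject ID from some other source.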

Does anyone know why the "Links" block is absent from the XML record for some samples? Is there a way around this so I can pull the BioProject ID(s) for any sample?

Any help would be much appreciated! Thank you!

UPDATE: I can't use the SRA database instead of the BioSample database because 1) I want a bunch of sample collection metadata that's only included in the BioSample database (e.g. host, collection date, etc.) and 2) many samples I want info on do not have corresponding entries in the SRA database (e.g. SAMN10656824). I had shortened the code above for ease of reading, but I updated it to reflect my need for sample metadata.

XML ncbi entrez-direct • 2.3k views
4.4 years ago
GenoMax 147k

How about the following?

$ esearch -db sra -query "SAMEA5548256" | efetch -format runinfo -mode xml | xtract -pattern SraRunInfo -element BioProject
PRJEB30317
$ esearch -db sra -query "SAMN04362913" | efetch -format runinfo -mode xml | xtract -pattern SraRunInfo -element BioProject
PRJNA248792
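
For many samples at once, the same SRA route can be used to build a BioSample-to-BioProject lookup from the CSV form of runinfo. This is a rough sketch and only covers samples that have SRA runs. It assumes the CSV has `BioSample` and `BioProject` columns; the text below is a hand-written stand-in with placeholder run names, and `biosample_to_bioprojects` is a name invented for this example.

```python
import csv
import io

# Hand-written stand-in for runinfo CSV output. Real output has many more
# columns; the run names here are placeholders, not real accessions.
RUNINFO_CSV = """Run,BioProject,BioSample
RUN_A,PRJEB30317,SAMEA5548256
RUN_B,PRJNA248792,SAMN04362913
RUN_C,PRJNA248792,SAMN04362913
"""


def biosample_to_bioprojects(runinfo_text):
    """Return {BioSample accession: [BioProject IDs]} from runinfo CSV text."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(runinfo_text)):
        sample = (row.get("BioSample") or "").strip()
        project = (row.get("BioProject") or "").strip()
        # Skip blank rows and any header line repeated inside the data
        # (an assumption about how batched downloads may be concatenated).
        if not sample or not project or sample == "BioSample":
            continue
        projects = mapping.setdefault(sample, [])
        if project not in projects:
            projects.append(project)
    return mapping


if __name__ == "__main__":
    for sample, projects in biosample_to_bioprojects(RUNINFO_CSV).items():
        print(sample, "/".join(projects), sep="\t")
```

The resulting dictionary could then fill in the BioProject column for BioSample records whose "Links" block is missing.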

That would work, except that 1) I want a bunch of sample collection metadata that's only included in the BioSample database and 2) many samples I'm looking at do not have corresponding entries in the SRA database.


Can you post examples? Also what kind of metadata are you looking at?


I just updated my post. My apologies for leaving it out originally!


The FULL code I'm currently using is:

esearch -db biosample -query "Infantis AND Salmonella enterica [ORGN]" | efetch -format docsum | xtract -pattern BioSample \
-NAME "(NA)" \
-block Id -if Id@db_label -equals "Sample name" -NAME Id \
-block Ids -element "&NAME" \
-CFSAN "(NA)" \
-block Id -if Id@db -equals "CFSAN" -CFSAN Id \
-block Ids -element "&CFSAN" \
-SRA "(NA)" \
-block Id -if Id@db -equals "SRA" -SRA Id \
-block Ids -first Id -element "&SRA" \
-STRAIN "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "strain" -STRAIN Attribute \
-block Attributes -element "&STRAIN" \
-ISOLATE "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "isolate" -ISOLATE Attribute \
-block Attributes -element "&ISOLATE" \
-ALIAS "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "isolate_name_alias" -ALIAS Attribute \
-block Attributes -element "&ALIAS" \
-SEROVAR "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "serovar" -SEROVAR Attribute \
-block Attributes -element "&SEROVAR" \
-SEROTYPE "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "serotype" -SEROTYPE Attribute \
-block Attributes -element "&SEROTYPE" \
-DATE "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "collection_date" -DATE Attribute \
-block Attributes -element "&DATE" \
-LATLON "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "lat_lon" -LATLON Attribute \
-block Attributes -element "&LATLON" \
-LOC "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "geo_loc_name" -LOC Attribute \
-block Attributes -element "&LOC" \
-HOST "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "host" -HOST Attribute \
-block Attributes -element "&HOST" \
-SOURCE "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "isolation_source" -SOURCE Attribute \
-block Attributes -element "&SOURCE" \
-PACKAGE "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "attribute_package" -PACKAGE Attribute \
-block Attributes -element "&PACKAGE" \
-IFSAC "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "IFSAC+ Category" -IFSAC Attribute \
-block Attributes -element "&IFSAC" \
-FOOD "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "FoodOn Ontology Term" -FOOD Attribute \
-block Attributes -element "&FOOD" \
-LAB "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "collected_by" -LAB Attribute \
-block Attributes -element "&LAB" \
-block Link -if Link@target -equals "bioproject" -tab "/" -element Link@label
6 months ago
Ash ▴ 10

I know this was asked several years ago, but finding this post helped me solve my own problem and in the process I solved this one too.

The issue is that samples from ENA do not always share an "attribute_name" value. What you need is "harmonized_name", which is consistent across all samples (that I've seen so far, anyway).

So where you have:

-STRAIN "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "strain" -STRAIN Attribute \
-block Attributes -element "&STRAIN" \

You should change that to:

-STRAIN "(NA)" \
-block Attribute -if Attribute@harmonized_name -equals "strain" -STRAIN Attribute \
-block Attributes -element "&STRAIN" \

For some attributes you may also need to change the -equals "x" part if the harmonized_name does not match the attribute_name, but for strain in particular the value is the same.

Here's how I did it in Python -- Entrez Direct needs to be on $PATH.
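
The snippet itself is not shown here. What follows is a minimal sketch of the harmonized_name idea, not the code referred to above. It assumes the BioSample XML has already been fetched with Entrez Direct and is available as text, so the fetching step is left out. The records are hand-written and trimmed, the "Strain Name" and "collection date" attribute names are invented to mimic submitter-specific naming, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed records for illustration only. SAMPLE_B mimics
# submitter-specific attribute_name values that differ from harmonized_name.
SAMPLE_XML = """
<BioSampleSet>
  <BioSample accession="SAMPLE_A">
    <Attributes>
      <Attribute attribute_name="strain" harmonized_name="strain">strain-1</Attribute>
      <Attribute attribute_name="collection_date" harmonized_name="collection_date">2015</Attribute>
    </Attributes>
  </BioSample>
  <BioSample accession="SAMPLE_B">
    <Attributes>
      <Attribute attribute_name="Strain Name" harmonized_name="strain">strain-2</Attribute>
      <Attribute attribute_name="collection date" harmonized_name="collection_date">2016</Attribute>
      <Attribute attribute_name="custom_field">x</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>
"""

FIELDS = ["strain", "collection_date", "host"]


def attribute_value(biosample, name, missing="NA"):
    """Look up an attribute by harmonized_name, falling back to attribute_name."""
    attrs = biosample.findall("./Attributes/Attribute")
    for key in ("harmonized_name", "attribute_name"):
        for attr in attrs:
            text = (attr.text or "").strip()
            if attr.get(key) == name and text:
                return text
    return missing


def rows(xml_text, fields=FIELDS):
    """Yield one list per BioSample: accession followed by the requested fields."""
    root = ET.fromstring(xml_text)
    for bs in root.iter("BioSample"):
        yield [bs.get("accession")] + [attribute_value(bs, f) for f in fields]


if __name__ == "__main__":
    for row in rows(SAMPLE_XML):
        print("\t".join(row))
```

Checking harmonized_name first and attribute_name second keeps the lookup working for attributes that have no harmonized form, while still returning "NA" when neither matches.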
