I am trying to download records from NCBI's BioSample database using Entrez Direct. I'm having trouble getting the BioProject ID(s) associated with some, but not all, BioSample records. I've found that the BioProject ID is sometimes listed in the "Links" block of the XML record, which prompted me to write the following:
esearch -db biosample -query SAMN04362913 | efetch -format docsum | xtract -pattern BioSample \
-SRA "(NA)" \
-block Id -if Id@db -equals "SRA" -SRA Id \
-block Ids -first Id -element "&SRA" \
-DATE "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "collection_date" -DATE Attribute \
-block Attributes -element "&DATE" \
-LOC "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "geo_loc_name" -LOC Attribute \
-block Attributes -element "&LOC" \
-HOST "(NA)" \
-block Attribute -if Attribute@attribute_name -equals "host" -HOST Attribute \
-block Attributes -element "&HOST" \
-block Link -if Link@target -equals "bioproject" -tab "/" -element Link@label
This gives me the following output:
SAMN04362913 SRS1219238 None United Kingdom: None Homo sapiens PRJNA248792
However, I've discovered that this doesn't work for all BioSamples. Specifically, all BioSamples starting with "SAME" (from ENA/EBI) and some BioSamples starting with "SAMD" (from DDBJ) do not output the BioProject ID(s). For example, on NCBI's BioSample SAMEA5548256 webpage, the BioProject ID is listed as PRJEB30317, but when I run the above code, I get the following:
SAMEA5548256 ERS3350306 NA NA NA
Upon closer inspection, the "Links" block is missing entirely from the XML record for this sample, even though a BioProject ID is shown on the website.
Does anyone know why the "Links" block is absent from the XML record for some samples? Is there a way around this so that I can pull the BioProject ID(s) for any sample?
Any help would be much appreciated! Thank you!
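One possible workaround, offered only as an untested sketch: instead of reading the "Links" block, follow the Entrez link from BioSample to BioProject with elink. The function name biosample_to_bioproject is made up for illustration, the accession check is a loose guess at the SAMN/SAMEA/SAMD shapes, and I am assuming the BioProject docsum exposes the accession as Project_Acc. Whether a link exists for every ENA- or DDBJ-derived sample is not something this confirms.

```shell
# Hypothetical helper (not part of Entrez Direct): print the BioProject
# accession(s) linked to one BioSample accession.
biosample_to_bioproject() {
    # Loose sanity check on the accession shape (assumed: SAMN…, SAMEA…, SAMD…).
    case "$1" in
        SAMN[0-9]*|SAME[A-Z][0-9]*|SAMD[0-9]*) ;;
        *) echo "not a BioSample accession: $1" >&2; return 2 ;;
    esac

    # Entrez Direct must be installed and on PATH.
    if ! command -v esearch >/dev/null 2>&1; then
        echo "Entrez Direct (esearch) not found on PATH" >&2
        return 127
    fi

    # Follow BioSample -> BioProject links instead of parsing <Links>.
    # Assumption: the BioProject docsum carries the accession in Project_Acc.
    esearch -db biosample -query "$1" \
        | elink -target bioproject \
        | efetch -format docsum \
        | xtract -pattern DocumentSummary -element Project_Acc
}
```

It would be called as `biosample_to_bioproject SAMEA5548256`. The BioSample metadata would still come from the original xtract command, with this lookup used only to fill in the project column when the "Links" block is empty.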
UPDATE: I can't use the SRA database instead of the BioSample database because 1) I want a bunch of sample collection metadata that's only included in the BioSample database (e.g. host, collection date, etc.) and 2) many samples I want info on do not have corresponding entries in the SRA database (e.g. SAMN10656824). I had shortened the code above for ease of reading, but I updated it to reflect my need for sample metadata.
That would work except that 1) I want a bunch of sample collection metadata that's only included in the BioSample database and 2) many samples I'm looking at do not have corresponding entries in the SRA database.
Can you post examples? Also what kind of metadata are you looking at?
I just updated my post. My apologies for leaving it out originally!
The FULL code I'm currently using is: