I saw the following post that targets the Pseudomonas genome sequences. I would like to extract all Biosample IDs and their corresponding isolation sources from NCBI. Is that possible with bash using esearch, esummary and/or xtract? Does anyone know a script for this purpose?
While it may be possible, it is likely not practical for all IDs. You may want to check the BioSample file that NCBI makes available here (large file!) to see if you can pare it down to a smaller set of IDs, and then use the answer from the thread you linked above.
Thank you for your quick comment! It turns out that the file is very large, so extracting everything from it is not realistic. Sorry to bother you again, but do you know how to get the BioSample isolation sources of all environmental metagenomes? I still don't know how to specify the db and can't get any data. If possible, I would appreciate an example script. I am sorry that I am not familiar with Linux-based analysis.
You can use EntrezDirect. The following is a rough start. It would be challenging to deal with a query as broad as "metagenome", since there are 1537782 hits as of today.
$ esearch -db biosample -query "metagenome" | esummary | xtract -pattern DocumentSummary -first Title -element Accession -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute
MIMS Environmental/Metagenome sample from mouse gut metagenome SAMN15337269
MIMS Environmental/Metagenome sample from mouse gut metagenome SAMN15337268
Metagenome or environmental sample from metagenome SAMN15337148
Metagenome or environmental sample from metagenome SAMN15337147
Metagenome or environmental sample from metagenome SAMN15337146
Metagenome or environmental sample from metagenome SAMN15337145
Metagenome or environmental sample from metagenome SAMN15337144
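Before pulling summaries for a broad query, it may help to check how many records it matches. This is a minimal sketch, assuming EDirect is installed and on your PATH; `count_hits` is only a hypothetical helper name, not part of EDirect.

```shell
#!/bin/sh
# Sketch: report how many BioSample records match a query,
# without downloading any summaries.
# count_hits is a hypothetical helper name chosen for illustration.
count_hits() {
    # esearch emits a small ENTREZ_DIRECT XML message that includes <Count>.
    esearch -db biosample -query "$1" |
        xtract -pattern ENTREZ_DIRECT -element Count
}
```

Calling `count_hits "metagenome"` should print a single number. If it is in the millions, narrowing the query first is probably worthwhile.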
I really appreciate your great help! In fact, the script worked fine on my computer, but when I got the data from 169,159 biosamples, I got the following messages:
Thank you for your continued help! I understood that I have to create an API key. Sorry to bother you again, but does this mean I need to insert "-api_key ???" in the script above?
So, in order to retrieve only the biosample information I need, I tried the following script with a test file, but it didn't work either (there was only one result...). Is the problem in my script this time as well? I would appreciate it if you could check it when you have time.
$ cat biosample.id.list | epost -db biosample -format acc | esummary | xtract -pattern DocumentSummary -first Title -element Accession -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute
Microbe sample from SAR116 cluster bacterium AG-426-B08 SAMN08886454 single cell amplified by WGA-X; seawater
MIMS Environmental/Metagenome sample from SAR11 cluster bacterium PRT-SC02 SAMN02941851 single cell amplified by MDA; hadopelagic water column of the Puerto Rico Trench at 8200 m depth
Generic sample from Roseobacter sp. GAI101 SAMN02436271 seawater off the coast of Georgia
Note: You don't include NCBI_API_KEY in the actual command. Just export that variable once in the terminal you are running the search from.
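For example, something along these lines, run once per terminal session, should be enough. The value shown is only a placeholder and must be replaced with the key generated in your own NCBI account settings.

```shell
# Placeholder value (an assumption for illustration): use your own NCBI API key.
export NCBI_API_KEY="your_api_key_here"
```

After that, the esearch/esummary/xtract commands are run exactly as before. EDirect reads the variable from the environment, so nothing changes in the pipeline itself.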
Any idea how to combine different attributes in one single command? I would like to generate a table with "isolation_source", "host" and "collection_date" from ~1000 BioSample IDs.
Thanks
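One possible approach is to repeat the `-group Attribute -if ... -element Attribute` clause from the command above once per attribute. This is an untested sketch rather than a confirmed answer: it assumes EDirect is installed, `biosample_attrs` is a hypothetical helper name, and the attribute names are simply the ones from the question.

```shell
#!/bin/sh
# Sketch: tabulate several BioSample attributes for a list of accessions.
# biosample_attrs is a hypothetical helper name, not part of EDirect.
# Argument 1: a text file with one BioSample accession per line.
biosample_attrs() {
    id_file=$1
    # Upload the accessions, fetch their summaries, then print the accession
    # followed by one column per requested attribute.
    epost -db biosample -format acc < "$id_file" |
        esummary |
        xtract -pattern DocumentSummary -element Accession \
            -group Attribute -if Attribute@harmonized_name -equals isolation_source -element Attribute \
            -group Attribute -if Attribute@harmonized_name -equals host -element Attribute \
            -group Attribute -if Attribute@harmonized_name -equals collection_date -element Attribute
}
```

Usage would be something like `biosample_attrs biosample.id.list > attributes.tsv`. One caveat: a sample that lacks one of the attributes will likely produce a shorter row, so the columns may not line up for every record and are worth checking before using the table.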