I want to get the isolation source (clinical/environmental) information for all RefSeq Pseudomonas aeruginosa genomes. As far as I know, roughly 2000 sequenced Pseudomonas aeruginosa genomes are available in NCBI. Sometimes the isolation source is mentioned in the BioSample record, e.g. https://www.ncbi.nlm.nih.gov/biosample/SAMN02732279/ . Since I want to get this information for all 2000 genomes at once, how can I retrieve it using bash? Is there a known script for this purpose?
You can use Entrez Direct for this. As you know, not all BioSample entries have all of the information you want, and even when they do, it is not always under the same attribute. You may want to look at the XML output of esummary and come up with a suitable xtract command that fetches all of the fields you want. As an example, you can use the following query to fetch the name, BioSample accession and isolation source in a three-column tab-delimited format:
## WARNING: returns >3000 results; only first five are shown here
esearch -db assembly -query '"Pseudomonas aeruginosa"[Organism] AND latest_refseq[filter]' \
| elink -db assembly -target biosample -name assembly_biosample \
| esummary \
| xtract -pattern DocumentSummary -first Title -element Accession \
-group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute
Pseudomonas aeruginosa CLJ1 SAMN07372049 lungs (tracheal aspirate)
Pseudomonas aeruginosa CLJ3 SAMN07372048 lungs (tracheal aspirate)
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa SAMN10374626 skin
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa SAMN10374625 Bronchial aspirate
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa SAMN10374624 Biopsy
Thanks for that. It works for me, but I got only 1000 results. I have a table with the assembly IDs (e.g. GCF_000006765) of all my Pseudomonas aeruginosa genomes, so I need to map the results back to this table. How can I map the assembly IDs to the BioSample accessions?
Could this be because a large number of the Biosample entries lack isolation_source information? If you run the command as shown above, you should see >3000 rows in the results but the cases lacking isolation source information will only have two columns of data instead of three. You can pick out a few of those and go digging around in the Biosample DocumentSummary XML for other attributes that may be of use to you.
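To pick out the rows that lack an isolation source, a filter on the number of tab-separated columns may be enough. This is only a minimal sketch: the temporary file stands in for the saved output of the query above, and the second row, including its accession SAMN_PLACEHOLDER, is made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the saved three-column output of the esearch/xtract query.
results=$(mktemp)
# One row taken from the example output, with an isolation source.
printf 'Pseudomonas aeruginosa CLJ1\tSAMN07372049\tlungs (tracheal aspirate)\n' > "$results"
# One invented row without an isolation source (placeholder accession).
printf 'Hypothetical sample title\tSAMN_PLACEHOLDER\n' >> "$results"

# Print the BioSample accession (column 2) of every row that has
# fewer than three tab-separated fields.
missing=$(awk -F'\t' 'NF < 3 { print $2 }' "$results")
echo "$missing"
```

The accessions printed this way are the ones whose BioSample DocumentSummary XML is worth inspecting by hand for alternative attributes.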
How can I map back assembly id with biosample accession?
You can use Entrez Direct for this as shown below. Once you have this table for all of your data, you can join it to the one with isolation source results on column 2.
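For readers following along, here is a rough sketch of what the mapping and the join on column 2 could look like. It is not necessarily the command originally posted. The Entrez Direct pipeline is left as a comment because it needs network access, and the element names AssemblyAccession and BioSampleAccn are my assumption about the assembly DocumentSummary, so check them against the esummary XML. The GCF_EXAMPLE_* IDs and SAMN_PLACEHOLDER are invented placeholders, and the file names are arbitrary.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # make sort and join agree on collation

workdir=$(mktemp -d)

# Assembly-to-BioSample map. With Entrez Direct installed, something like
# this might produce it (element names are an assumption, see above):
#   esearch -db assembly -query '"Pseudomonas aeruginosa"[Organism] AND latest_refseq[filter]' \
#     | esummary \
#     | xtract -pattern DocumentSummary -element AssemblyAccession BioSampleAccn
# Placeholder rows stand in for that output; the assembly IDs are invented.
printf 'GCF_EXAMPLE_1\tSAMN07372049\n'     >  "$workdir/map.tsv"
printf 'GCF_EXAMPLE_2\tSAMN10374626\n'     >> "$workdir/map.tsv"
printf 'GCF_EXAMPLE_3\tSAMN_PLACEHOLDER\n' >> "$workdir/map.tsv"

# Isolation-source table in the three-column format from the first answer.
printf 'Pseudomonas aeruginosa CLJ1\tSAMN07372049\tlungs (tracheal aspirate)\n' > "$workdir/iso.tsv"
printf 'Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa\tSAMN10374626\tskin\n' >> "$workdir/iso.tsv"

# join needs both inputs sorted on the join field (column 2 in each file).
sort -t $'\t' -k2,2 "$workdir/map.tsv" > "$workdir/map.sorted.tsv"
sort -t $'\t' -k2,2 "$workdir/iso.tsv" > "$workdir/iso.sorted.tsv"

# Output: assembly ID, BioSample accession, isolation source.
# -a 1 keeps assemblies with no matching row; -e NA fills the missing field.
join -t $'\t' -1 2 -2 2 -a 1 -e NA -o 1.1,0,2.3 \
  "$workdir/map.sorted.tsv" "$workdir/iso.sorted.tsv" > "$workdir/joined.tsv"

cat "$workdir/joined.tsv"
```

Passing -t with a tab matters here because the sample titles contain spaces, which join would otherwise treat as field separators.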
Hi vkkodali! Could you please post a tutorial on how to annotate a bacterial assembly using NCBI eutils? If possible, cover both online and offline annotation. This would help many visitors here.
Here is one solution I have just found:
You just need to download xml2 for it.