Question

How can I retrieve genomes of isolates from a specific region? For example, I want to retrieve all the Escherichia genome FASTA files from NCBI that were submitted from the USA.

0

2.9 years ago

Jaykumar ▴ 50

I am beginning my work and was wondering how to do this.

USA coli Genome NCBI Escherichia • 1.8k views

2

2.9 years ago

GenoMax 151k

One way is to use Entrez Direct (EDirect):

$  esearch -db assembly -query "562 [taxID]" | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,BioSampleAccn,SubmitterOrganization,FtpPath_GenBank | head -10
    GCA_024134465.1 SAMN29473037    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/134/465/GCA_024134465.1_PDT001351087.1
    GCA_024134005.1 SAMN29474218    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/134/005/GCA_024134005.1_PDT001351197.1
    GCA_024133985.1 SAMN29473020    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/985/GCA_024133985.1_PDT001351231.1
    GCA_024133965.1 SAMN29474283    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/965/GCA_024133965.1_PDT001351211.1
    GCA_024133945.1 SAMN29473019    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/945/GCA_024133945.1_PDT001351237.1
    GCA_024133825.1 SAMN29474253    Health Protection Agency        ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/825/GCA_024133825.1_PDT001351312.1
    GCA_024133685.1 SAMN29473230    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/685/GCA_024133685.1_PDT001351133.1
    GCA_024133585.1 SAMN29474221    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/585/GCA_024133585.1_PDT001351162.1
    GCA_024133405.1 SAMN29474277    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/405/GCA_024133405.1_PDT001351205.1
    GCA_024133385.1 SAMN29473022    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/385/GCA_024133385.1_PDT001351223.1

If the submitter organization in the third column is not enough to pick out the entries you need, you can run more elaborate queries on the BioSample accessions in the second column.
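As a rough sketch of what can be done with that table: the function below reads the four tab-separated columns, keeps the rows for one submitter, and prints the accession with the URL of the genomic FASTA file, which sits in the FTP directory under the directory's own name plus `_genomic.fna.gz`. The name `fasta_urls_for_submitter` is made up for illustration. Keep in mind that the submitter is only a proxy for where an isolate came from.

```shell
# Sketch: filter the xtract table (accession, BioSample, submitter, FTP dir)
# by submitter and print "accession<TAB>genomic FASTA URL".
# fasta_urls_for_submitter is a hypothetical helper name.
fasta_urls_for_submitter() {
  awk -F'\t' -v want="$1" '
    $3 == want && $4 != "" {
      dir = $4
      sub(/\/$/, "", dir)            # drop a trailing slash, if present
      n = split(dir, parts, "/")     # last path component is the file prefix
      print $1 "\t" dir "/" parts[n] "_genomic.fna.gz"
    }'
}

# Demo on two rows copied from the output above (no network access needed).
printf '%s\t%s\t%s\t%s\n' \
  GCA_024134465.1 SAMN29473037 "CDC" \
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/134/465/GCA_024134465.1_PDT001351087.1 \
  GCA_024133825.1 SAMN29474253 "Health Protection Agency" \
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/825/GCA_024133825.1_PDT001351312.1 |
  fasta_urls_for_submitter "CDC"
```

In real use you would pipe the esearch/esummary/xtract output (without `head -10`) into the function and hand the second column to a downloader of your choice.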

score 5 · Accepted Answer · 2022-07-06

Hi Jaykumar,

You can use the NCBI Datasets command-line tool for this task. You will also need jq to process the JSON metadata. Here are the steps:

Using the datasets summary option, list the accession and location of every assembly whose location is in the USA:

datasets summary genome taxon 562 --as-json-lines |\
 grep -E "\"value\":\"USA" |\
 jq -r '.assembly_accession as $accs | .biosample.attributes[] 
| select(.name == "geo_loc_name") 
| select(.value | contains("USA")) 
| [$accs,.value] 
| @tsv'
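Before downloading anything, it can help to see how the matches are distributed. The sketch below tallies the two-column output of the command above by location. `count_by_location` is a hypothetical helper, and it assumes values shaped like "USA" or "USA: Texas", which is common in BioSample records but not guaranteed. The accessions in the demo are placeholders, not real records.

```shell
# Sketch: summarize "accession<TAB>geo_loc_name" lines read from stdin.
# count_by_location is a hypothetical helper name.
count_by_location() {
  awk -F'\t' '
    {
      loc = $2
      sub(/^USA:? */, "", loc)       # strip the country prefix
      if (loc == "") loc = "unspecified"
      count[loc]++
    }
    END { for (loc in count) print count[loc] "\t" loc }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Demo with placeholder accessions; prints one "count<TAB>location" per line.
printf '%s\t%s\n' \
  GCA_000000001.1 "USA: Texas" \
  GCA_000000002.1 "USA:Texas" \
  GCA_000000003.1 "USA" |
  count_by_location
```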

Alternatively, if you only want the accession numbers, you can do this:

# any(...) checks name and value on the same attribute, so each assembly is listed at most once
datasets summary genome taxon 562 | jq -r '.assemblies[].assembly
| select(any(.biosample.attributes[]?; .name == "geo_loc_name" and (.value // "" | contains("USA"))))
| .assembly_accession' > ecoli_usa_accessions.txt
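An accession list like this is worth a quick sanity check before it is passed to the download step. The following is a minimal sketch, assuming one accession per line. `clean_accessions` is a hypothetical helper that keeps only well-formed GCA_/GCF_ accessions and removes duplicates.

```shell
# Sketch: keep well-formed assembly accessions (GCA_/GCF_, nine digits,
# a version number) and drop duplicates. clean_accessions is a hypothetical name.
clean_accessions() {
  tr -d '\r' |                                   # tolerate Windows line endings
    grep -E '^GC[AF]_[0-9]{9}\.[0-9]+$' |
    LC_ALL=C sort -u
}

# Example use (file names are only suggestions):
#   clean_accessions < ecoli_usa_accessions.txt > ecoli_usa_accessions.clean.txt
#   wc -l ecoli_usa_accessions.clean.txt
```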

Using the datasets download option, you can then download only the genomes from the USA, based on the accession list created in the previous step.

datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt \
   --filename ecoli_usa.zip

This will download a data package with genomic sequences, as well as protein FASTA, CDS FASTA and GFF3, if they are available, plus metadata files. If you don't need all those files, you can exclude them like this:

datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt \
   --exclude-genomic-cds --exclude-protein --exclude-gff3 \
   --filename ecoli_usa_seq_only.zip
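Once an archive has been extracted, the genome FASTA files sit under `ncbi_dataset/data/<accession>/`. The sketch below lists them while skipping `cds_from_genomic.fna`. `list_genome_fastas` is a hypothetical helper, and the extraction directory name is only a suggestion.

```shell
# Sketch: list genomic FASTA files in an unpacked datasets package.
# Assumes the archive was extracted first, for example:
#   unzip -q ecoli_usa.zip -d ecoli_usa
# list_genome_fastas is a hypothetical helper name.
list_genome_fastas() {
  # Genome files are named <accession>_<assembly name>_genomic.fna,
  # so the GC[AF]_ prefix excludes cds_from_genomic.fna.
  find "$1/ncbi_dataset/data" -type f -name 'GC[AF]_*_genomic.fna' |
    LC_ALL=C sort
}

# Example use after extraction:
#   list_genome_fastas ecoli_usa > ecoli_usa_fasta_paths.txt
```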

Let me know if you have any other questions or run into any issues.