I am beginning my work and was wondering how to do this.
Hi Jaykumar,
You can use NCBI Datasets command line tool for this task. You will also need jq to process the metadata files in JSON. Here are the steps:
Using the datasets summary option, get a list of accessions and locations from USA only:
datasets summary genome taxon 562 --as-json-lines |\
grep -E "\"value\":\"USA" |\
jq -r '.assembly_accession as $accs | .biosample.attributes[]
| select(.name == "geo_loc_name")
| select(.value | contains("USA"))
| [$accs,.value]
| @tsv'
Alternatively, if you only want the accession numbers, you can do this:
datasets summary genome taxon 562 | jq -r '.assemblies[].assembly
| select(any(.biosample.attributes[]?; .name == "geo_loc_name" and (.value | contains("USA"))))
| .assembly_accession' > ecoli_usa_accessions.txt
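Before passing the list to the download step, it can be worth checking that it contains only unique, well-formed accessions, since jq -r prints null for any record that lacks the field. A minimal sketch, assuming one accession per line; the function name clean_accessions and the demo values are only illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Keep only unique, well-formed assembly accessions
# (GCA_ or GCF_, nine digits, a dot, and a version number).
clean_accessions() {
  grep -E '^GC[AF]_[0-9]{9}\.[0-9]+$' "$1" | sort -u
}

# Demo on a throwaway file; the values are placeholders, not real USA hits.
demo=$(mktemp)
printf '%s\n' GCA_024134465.1 GCA_024134005.1 GCA_024134465.1 null > "$demo"
clean_accessions "$demo"
rm -f "$demo"
```

On the real list you would run clean_accessions ecoli_usa_accessions.txt, redirect the result to a new file, and pass that file to --inputfile.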
Using the datasets download option, you can download only the genomes from USA based on the list we created:
datasets download genome accession \
--inputfile ecoli_usa_accessions.txt \
--filename ecoli_usa.zip
This will download a data package with genomic sequences, as well as protein FASTA, CDS FASTA and GFF3, if they are available, plus metadata files. If you don't need all those files, you can exclude them like this:
datasets download genome accession \
--inputfile ecoli_usa_accessions.txt \
--exclude-genomic-cds --exclude-protein --exclude-gff3 \
--filename ecoli_usa_seq_only.zip
Let me know if you have any other questions or run into any issues.
One way is to use EntrezDirect:
$ esearch -db assembly -query "562 [taxID]" | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,BioSampleAccn,SubmitterOrganization,FtpPath_GenBank | head -10
GCA_024134465.1 SAMN29473037 CDC ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/134/465/GCA_024134465.1_PDT001351087.1
GCA_024134005.1 SAMN29474218 CDC ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/134/005/GCA_024134005.1_PDT001351197.1
GCA_024133985.1 SAMN29473020 CDC ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/985/GCA_024133985.1_PDT001351231.1
GCA_024133965.1 SAMN29474283 CDC ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/965/GCA_024133965.1_PDT001351211.1
GCA_024133945.1 SAMN29473019 CDC ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/945/GCA_024133945.1_PDT001351237.1
GCA_024133825.1 SAMN29474253 Health Protection Agency ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/825/GCA_024133825.1_PDT001351312.1
GCA_024133685.1 SAMN29473230 CDC ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/685/GCA_024133685.1_PDT001351133.1
GCA_024133585.1 SAMN29474221 CDC ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/585/GCA_024133585.1_PDT001351162.1
GCA_024133405.1 SAMN29474277 CDC ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/405/GCA_024133405.1_PDT001351205.1
GCA_024133385.1 SAMN29473022 CDC ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/385/GCA_024133385.1_PDT001351223.1
You can run more elaborate queries on the sample accessions in the second column if you can't get the entries you need by parsing the third column.
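Parsing the third column could look roughly like the sketch below. It assumes the xtract output is tab-separated (its default) and that the submitter organization is an acceptable proxy for origin, which is only a rough approximation; the function name filter_by_submitter is made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read tab-separated rows (accession, BioSample, submitter, FTP path) on stdin
# and print accession and BioSample for rows whose submitter equals $1 exactly.
filter_by_submitter() {
  awk -F'\t' -v org="$1" '$3 == org { print $1 "\t" $2 }'
}

# Demo with two rows copied from the output above.
printf '%s\t%s\t%s\t%s\n' \
  GCA_024134465.1 SAMN29473037 "CDC" \
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/134/465/GCA_024134465.1_PDT001351087.1 \
  GCA_024133825.1 SAMN29474253 "Health Protection Agency" \
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/825/GCA_024133825.1_PDT001351312.1 |
  filter_by_submitter "CDC"
```

In practice you would pipe the esearch | esummary | xtract command straight into filter_by_submitter instead of the printf demo. Splitting on tabs matters here because submitter names such as "Health Protection Agency" contain spaces.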
I forgot to mention one more thing: since you'll be downloading a lot of data and files, I would recommend using the --dehydrated flag when downloading. This option will give you the metadata files and a text file with the paths to retrieve the data. Data retrieval will be faster and can be resumed if it fails. Here are the next steps:
Unzip the dehydrated package:
Rehydrate (aka retrieve/download) ALL data files:
As an alternative, you can retrieve only the genomic assembly files, like this:
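A minimal sketch of the whole dehydrated workflow, under some assumptions: the zip and directory names are invented for this example, and the run helper is a hypothetical wrapper that only prints each command unless DRY_RUN=0 is set, so you can review the commands before anything is downloaded.

```shell
#!/usr/bin/env bash
set -euo pipefail

ACCESSIONS="ecoli_usa_accessions.txt"   # list created earlier in the thread
ZIP="ecoli_usa_dehydrated.zip"          # assumed output name
DIR="ecoli_usa_dehydrated"              # assumed unpack directory

# Print each command; execute it only when DRY_RUN=0 is set.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

dehydrated_workflow() {
  # 1. Download metadata plus a list of file locations, not the sequences.
  run datasets download genome accession --inputfile "$ACCESSIONS" \
    --dehydrated --filename "$ZIP"
  # 2. Unzip the dehydrated package.
  run unzip "$ZIP" -d "$DIR"
  # 3. Rehydrate: fetch the data files listed in the package.
  run datasets rehydrate --directory "$DIR"
}

dehydrated_workflow
```

Run it as DRY_RUN=0 bash script.sh to execute the commands for real. For the genomic-only alternative, datasets rehydrate should accept a --match option that takes a substring of the file names to retrieve (for example genomic.fna); treat that as an assumption and check datasets rehydrate --help for your installed version.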
I hope it helps!
Thank you very much! It worked!!