How to retrieve genomes of isolates from specific regions? For example, I want to retrieve all the Escherichia genome FASTA files from NCBI that were submitted from the USA.
2.4 years ago
Jaykumar ▴ 50

I am beginning my work and was wondering how to do this.

USA coli Genome NCBI Escherichia • 1.4k views
2.4 years ago
MirianT_NCBI ▴ 760

Hi Jaykumar,

You can use the NCBI Datasets command-line tool for this task. You will also need jq to process the JSON metadata. Here are the steps:

  1. Using the datasets summary option, get a list of accessions and locations for isolates from the USA only:
datasets summary genome taxon 562 --as-json-lines |\
 grep -E "\"value\":\"USA" |\
 jq -r '.assembly_accession as $accs | .biosample.attributes[] 
| select(.name == "geo_loc_name") 
| select(.value | contains("USA")) 
| [$accs,.value] 
| @tsv'
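
If the two-column output above (accession, geo_loc_name) is redirected to a file, a quick tally of the location values shows how the records break down before anything is downloaded. This is only a minimal sketch: the function name tally_locations and the file name ecoli_usa_locations.tsv are hypothetical, and it assumes the columns are tab-separated as produced by @tsv.

```shell
# Hypothetical helper: count how many assemblies share each geo_loc_name value.
# Input: a TSV file with the accession in column 1 and the location in column 2.
# Output: "<count><TAB><location>", most frequent location first.
tally_locations() {
    awk -F'\t' '{ n[$2]++ } END { for (k in n) printf "%d\t%s\n", n[k], k }' "$1" |
        sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Example (the file name is an assumption):
# tally_locations ecoli_usa_locations.tsv
```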

Alternatively, if you only want the accession numbers, you can do this:

datasets summary genome taxon 562 | jq -r '.assemblies[].assembly 
| select(any(.biosample.attributes[]?; .name == "geo_loc_name" and ((.value // "") | contains("USA")))) 
| .assembly_accession' > ecoli_usa_accessions.txt
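
Before downloading, it may be worth checking that the list contains only unique, well-formed accessions. A minimal sketch using standard tools; the function name clean_accessions and the output file name are hypothetical, and the pattern assumes the usual GCA_/GCF_ accession format (nine digits plus a version).

```shell
# Hypothetical helper: keep unique, well-formed assembly accessions
# (GCA_ or GCF_ prefix, nine digits, a version suffix) and drop anything else,
# such as blank lines or the string "null".
clean_accessions() {
    grep -E '^GC[AF]_[0-9]{9}\.[0-9]+$' "$1" | sort -u
}

# Example (the output file name is an assumption):
# clean_accessions ecoli_usa_accessions.txt > ecoli_usa_accessions.clean.txt
```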
  2. Using the datasets download option, download only the genomes from the USA, based on the list created in step 1:
datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt \
   --filename ecoli_usa.zip

This will download a data package with genomic sequences, as well as protein FASTA, CDS FASTA and GFF3, if they are available, plus metadata files. If you don't need all those files, you can exclude them like this:

datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt \
   --exclude-genomic-cds --exclude-protein --exclude-gff3 \
   --filename ecoli_usa_seq_only.zip

Let me know if you have any other questions or run into any issues.


I forgot to mention one more thing: since you'll be downloading a lot of data and files, I recommend using the --dehydrated flag when downloading, like this:

datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt \
   --dehydrated \
   --filename ecoli_usa_dehydrated.zip

This option gives you the metadata files and a text file listing the paths from which the data can be retrieved. Data retrieval will be faster and can be resumed if it fails. Here are the next steps:

  • Unzip the dehydrated package:

    unzip ecoli_usa_dehydrated.zip -d ecoli_usa
    
  • Rehydrate (aka retrieve/download) ALL data files:

    datasets rehydrate --directory ecoli_usa
    
  • As an alternative, you can retrieve only the genomic assembly files, like this:

    datasets rehydrate --directory ecoli_usa --match "GC.*/GC.*genomic.fna"
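
After rehydrating, a quick way to confirm that everything arrived is to list the genomic FASTA files and compare the count with the accession list. A minimal sketch, assuming the usual package layout of ncbi_dataset/data/<accession>/ inside the unzipped directory; the function name list_genomic_fasta is hypothetical.

```shell
# Hypothetical helper: list the genomic FASTA files in a rehydrated package.
# Assumes the layout <dir>/ncbi_dataset/data/<accession>/GC*_genomic.fna.
# The GC* prefix keeps files such as cds_from_genomic.fna out of the list.
list_genomic_fasta() {
    find "$1/ncbi_dataset/data" -type f -name 'GC*_genomic.fna' | sort
}

# Example: the two numbers should match if every genome was retrieved.
# list_genomic_fasta ecoli_usa | wc -l
# wc -l < ecoli_usa_accessions.txt
```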
    

I hope it helps!


Thank you very much! It worked!!

2.4 years ago
GenoMax 147k

One way is to use EntrezDirect:

$  esearch -db assembly -query "562 [taxID]" | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,BioSampleAccn,SubmitterOrganization,FtpPath_GenBank | head -10
    GCA_024134465.1 SAMN29473037    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/134/465/GCA_024134465.1_PDT001351087.1
    GCA_024134005.1 SAMN29474218    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/134/005/GCA_024134005.1_PDT001351197.1
    GCA_024133985.1 SAMN29473020    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/985/GCA_024133985.1_PDT001351231.1
    GCA_024133965.1 SAMN29474283    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/965/GCA_024133965.1_PDT001351211.1
    GCA_024133945.1 SAMN29473019    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/945/GCA_024133945.1_PDT001351237.1
    GCA_024133825.1 SAMN29474253    Health Protection Agency        ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/825/GCA_024133825.1_PDT001351312.1
    GCA_024133685.1 SAMN29473230    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/685/GCA_024133685.1_PDT001351133.1
    GCA_024133585.1 SAMN29474221    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/585/GCA_024133585.1_PDT001351162.1
    GCA_024133405.1 SAMN29474277    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/405/GCA_024133405.1_PDT001351205.1
    GCA_024133385.1 SAMN29473022    CDC     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/024/133/385/GCA_024133385.1_PDT001351223.1

If the submitter organization in the third column is not enough to pick out the entries you need, you can run more elaborate queries on the BioSample accessions in the second column.
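
Once the rows of interest are selected, the FTP path in the last column can be turned into a direct link to the genomic FASTA. A minimal sketch, assuming the tab-separated xtract output shown above and NCBI's naming convention in which the file is called <directory name>_genomic.fna.gz; the function name ftp_to_fasta_url is hypothetical.

```shell
# Hypothetical helper: read xtract rows on stdin and print one FASTA URL per row.
# Uses the last field as the FTP path, so a missing submitter column does not matter.
ftp_to_fasta_url() {
    awk -F'\t' '$NF ~ /^(ftp|https?):\/\// {
        n = split($NF, parts, "/")   # last component is the assembly directory name
        print $NF "/" parts[n] "_genomic.fna.gz"
    }'
}

# Example (the output file name is an assumption):
# esearch ... | esummary | xtract ... | ftp_to_fasta_url > fasta_urls.txt
```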
