How to use efetch to download genome/proteome/etc. at the command line?
13 months ago
dec986 ▴ 380

I'm trying to download transcriptomes from NCBI, one of which can be seen here: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/latest_assembly_versions/GCF_000146045.2_R64/GCF_000146045.2_R64_rna_from_genomic.fna.gz

and am trying to use NCBI's efetch to help. I've tried

efetch -db protein -id GCF_000146045.2 -format fasta

efetch -db gds -id GCF_000146045.2_R64

and numerous iterations thereof, but to no avail.

I've read through https://www.ncbi.nlm.nih.gov/books/NBK179288/ and https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Writing_Advanced_Sea but nothing relevant seems to turn up.

I'm aware that this can be downloaded through the browser, or through wget with the link, but I need to script this to avoid errors and get the links. Preferably all I need to enter is species and rRNA.

How can I use efetch or some other command line tool to download this data at the command line?

esearch ncbi efetch • 1.7k views
13 months ago
GenoMax 147k

Use the NCBI datasets tool for this type of workload (LINK):

datasets download genome accession GCF_000146045.2 --include rna

Thank you! For reference, the programs can be downloaded from https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/ Is there a way of downloading the data without having to know the accession, i.e. just using the species name?


I think one way would be to use the taxon ID for the species you are interested in. One could always search for the accessions using EntrezDirect (which it is good at) and then use those with datasets.
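The EntrezDirect route mentioned above might look roughly like this. It is only a sketch: the function names, the "latest refseq" filter and the accessions.txt file name are assumptions for illustration, not something confirmed in this thread, and it needs esearch, esummary, xtract and datasets on the PATH.

```shell
#!/usr/bin/env bash
# Sketch: species name -> assembly accessions (EntrezDirect) -> RNA download (datasets).
# Function names and the "latest refseq" filter are illustrative assumptions.

# Build an Entrez query for the assembly database from a species name.
build_assembly_query() {
    printf '"%s"[Organism] AND "latest refseq"[filter]' "$1"
}

# Print one assembly accession per line for the given species.
list_assembly_accessions() {
    esearch -db assembly -query "$(build_assembly_query "$1")" </dev/null \
        | esummary \
        | xtract -pattern DocumentSummary -element AssemblyAccession
}

# Save the accessions to a file and hand that file to datasets.
download_rna_for_species() {
    local species=$1 accs=${2:-accessions.txt}
    list_assembly_accessions "$species" > "$accs"
    datasets download genome accession --inputfile "$accs" --include rna
}

# Usage (needs network access):
#   download_rna_for_species "Saccharomyces cerevisiae"
```

Keeping the query builder separate makes it easy to swap in a different filter if the RefSeq-only restriction is too narrow for your species.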


Yes, you can use the species name. Use the taxon option and provide a scientific name, a common name, or an NCBI taxid:

$ datasets download genome taxon 559292 --include rna

Collecting 3 genome records [================================================] 100% 3/3
Downloading: ncbi_dataset.zip    2.79MB valid zip archive
Validating package files [================================================] 100% 4/4

$ unzip ncbi_dataset.zip -d 559292
Archive:  ncbi_dataset.zip
  inflating: 559292/README.md        
  inflating: 559292/ncbi_dataset/data/assembly_data_report.jsonl  
  inflating: 559292/ncbi_dataset/data/GCF_000146045.2/rna.fna  
  inflating: 559292/ncbi_dataset/data/dataset_catalog.json
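
Once the archive is unpacked, each assembly sits in its own directory under ncbi_dataset/data, so a small helper can collect the RNA FASTA paths for a script. This is a sketch that assumes the layout in the unzip listing above; find_rna_fasta is a made-up name, not part of the datasets tool.

```shell
#!/usr/bin/env bash
# Sketch: list every rna.fna under an unpacked datasets archive.
# Assumes the <dir>/ncbi_dataset/data/<accession>/rna.fna layout shown above;
# find_rna_fasta is an illustrative name, not part of the datasets tool.
find_rna_fasta() {
    local dir=$1
    find "$dir/ncbi_dataset/data" -type f -name 'rna.fna' | sort
}

# Usage:
#   unzip ncbi_dataset.zip -d 559292
#   find_rna_fasta 559292
```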

Please let me know if you have any questions.


MirianT_NCBI: Can the preview option actually list/show the accessions that would be included? As is, the information shown is not very useful.

$ datasets download genome taxon bos --preview
New version of client (15.24.0) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Collecting 73  records [================================================] 100% 73/73
{"resource_updated_on":"2023-10-17T07:38:10Z","record_count":73,"estimated_file_size_mb":42006,"included_data_files":{"all_genomic_fasta":{"file_count":1760,"size_mb":57079.965}}}

GenoMax, you can check the list of accessions using a different command:

datasets summary genome taxon bos --report ids_only | jq

{
  "reports": [
    {
      "accession": "GCA_007844835.1",
      "source_database": "SOURCE_DATABASE_GENBANK"
    },
    {
      "accession": "GCA_017311355.1",
      "source_database": "SOURCE_DATABASE_GENBANK"
    },
    {
      "accession": "GCA_014182915.2",
      "source_database": "SOURCE_DATABASE_GENBANK"
    },
    {
      "accession": "GCA_946052875.1",
      "source_database": "SOURCE_DATABASE_GENBANK"
    },
    {
      "accession": "GCA_005887515.3",
      "source_database": "SOURCE_DATABASE_GENBANK"
    },
    {
      "accession": "GCA_027580245.1",
      "source_database": "SOURCE_DATABASE_GENBANK"
    },

The idea of the --preview flag under the download option is to give users an idea of the data package size. But I can see how it would be useful to have other info there. Is there anything else you would like to see? I can pass the suggestions to the team :) Thanks!
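
To turn that report into a plain list that can be fed back into a download, something like the following may work. It is a sketch assuming the {"reports": [{"accession": ...}]} shape shown above and that jq is installed; extract_accessions and accessions.txt are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: reduce the ids_only report to one accession per line.
# Assumes the {"reports":[{"accession": ...}]} shape shown above and that jq is installed;
# extract_accessions is an illustrative name.
extract_accessions() {
    jq -r '.reports[].accession'
}

# Usage (needs network access):
#   datasets summary genome taxon bos --report ids_only | extract_accessions > accessions.txt
#   datasets download genome accession --inputfile accessions.txt --include rna
```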

13 months ago
schlogl ▴ 160

Hi there! Some time ago I put this together. It reads a text file with one identifier per line:

assm_accs.txt = GCA_______.1 GCA_______.2 ...

You can change this part of the code to get GFF/CDS/etc.:

'GCA_.*' | sed 's/$/_genomic.fna.gz/')
'GCA_.*' | sed 's/$/_cds_from_genomic.fna.gz/')
...

You could also avoid using cat and the reading part, but in the end it worked fine. Good luck.

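# Read one assembly accession per line from assm_accs.txt
cat assm_accs.txt | while read -r acc ; do
    # Look up the assembly and extract its GenBank FTP directory
    esearch -db assembly -query "$acc" </dev/null \
        | esummary \
        | xtract -pattern DocumentSummary -element FtpPath_GenBank \
        | while read -r url ; do
            # The last path component (GCA_...) is also the file name prefix
            fname=$(echo "$url" | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/')
            wget "$url/$fname"
        done
done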
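
The suffix swap described above could also be pulled into a small function, so the same loop can fetch different file types. This is a sketch, not part of the original script: assembly_file_url is a hypothetical name, and it assumes files are named after the last component of the FTP directory, as in the link in the question.

```shell
#!/usr/bin/env bash
# Sketch: build the download URL for one file type from an assembly FTP directory.
# assembly_file_url is an illustrative name; it assumes the
# <directory name>_<suffix> file naming seen in the question's link.
assembly_file_url() {
    local dir_url=${1%/} suffix=$2     # drop a trailing slash if present
    local base=${dir_url##*/}          # e.g. GCF_000146045.2_R64
    printf '%s/%s_%s\n' "$dir_url" "$base" "$suffix"
}

# Usage inside the loop above:
#   wget "$(assembly_file_url "$url" cds_from_genomic.fna.gz)"
```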