E-utilities pipe to download different genomes
21 months ago
Mike ▴ 20

Hi all,

I want to download genome FASTA files from NCBI by assembly accession (GCF_/GCA_ IDs). I used this command at first:

esearch -db assembly -query GCF_000023565.1 | elink -target nuccore | efetch -format fasta > {genome_out}

to download genomes, but I noticed the result is "duplicated", meaning I get both >NC_ and >CP_ Complete Genomes.

Then I switched to the command

esearch -db nucleotide -query GCF_000023565.1 | efetch -format fasta > {genome_out}

which worked (the file is half the size of the original).

Now when I try downloading GCF_000648595 with the new command I get an empty file. When I switch back to the original command I do manage to get a fasta file but it seems "duplicated" again.

From the NCBI site I can see that the new command only works on IDs with "WGS project".

I hope someone can shed some light on this. I am hoping for a single command that can handle both the GCF_000023565.1 and GCF_000648595 examples, and if that is not possible, then maybe a way to switch between commands based on some condition.

I have tried multiple other parameters and none seems to work.

Any help will be highly appreciated.

ncbi fasta wgs • 1.5k views

There was a similar question earlier today.


Thanks. The command I posted is the exact one GenoMax gave me in that post. However, the problem there was an NCBI error, which I ran into as well, but it is not the issue I'm facing now.

21 months ago
GenoMax 147k

This will get you RefSeq entries.

$ esearch -db assembly -query GCF_000023565.1 | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if SourceDb -contains refseq -element Caption | efetch -db nuccore -format fasta

Beware that there can be multiple entries.

$ esearch -db assembly -query GCF_000648595 | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if SourceDb -contains refseq -element Caption 
NZ_ADOU02000013
NZ_ADOU02000012
NZ_ADOU02000011
NZ_ADOU02000010
NZ_ADOU02000009
NZ_ADOU02000008
NZ_ADOU02000007
NZ_ADOU02000006
NZ_ADOU02000005
NZ_ADOU02000004
NZ_ADOU02000003
NZ_ADOU02000002
NZ_ADOU02000001
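
If the combined FASTA from the plain `elink | efetch` pipeline is already on disk, the GenBank copies can also be dropped locally instead of re-querying. A minimal sketch, assuming RefSeq accessions are the ones with an underscore after a two-letter prefix (`NC_`, `NZ_`, ...) while GenBank accessions have none; the function name `keep_refseq` and the file names are made up for illustration:

```shell
# Keep only FASTA records whose header accession looks like a RefSeq one
# (two uppercase letters followed by an underscore, e.g. NC_ or NZ_).
# Assumption: headers start with ">ACCESSION ..." as efetch writes them.
keep_refseq() {
  awk '/^>/ { keep = ($0 ~ /^>[A-Z][A-Z]_/) } keep'
}

# Usage: keep_refseq < all_records.fasta > refseq_only.fasta
```

This only filters by header pattern, so it is worth checking the surviving headers before relying on the output.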
21 months ago
Mike ▴ 20

I have worked around the issue for now by using the FTP path of each genome to download the compressed .fna file and decompressing it.
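
For reference, a sketch of that FTP route. It assumes the assembly document summary exposes the directory as `FtpPath_RefSeq`, and that the sequence file inside is named after the directory with a `_genomic.fna.gz` suffix. `fna_url` and `ftp_dir` are hypothetical helper names, and `ftp_dir` may print more than one line if the query matches several assemblies:

```shell
# Build the URL of the genomic FASTA from an assembly's FTP directory.
# Assumption: the file is <dir>/<last path component of dir>_genomic.fna.gz.
fna_url() {
  local dir=${1%/}                 # drop a trailing slash, if any
  printf '%s/%s_genomic.fna.gz\n' "$dir" "${dir##*/}"
}

# Look up the FTP directory for an accession (needs EDirect and network access).
ftp_dir() {
  esearch -db assembly -query "$1" |
    efetch -format docsum |
    xtract -pattern DocumentSummary -element FtpPath_RefSeq
}

# Usage: curl -sS "$(fna_url "$(ftp_dir GCF_000023565.1)")" | gunzip > genome.fna
```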


You are better off using the NCBI datasets tool for this purpose. EntrezDirect was not designed for large downloads such as entire genomes.

See --> downloading genomes in fasta format from accession ids
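
A sketch of what the `datasets` route could look like. It assumes a recent `datasets` CLI in which `download genome accession` accepts `--include genome` and `--filename`; older releases used different flags, so check `datasets download genome accession --help` first. The function name and output file names are illustrative:

```shell
# Download one zip archive per assembly accession with the NCBI datasets CLI.
# Assumption: a datasets version that supports --include genome.
fetch_genome_zips() {
  for acc in "$@"; do
    datasets download genome accession "$acc" \
      --include genome --filename "${acc}.zip" || return 1
  done
}

# Usage: fetch_genome_zips GCF_000023565.1
```

Each archive should then contain the genomic FASTA for that assembly only, which avoids the duplicated RefSeq/GenBank records seen with `elink`.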

21 months ago
5heikki 11k

There's now a more convenient way to do this kind of thing. Check out NCBI Datasets.


datasets was already recommended in my comment above.
