Question

From GCA identifiers, download GenBank file format?

3.1 years ago

MSRS ▴ 590

Hi, I have found a number of posts about downloading files from NCBI. One of them pointed me to CLI tools, but as far as I can tell they can only download FASTA, GFF3, and protein files for a list of GenBank assembly accessions (GCA identifiers such as GCA_001874685.1, GCA_021460555.1), not the GenBank file.

Is there any way to download the full GenBank file for each accession in a list of GenBank assembly accessions?

GCA_001874685.1
GCA_021460555.1
GCA_001874915.1

Thanks in advance

NCBI GenBank • 2.8k views

ADD COMMENT • link updated 9 weeks ago by cmdcolin ★ 4.2k • written 3.1 years ago by MSRS ▴ 590

score 3 · Accepted Answer · 2022-04-27

3.1 years ago

vkkodali_ncbi ★ 3.8k

NCBI Datasets and the associated command line tool datasets can be used to download GenBank flat files for a GCA accession. It is not a default setting, so you need to add it to the command line as shown below:

datasets download genome accession GCA_001874685.1 --include-gbff
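For the three accessions in the question, the same command can be wrapped in a small function and called in a loop. This is only a sketch: the function name `download_gbff` is made up for illustration, and it assumes the `datasets` CLI is on your PATH. It keeps the `--include-gbff` spelling from the command above. More recent releases of the CLI are, as far as I know, expected to take `--include gbff` instead, so check `datasets download genome accession --help` for your version. The `--filename` option is used so that each package gets its own zip instead of the default ncbi_dataset.zip.

```shell
# Sketch: download the GenBank flat file package for one assembly accession.
# download_gbff is a hypothetical helper name, not part of the datasets CLI.
download_gbff() {
    acc="$1"
    # --include-gbff is the spelling used in the answer above; newer CLI
    # releases are believed to expect "--include gbff" instead.
    datasets download genome accession "$acc" --include-gbff --filename "${acc}.zip"
}

# Usage (needs network access and the datasets CLI on PATH):
#   for acc in GCA_001874685.1 GCA_021460555.1 GCA_001874915.1; do
#       download_gbff "$acc"
#   done
```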

ADD COMMENT • link 3.1 years ago by vkkodali_ncbi ★ 3.8k

Thank you so much.

ADD REPLY • link 3.1 years ago by MSRS ▴ 590

Thanks, great tool. I'm not sure whether there is a way to automatically unzip the ncbi_dataset.zip file that it downloads, but I added "unzip -o ncbi_dataset.zip; rm -f ncbi_dataset.zip" after this command.

ADD REPLY • link 9 weeks ago by cmdcolin ★ 4.2k
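The unzip step from the comment above can be sketched as a small helper. The function name `unpack_package` and the choice of a per-package output directory are assumptions made for illustration. The one behavioural difference from the one-liner is `&&` in place of `;`, so the zip is deleted only if extraction succeeded.

```shell
# Sketch: unpack a datasets zip into a target directory and remove the zip
# only when unzip reports success. unpack_package is a hypothetical name.
unpack_package() {
    zip_file="$1"
    out_dir="$2"
    # -o overwrites existing files without prompting, -q keeps output quiet,
    # -d selects the directory to extract into.
    unzip -o -q "$zip_file" -d "$out_dir" && rm -f "$zip_file"
}

# Usage:
#   unpack_package ncbi_dataset.zip GCA_001874685.1
```

After unpacking, the GenBank flat file should be somewhere under the ncbi_dataset/data/ directory of the package; the exact layout may differ between CLI versions.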

score 2 · Accepted Answer · 2022-04-27

Hi, NCBI provides FTP access to the required files, with a directory structure based on the accession numbers.

FTP method

For example, the files for GCA_001874685.1 are stored in ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1/ and among them is GCA_001874685.1_ASM187468v1_genomic.gbff.gz.

So the only bit of information that you don't know is the '_ASM...' part. You can look inside the .../685 directory and download only the gbff.gz file from the subdirectory whose name starts with GCA_001874685.1. This can be done with any FTP client (or NCBI's Aspera download utility).

Entrez method

esearch -query GCA_001874685.1 -db assembly | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1

Now you can run wget on that path:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/874/685/GCA_001874685.1_ASM187468v1/*.gbff.gz

# if you want e.g. only the genomic gbff then "*_genomic.gbff.gz" will do the trick
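To do this for a whole list of accessions without a wildcard, the file URL can be built from the FtpPath_GenBank value, since the file name repeats the directory name (as in the GCA_001874685.1_ASM187468v1 example). This is a sketch under assumptions: `gbff_url` is a made-up helper name, accessions.txt is a hypothetical file with one accession per line, and each accession is assumed to return exactly one FTP path.

```shell
# Sketch: given an FtpPath_GenBank value, print the URL of the genomic
# GBFF file inside that directory. gbff_url is a hypothetical helper name.
gbff_url() {
    ftp_path="${1%/}"              # drop a trailing slash, if any
    asm_dir="${ftp_path##*/}"      # last path component, e.g. GCA_001874685.1_ASM187468v1
    printf '%s/%s_genomic.gbff.gz\n' "$ftp_path" "$asm_dir"
}

# Usage (needs EDirect, wget, and network access; accessions.txt is a
# hypothetical input file with one GCA accession per line):
#   while read -r acc; do
#       path="$(esearch -db assembly -query "$acc" < /dev/null |
#               esummary |
#               xtract -pattern DocumentSummary -element FtpPath_GenBank)"
#       wget "$(gbff_url "$path")"
#   done < accessions.txt
```

The `< /dev/null` on esearch is there so that it does not consume the rest of accessions.txt from the loop's standard input.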

Entrez method 2

# Of course, you can do more with Entrez itself, so something like this will work
esearch -query GCA_001874685.1 -db assembly | elink -target nuccore | efetch -format gb

# But note that the records you receive this way can come from RefSeq instead of GenBank (which is what your accession refers to).
# I don't know off the top of my head how to filter the RefSeq records out.
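One possible way to drop the RefSeq records, offered as a sketch and not as a tested recipe, is to filter on the accession itself: RefSeq accessions carry a two-letter prefix followed by an underscore (NC_, NZ_, and so on), while GenBank nucleotide accessions contain no underscore. The function name `drop_refseq` is made up for illustration. elink also has a -name option, and a link name such as assembly_nuccore_insdc may do the same filtering on the server side, but that is not verified here.

```shell
# Sketch: read accessions on stdin, one per line, and keep only those
# without a RefSeq-style prefix (two capital letters and an underscore).
# drop_refseq is a hypothetical helper name.
drop_refseq() {
    grep -v '^[A-Z][A-Z]_'
}

# Usage (needs EDirect and network access):
#   esearch -db assembly -query GCA_001874685.1 |
#       elink -target nuccore |
#       efetch -format acc |
#       drop_refseq > genbank_accessions.txt
```

The resulting accession list can then be passed back to efetch (for example through its -id option) with -format gb.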