I want to download the annotation file in GFF3 format for the corresponding genome. While this is fairly easy on the NCBI web page, I can't find a way to do the same with efetch or similar tools.
There are a couple of strategies you can try, depending on what you mean by $genome_id. In each case, it's a matter of finding the right FTP path, and then using wget to get the *genomic.gff.gz file in that path:
If you have assembly accessions, you can get FTP paths for each from
the assembly_summary.txt file, and loop through them with wget. See
"Download All The Bacterial Genomes From Ncbi" for a good post on the approach.
If you have nucleotide sequence accessions for chromosomes, you can
use esearch to directly query the Assembly database, and get the FTP
path from the document summary:
If you have nucleotide sequence accessions that don't directly work for queries in the Assembly database (e.g. contigs or scaffolds), you can query in nucleotide first and link to assembly:
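The three strategies above could be sketched roughly as follows. This is not the original poster's code: the function names are made up for illustration, and the sketch assumes that Entrez Direct (esearch, elink, esummary, xtract) is installed, that the Assembly document summary exposes an FtpPath_RefSeq field (FtpPath_GenBank would be the counterpart for GenBank-only assemblies), that ftp_path is column 20 of assembly_summary.txt, and that the annotation file is named after the last component of the FTP directory.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical.

# Build the URL of the *_genomic.gff.gz file from an assembly FTP directory.
# Assumes the layout <ftp_path>/<last path component>_genomic.gff.gz
gff_url() {
    dir="${1%/}"                                  # drop a trailing slash, if any
    printf '%s/%s_genomic.gff.gz\n' "$dir" "${dir##*/}"
}

# Strategy 1: look up an assembly accession in a local assembly_summary.txt.
# Assumes ftp_path is column 20 of the tab-separated file.
ftp_path_from_summary() {                         # args: accession, summary file
    awk -F '\t' -v acc="$1" '$1 == acc { print $20 }' "$2"
}

# Strategy 2: chromosome accession -> Assembly document summary -> FTP path.
ftp_path_from_assembly() {                        # arg: nucleotide accession
    esearch -db assembly -query "$1" \
        | esummary \
        | xtract -pattern DocumentSummary -element FtpPath_RefSeq
}

# Strategy 3: contig/scaffold accession -> nuccore record -> linked assembly.
ftp_path_via_nuccore() {                          # arg: nucleotide accession
    esearch -db nuccore -query "$1" \
        | elink -target assembly \
        | esummary \
        | xtract -pattern DocumentSummary -element FtpPath_RefSeq
}

# Example usage (needs network access):
#   wget "$(gff_url "$(ftp_path_from_assembly "$genome_id")")"
```

Keeping the path lookup separate from the URL construction means the same wget call works whichever strategy produced the FTP path.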
I don't think NCBI offers GFF formatted files through efetch (yet).
Probably the best you can do is either to do it manually on the website (if you don't have many to do) or to efetch the GenBank format and convert that to GFF.
Otherwise, depending on the organism(s) you look for, there might be 'dedicated' databases that offer direct gff download.
Unfortunately, GFF3 still hasn't been added to NCBI's E-utilities as a valid return type, despite having been added to the web tool a year or more ago. That said, we can take advantage of the web-based GFF retrieval tool directly – after inspecting network traffic while pulling GFFs from the NCBI web portal and playing around with the parameters, I was able to reverse engineer how to retrieve a GFF file given an accession number. The results can be retrieved using your favorite file retrieval tool (wget, cURL, etc.). Here's how I do it using wget:
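The command itself is not preserved in this copy of the thread. A minimal sketch of what such a request could look like is given below; it assumes the URL and the `report=gff3` parameter of NCBI's web sequence viewer, which are not part of the documented E-utilities API and may change without notice. The function names are hypothetical.

```shell
#!/bin/sh
# Assumption: this endpoint belongs to NCBI's web sequence viewer, not to a
# documented API; it may change or be rate-limited.
gff3_url() {                                      # arg: accession or accession.version
    printf 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=%s\n' "$1"
}

# Fetch one record; the output file defaults to <accession>.gff.
fetch_gff3() {                                    # args: <acc.ver> [output file]
    wget -O "${2:-$1.gff}" "$(gff3_url "$1")"
}

# Example usage (needs network access):
#   fetch_gff3 "<acc.ver>"
```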
[<acc.ver> in the example query string above should be replaced with your accession.version or accession, e.g. KC145265.1.]
N.B.: It's relatively straightforward to pull multiple GFFs from separate entries using a comma-separated list of identifiers, but I haven't stress tested this, nor have I slammed NCBI with so many queries that NCBI would feel compelled to block this type of web request. Here's a multi-identifier example:
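The original multi-identifier example is not preserved here either. As a hedged sketch, the comma-separated form could be built as below, assuming the same undocumented web viewer endpoint as for a single record; `multi_gff3_url` and the placeholder accessions are illustrative only.

```shell
#!/bin/sh
# Join all arguments with commas and request them in a single call.
# Assumption: undocumented NCBI web viewer endpoint, subject to change.
multi_gff3_url() {                                # args: one or more accessions
    ids=$(IFS=,; printf '%s' "$*")                # "$*" joins on the first IFS char
    printf 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=%s\n' "$ids"
}

# Example usage (needs network access; replace the placeholders):
#   wget -O combined.gff "$(multi_gff3_url "<acc1.ver>" "<acc2.ver>")"
```

All records end up concatenated in one output file, so keeping the lists short and pausing between requests seems prudent.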
To download the GFF files in batch, prepare a list of accession numbers. Go to Batch Entrez. From the dropdown menu, choose "Assembly". Upload the accession number list and search. To retrieve the GFFs, click "Download Assemblies" and choose the file type GFF. This downloads the GFF files, zipped separately for each accession number. Since the files come named after their projects and you want them in the accession_name.gff format, here is a simple trick: list all the unzipped files in a list.txt file and use the following code.
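The renaming snippet is not preserved in this copy of the thread. One possible version is sketched below, assuming the unzipped files are named like `GCF_000005845.2_ASM584v2_genomic.gff`, so that the accession is the first two underscore-separated fields; `rename_to_accession` is a made-up name.

```shell
#!/bin/sh
# Rename every file listed in list.txt to <accession>.gff.
# Assumption: names start with GCA_/GCF_ followed by the numeric accession,
# e.g. GCF_000005845.2_ASM584v2_genomic.gff -> GCF_000005845.2.gff
rename_to_accession() {                           # arg: list file (default list.txt)
    while IFS= read -r f; do
        [ -f "$f" ] || continue                   # skip blank or missing entries
        name=${f##*/}                             # file name without directory
        case "$name" in GC[AF]_*) ;; *) continue ;; esac
        case "$f" in */*) dir=${f%/*} ;; *) dir=. ;; esac
        acc=$(printf '%s\n' "$name" | cut -d_ -f1,2)
        mv -- "$f" "$dir/$acc.gff"
    done < "${1:-list.txt}"
}

# Example usage:
#   ls *.gff > list.txt && rename_to_accession list.txt
```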
Hi,
You can use the NCBI Datasets command line tool. You can use your list of accessions as an input and customize your data package to only include the GFF3 file, like this:
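The command is missing from this copy of the thread; a sketch of such a call is shown below. It assumes a recent release of the Datasets CLI in which `--include` accepts `gff3`, and the wrapper function name and file names are placeholders.

```shell
#!/bin/sh
# Download only the GFF3 annotation for every assembly accession in a file.
# Assumes a recent NCBI Datasets CLI where --include accepts "gff3".
download_gff3() {                                 # args: accession list, [output zip]
    datasets download genome accession \
        --inputfile "$1" \
        --include gff3 \
        --filename "${2:-ncbi_dataset.zip}"
}

# Example usage (needs network access):
#   download_gff3 accessions.txt gff3.zip && unzip gff3.zip -d gff3
```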
This command will download a zip file with the following structure (in this example, I downloaded the gff3 for all dog genomes and unzipped it to the folder dog):
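The directory listing is not reproduced in this copy. As a rough guide, and assuming the layout used by recent Datasets releases (`ncbi_dataset/data/<accession>/genomic.gff` inside the unzipped folder), the per-assembly files could be collected under accession-based names like this; `collect_gffs` is a hypothetical helper.

```shell
#!/bin/sh
# Copy each <pkg>/ncbi_dataset/data/<accession>/genomic.gff to <out>/<accession>.gff.
# Assumption: the unzipped package follows the layout named above.
collect_gffs() {                                  # args: package dir, output dir
    mkdir -p "$2"
    for gff in "$1"/ncbi_dataset/data/*/genomic.gff; do
        [ -f "$gff" ] || continue                 # glob did not match anything
        acc=${gff%/genomic.gff}                   # strip the file name
        acc=${acc##*/}                            # keep the accession directory
        cp "$gff" "$2/$acc.gff"
    done
}

# Example usage:
#   collect_gffs dog dog_gffs
```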
If the virus genome sequence is annotated and has a genome assembly accession number, the GFF3 file should be available through the genome service in the datasets CLI. For example, the SARS-CoV-2 reference genome has the nucleotide accession number NC_045512.2 and the genome assembly accession GCF_009858895.2.
If you want to download the GFF3 file for a genome, you first need the list of genome accessions, as others have said above. However, if you only want the GFF3 for a gene rather than a whole genome, NCBI has added a --Download function, which can download GFF3 or other formats such as BED, CSV and VCF.
Number 2 is what I was looking for. Thanks!