Question

Retrieve GFF3 file from NCBI

5

7.5 years ago

john ▴ 130

I want to download the annotation file in GFF3 format for the corresponding genome. While this is fairly easy on the NCBI web page, I can't find a way to do the same with efetch or similar tools.

I hoped I could use something like this:

esearch -db nuccore -query "$genome_id" | efetch -format gff3  > "$path_data/$genome_id.gff"

gff3 ncbi efetch • 19k views

ADD COMMENT • link updated 18 months ago by MirianT_NCBI ▴ 800 • written 7.5 years ago by john ▴ 130

15

7.3 years ago

ucpete ▴ 150

Unfortunately, GFF3 still hasn't been added to NCBI's E-utilities as a valid return type, despite having been added to the web tool a year or more ago. That said, we can take advantage of the web-based GFF retrieval tool directly – after inspecting network traffic while pulling GFFs from the NCBI web portal and playing around with the parameters, I was able to reverse engineer how to retrieve a GFF file given an accession number. The results can be retrieved using your favorite file retrieval tool (wget, cURL, etc.). Here's how I do it using wget:

wget -O /path/to/your.gff "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=<acc[.ver]>"

[<acc[.ver]> in the example query string above should be replaced with your accession or accession.version, e.g. KC145265.1.]

N.B.: It's relatively straightforward to pull multiple GFFs from separate entries using a comma-separated list of identifiers, but I haven't stress tested this, nor have I slammed NCBI with so many queries that NCBI would feel compelled to block this type of web request. Here's a multi-identifier example:

wget -O Human_picobirnavirus.gff "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=NC_007026.1,NC_007027.1"
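For a longer list of accessions, one request per accession with a pause in between may be gentler on NCBI than a single large comma-separated query. The sketch below assumes a text file with one accession per line. The function names, the output directory argument, and the one-second delay are illustrative choices, not anything NCBI documents for this endpoint.

```shell
#!/usr/bin/env bash
# Build the sviewer URL shown above for one accession (or accession.version).
sviewer_gff3_url() {
    printf 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=%s\n' "$1"
}

# Download one GFF3 file per accession listed in a text file (one per line).
# Assumption: the 1-second pause is an arbitrary courtesy delay.
fetch_gff3_list() {
    local list="$1" outdir="${2:-.}" acc
    mkdir -p "$outdir"
    while IFS= read -r acc; do
        if [ -z "$acc" ]; then continue; fi   # skip blank lines
        wget -q -O "$outdir/$acc.gff" "$(sviewer_gff3_url "$acc")"
        sleep 1
    done < "$list"
}

# Example call (not run here): fetch_gff3_list accessions.txt gffs
```

Since this relies on an undocumented web endpoint, it is worth checking that each downloaded file actually starts with a ##gff-version line before using it.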

ADD COMMENT • link 7.3 years ago by ucpete ▴ 150

0

Nice Solution to this problem!

ADD REPLY • link 5.5 years ago by microfuge ★ 2.0k

1

5.2 years ago

rohitsatyam102 ▴ 940

To download GFF files in batch, prepare a list of accession numbers and go to Batch Entrez. From the dropdown menu choose "Assembly", upload the accession number list, and search. To retrieve the GFFs, click "Download Assemblies" and choose file type GFF. This downloads the GFF files zipped separately for each accession number. The files come named after their projects; if you want them named in accession_name.gff format instead, here is a simple trick: list all the unzipped files in a list.txt file and use the following code.

while read -r p; do name=$(head -n 8 "$p" | tail -n 1 | cut -f 1); mv "$p" "${name}.gff"; done < list.txt

Tada! Now you have your GFF3 files in your desired name format.
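The one-liner above takes the name from line 8 of each file, which depends on how many header lines the file carries. A slightly more defensive sketch reads the sequence ID from the first line that does not start with "#". The function names are hypothetical, and list.txt is the same file list as above.

```shell
# Print column 1 (the sequence ID) of the first non-comment,
# non-empty line of a GFF3 file.
first_seqid() {
    awk -F '\t' '!/^#/ && NF { print $1; exit }' "$1"
}

# Rename every file named in a list to <seqid>.gff in its own directory,
# skipping files where no feature line was found.
rename_by_seqid() {
    local p name
    while IFS= read -r p; do
        name=$(first_seqid "$p")
        if [ -n "$name" ]; then
            mv "$p" "$(dirname "$p")/${name}.gff"
        fi
    done < "$1"
}

# Example call (not run here): rename_by_seqid list.txt
```

Note that an assembly with several sequences gets named after the first one only, just as with the original one-liner.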

ADD COMMENT • link 5.2 years ago by rohitsatyam102 ▴ 940

1

2.6 years ago

MirianT_NCBI ▴ 800

Hi,
You can use the NCBI Datasets command line tool. You can use your list of accessions as an input and customize your data package to only include the GFF3 file, like this:

datasets download genome accession --inputfile accessions.txt --include gff3

This command will download a zip file with the following structure (in this example, I downloaded the gff3 for all dog genomes and unzipped it to the folder dog):

dog
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- GCF_000002285.5
        |   `-- genomic.gff
        |-- GCF_005444595.1
        |   `-- genomic.gff
        |-- GCF_011100685.1
        |   `-- genomic.gff
        |-- GCF_013276365.1
        |   `-- genomic.gff
        |-- GCF_014441545.1
        |   `-- genomic.gff
        |-- assembly_data_report.jsonl
        `-- dataset_catalog.json

You can rename all the GFF3 files like this (from the folder ncbi_dataset):

mkdir -p gff3
for f in data/*/genomic.gff; do
   out=$(echo "$f" | cut -f2 -d'/')   # accession = second path component
   cp "$f" "gff3/${out}.gff"
done

Same instructions from this post.

Feel free to reach out if you have any questions or run into any issues. :) I hope it helps!

ADD COMMENT • link 2.6 years ago by MirianT_NCBI ▴ 800

0

Sadly GFF3 is not available for viruses, only a JSON annotation file

ADD REPLY • link 20 months ago by Cornelius ▴ 80

0

If the virus genome sequence is annotated and has a genome assembly accession number, the GFF3 file is available through the genome service in the datasets CLI. For example, the SARS-CoV-2 reference genome has the nucleotide accession number NC_045512.2 and the genome assembly accession GCF_009858895.2.

$ datasets download genome accession GCF_009858895.2 --include gff3
Collecting 1 genome record [================================================] 100% 1/1
Downloading: ncbi_dataset.zip    5.37kB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4

$ unzip ncbi_dataset.zip -d virus-gff3
Archive:  ncbi_dataset.zip
  inflating: virus-gff3/README.md    
  inflating: virus-gff3/ncbi_dataset/data/assembly_data_report.jsonl  
  inflating: virus-gff3/ncbi_dataset/data/GCF_009858895.2/genomic.gff  
  inflating: virus-gff3/ncbi_dataset/data/dataset_catalog.json

ADD REPLY • link 18 months ago by MirianT_NCBI ▴ 800

0

2.6 years ago

jamesdong • 0

If you want to download the GFF3 file for a whole genome, you first need the list of genome accessions, as others said above. If you only want the GFF3 for a particular gene rather than the genome, NCBI has added a Download function, which can export the annotation as GFF3 or in other formats such as BED, CSV, and VCF.

https://www.ncbi.nlm.nih.gov/tools/sviewer/seqtrackdata/

ADD COMMENT • link 2.6 years ago by jamesdong • 0

score 2 · Accepted Answer · 2018-02-05

There are a couple of strategies you can try, depending on what you mean by $genome_id. In each case, it's a matter of finding the right FTP path, and then using wget to get the *genomic.gff.gz file in that path:

If you have assembly accessions, you can get FTP paths for each from the assembly_summary.txt file, and loop through them with wget. See Download All The Bacterial Genomes From Ncbi for a good post on the approach.

If you have nucleotide sequence accessions for chromosomes, you can use esearch to directly query the Assembly database, and get the FTP path from the document summary:

esearch -db assembly -query NC_000913.3 | esummary | xtract -pattern DocumentSummary -element Taxid,Organism,AssemblyAccession,FtpPath_RefSeq

If you have nucleotide sequence accessions that don't directly work for queries in the Assembly database (e.g. contigs or scaffolds), you can query in nucleotide first and link to assembly:

esearch -db nuccore -query NZ_GL379776.1 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element Taxid,Organism,AssemblyAccession,FtpPath_RefSeq
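Once xtract has printed the FTP path, the annotation file name follows from it: in the NCBI genomes FTP layout, the files in an assembly directory are prefixed with the directory name, so the GFF is <dirname>_genomic.gff.gz. Here is a minimal sketch of that last step. The function name is made up for illustration, and swapping ftp:// for https:// is an assumption about what your network prefers (NCBI serves the same tree over both).

```shell
# Turn an assembly FTP directory (the FtpPath_RefSeq value) into the URL
# of its *_genomic.gff.gz file.
gff_url_from_ftp_path() {
    local path="${1%/}"                           # drop a trailing slash, if any
    local url="$path/${path##*/}_genomic.gff.gz"  # <dir>/<dirname>_genomic.gff.gz
    case "$url" in
        ftp://*) url="https://${url#ftp://}" ;;   # assumption: prefer HTTPS
    esac
    printf '%s\n' "$url"
}

# Example pipeline (not run here):
# esearch -db assembly -query NC_000913.3 | esummary \
#   | xtract -pattern DocumentSummary -element FtpPath_RefSeq \
#   | while read -r p; do wget "$(gff_url_from_ftp_path "$p")"; done
```

If an assembly has no RefSeq copy, FtpPath_RefSeq comes back empty and the resulting URL is meaningless, so it is worth skipping empty lines before calling wget.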

score 1 · Accepted Answer · 2018-02-04

1

7.5 years ago

lieven.sterck 15k

I don't think NCBI offers GFF-formatted files through efetch (yet). Probably the best you can do is either to do it manually on the website (if you don't have many to do) or to efetch GenBank format and convert that to GFF.

Otherwise, depending on the organism(s) you look for, there might be 'dedicated' databases that offer direct gff download.