Question

Where have the bacterial genomes gone in Genbank ftp?

2

8.9 years ago

briony ▴ 20

The Genbank ftp site (ftp://ftp.ncbi.nih.gov/genomes/) used to contain a folder called Bacteria, with all the bacterial genomes, but it seems to have disappeared. Does anyone know where these might have been moved to? I can't find anything about it on the NCBI site, and I need to access some gff files ASAP.

bacteria GenBank genomes • 6.2k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by briony ▴ 20

0

Yes, every link I tried shows: "The webpage at ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ might be temporarily down or it may have moved permanently to a new web address." Why is that? Could it be a Wi-Fi or proxy error?

ADD REPLY • link 4.8 years ago by hirakuda • 0

0

Do these links work from anywhere in the world outside the USA?

ADD REPLY • link 4.8 years ago by hirakuda • 0

0

They should work around most of the world. Fair warning: these links have large folder listings. We don't know what happens if you are in one of the small number of countries that have US export restrictions in place.

You can try using the Ensembl Bacteria site as an alternative.

ADD REPLY • link 4.8 years ago by GenoMax 147k

0

Thanks. I got it from the last link.

ADD REPLY • link 4.8 years ago by hirakuda • 0

0

The files I downloaded from there are in .fz.tgz format. I searched Google but only found how to extract tgz files. Can anyone please tell me what an fz.tgz file is and how to extract it on Linux?

ADD REPLY • link 4.8 years ago by hirakuda • 0

0

What fz.tgz files are you talking about? Can you post a link for one?

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

ADD REPLY • link 4.8 years ago by GenoMax 147k

0

Can you please give me the newest link for the bacterial genome FASTA file, and also the way to download it from the FTP folder on Linux? All the links given say they can't be reached. Some people said this may be because my email ID is changing the URL every time for security reasons.

ADD REPLY • link 4.8 years ago by hirakuda • 0

0

Please do not add comments as new answers on existing threads.

There is no single/newest link for a bacterial genome FASTA file. Each bacterial genome has its own FASTA file. Decide which genome you want to use first, and then download the file for just that genome. Here is a direct link to an example file for the Escherichia coli genome.

ADD REPLY • link 4.8 years ago by GenoMax 147k

0

Thank you all. FTP was blocked on my Wi-Fi. I have solved that and can now access the files smoothly. Now, in the https://bacteria.ensembl.org/info/website/ftp/index.html portal, each bacterium has many fa.gz files, such as dna.nonchromosomal, dna.toplevel, dna_rm.nonchromosomal, etc. Any idea which file I should use?

ADD REPLY • link 4.8 years ago by hirakuda • 0

0

If the FTP issues have been solved, then follow my directions here: Do ncbi public site does not have bacteria genome?

If you want to use the Ensembl site, then get the file that says dna.toplevel. An explanation of what those files are is in the README file you see in those directories.

ADD REPLY • link 4.8 years ago by GenoMax 147k

2

4.8 years ago

Istvan Albert 102k

The new mode of operation for all organisms is to get the assembly_summary_refseq.txt from:

ftp://ftp.ncbi.nih.gov/genomes/refseq/

Now if you run:

URL=ftp://ftp.ncbi.nih.gov/genomes/refseq/assembly_summary_refseq.txt
curl -s "$URL" | cut -f 1,6,7,20 | head

it will print:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    taxid   species_taxid   ftp_path
GCF_000001215.4 7227    7227    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT
GCF_000001405.39    9606    9606    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13
GCF_000001635.26    10090   10090   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6
GCF_000001735.4 3702    3702    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1

The 20th column (ftp_path) contains the directory in which the data is deposited, for example:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13

The data there is distributed in various formats. To get the GFF file you can run:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
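
The file name in that command follows from the directory name: it is the last path component plus `_genomic.gff.gz`. Here is a minimal sketch of that pattern, assuming it holds for other assemblies the way it does for the one above. `gff_url` is a name made up for this example, not part of any NCBI tool.

```shell
#!/bin/sh
# gff_url: hypothetical helper (illustration only).
# Given an assembly directory (the ftp_path column), print the URL of its GFF file.
# Assumes the naming pattern <dir>/<last path component>_genomic.gff.gz seen above.
gff_url() {
    dir=${1%/}    # drop one trailing slash, if present
    printf '%s/%s_genomic.gff.gz\n' "$dir" "${dir##*/}"
}

# Usage (needs network access): print the URL and hand it to wget on stdin.
# gff_url "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13" | wget -i -
```

Keeping the helper separate from the download makes it easy to check the generated URL before fetching a large file.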

Now, to get bacterial genomes, say E. coli, you can filter by taxid or other fields of interest:

curl -s "$URL" | grep 562 | cut -f 1,6,7,8,20 | head

It will print:

GCF_000005845.2 511145  562 Escherichia coli str. K-12 substr. MG1655   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2
GCF_000006665.1 155864  562 Escherichia coli O157:H7 str. EDL933    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/665/GCF_000006665.1_ASM666v1
GCF_000007445.1 199310  562 Escherichia coli CFT073 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/765/GCF_000008765.1_ASM876v1
GCF_000008865.2 386585  562 Escherichia coli O157:H7 str. Sakai
 ...
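
One caveat with `grep 562`: it matches any line that contains those digits anywhere, for example inside an accession or a longer taxid. A stricter sketch compares the species_taxid field exactly with awk. It assumes the column layout from the header shown earlier: taxid in column 6, species_taxid in column 7, organism name in column 8, and ftp_path in column 20. The name `by_species_taxid` is made up for this example.

```shell
#!/bin/sh
# by_species_taxid: hypothetical helper (illustration only).
# Reads an assembly summary on stdin and prints accession, taxid,
# species_taxid, organism name and ftp_path (tab-separated) for rows
# whose species_taxid (column 7) equals the id passed as $1.
by_species_taxid() {
    awk -F '\t' -v OFS='\t' -v id="$1" '
        /^#/     { next }                          # skip comment and header lines
        $7 == id { print $1, $6, $7, $8, $20 }     # exact match on species_taxid
    '
}

# Usage (needs network access):
# curl -s "$URL" | by_species_taxid 562 | head
```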

ADD COMMENT • link 4.8 years ago by Istvan Albert 102k

0

For additional info, this is effectively what Kai Blin's tool ncbi-genome-download ( https://github.com/kblin/ncbi-genome-download ) is doing, so if you would prefer not to get your hands dirty with the assembly summary etc. you can alternatively use that.

There are a number of other threads on the forum with examples of its usage (searching ncbi-genome-download or ngd will likely turn up results).

ADD REPLY • link 4.8 years ago by Joe 21k

0

Also, there are several Lactobacillus gasseri entries with different gene IDs for the same subspecies. Why is that? How do I know which gene ID is useful for me?

ADD REPLY • link 4.8 years ago by hirakuda • 0

0

Every strain's genome will have its own set of IDs. If there are multiple genomes of Lactobacillus gasseri available, then choose the one genome you want to use. A representative example is this.

ADD REPLY • link 4.8 years ago by GenoMax 147k

0

In the absence of any other information, use the RefSeq entry for it (if there is one). Otherwise, only you can decide which one is most useful to you.

In practice, any one of them will probably be fine as they may just be different isolates of the same thing, or subtly different strains of a given sub-species.
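
One way to narrow the choice, sketched under assumptions: to my knowledge the assembly summary file has a refseq_category column (column 5), set to `reference genome` or `representative genome` for the assemblies NCBI has singled out and to `na` otherwise. Check README_assembly_summary.txt for the current column definitions before relying on this. The helper name `preferred_assemblies` is made up for this example.

```shell
#!/bin/sh
# preferred_assemblies: hypothetical helper (illustration only).
# Reads an assembly summary on stdin and prints accession, refseq_category,
# organism name and ftp_path for rows of the given species_taxid ($1)
# whose refseq_category (assumed to be column 5) is anything other than "na".
preferred_assemblies() {
    awk -F '\t' -v OFS='\t' -v id="$1" '
        /^#/ { next }                                      # skip comment and header lines
        $7 == id && $5 != "na" { print $1, $5, $8, $20 }
    '
}

# Usage (needs network access; replace the id with your species taxid):
# curl -s "$URL" | preferred_assemblies 562
```

If this prints nothing for a species, no assembly is flagged, and the choice falls back to your own criteria as described above.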

ADD REPLY • link 4.8 years ago by Joe 21k

Ram · Accepted Answer · 2016-01-19

6

8.9 years ago

Andrzej Zielezinski 11k

Assembled genome sequence and annotation data for GenBank genome assemblies is now (from 02-DEC-2015) available under: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Andrzej Zielezinski 11k

3

And RefSeq is here: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/

Be careful when clicking on those links in a browser: these are huge directories.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Michael 55k