Question

Downloading bacterial genomes: correspondence between ftp ncbi identifiers and the "nucleotide" database

0

3.6 years ago

Debut ▴ 20

Hello, I am a beginner in bioinformatics. For my internship I have to retrieve all the Klebsiella genome sequences from NCBI, and I have to use Biopython. I absolutely need the full set listed at https://www.ncbi.nlm.nih.gov/genome/browse/#!/prokaryotes/815/ under "Genome Assembly and Annotation report (10703)". I took the identifiers from https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt and tried to write a Biopython script to retrieve the sequences, but the script doesn't work. I guess the identifiers on the FTP site and in the nucleotide database are not the same. Is there some kind of correspondence between the nuccore (nucleotide) identifiers and the ones on the FTP site?

I also looked at https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/IDS/Bacteria.ids, but it contains only 31 identifiers, not all of them. Thanks a lot. Here is the Biopython code:

from Bio import SeqIO
from Bio import Entrez

list_id = []
file = open("listId.txt", "r")
readlineFile=file.readline()
print(readline)
for line in file:
    file.readline()
    list_id.append(line)
print(List_id)
fic_seq = Entrez.efetch(db="nucleotide", id="list_id", rettype="gb")
my_seq=SeqIO.parse(fic_seq,"gb")
for seq in my_seq :
    print (seq)
my_seq=SeqIo.parse(fic_seq,"gb")
SeqIO.write(my_seq, "out.fasta", "fasta")
fic_seq.close()

python ncbi biopython • 1.6k views

ADD COMMENT • link updated 3.6 years ago by vkkodali_ncbi ★ 3.8k • written 3.6 years ago by Debut ▴ 20

0

Try executing your efetch successfully with a single ID. Once done, expand to working with multiple IDs the right way. Right now, you're using the string "list_id" as an ID, where you need to be using every member in the object list_id. And of course, ensure the IDs are ones you can use for retrieval as well.
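A minimal sketch of what that could look like, assuming listId.txt holds one nucleotide accession per line and that Biopython is installed. The helper names read_ids and fetch_genbank and the placeholder e-mail address are illustrative and not part of the original script.

```python
def read_ids(path):
    """Return the non-empty lines of `path`, stripped of whitespace."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_genbank(ids, out_path, email):
    """Fetch GenBank records for `ids` from nuccore and write them as FASTA."""
    # Imported here so read_ids() can be used without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address
    # Pass the IDs themselves (comma-separated), not the string "list_id".
    handle = Entrez.efetch(db="nucleotide", id=",".join(ids),
                           rettype="gb", retmode="text")
    try:
        # Parse once and stream the records straight into the FASTA file;
        # SeqIO.write returns the number of records written.
        return SeqIO.write(SeqIO.parse(handle, "gb"), out_path, "fasta")
    finally:
        handle.close()


# Example (needs network access and valid nucleotide accessions):
# ids = read_ids("listId.txt")
# fetch_genbank(ids[:1], "out.fasta", "you@example.org")  # one ID first
# fetch_genbank(ids, "out.fasta", "you@example.org")      # then all of them
```

Testing with ids[:1] first shows quickly whether the identifiers in the file are ones the nucleotide database recognises before a long request is sent.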

ADD REPLY • link 3.6 years ago by Ram 44k

0

Thank you for your answer. Yes, even when I use a single identifier it doesn't work, because the identifiers in the nucleotide database and in the FTP file are different. I am looking for a way to link the nucleotide database identifiers to the identifiers in the FTP file.

ADD REPLY • link 3.6 years ago by Debut ▴ 20

0

When I use an identifier from the first column of https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/IDS/Bacteria.ids, which corresponds to the nucleotide database identifier, it works.

ADD REPLY • link 3.6 years ago by Debut ▴ 20

score 1 · Answer 1 · 2021-06-04

1

3.6 years ago

vkkodali_ncbi ★ 3.8k

I suggest using NCBI Datasets. You can use either the command-line tool or the Python library. If you choose the command-line tool, first extract the list of GCA/GCF accessions from https://www.ncbi.nlm.nih.gov/genome/browse/#!/prokaryotes/815/ and then use that list as input to download the GenBank files for all assemblies.

ADD COMMENT • link 3.6 years ago by vkkodali_ncbi ★ 3.8k

0

Thank you for your answer. I am trying to write this in Biopython, but the nuccore identifiers of the "nucleotide" database and the "GCA..." identifiers are not the same. Entrez.efetch() needs the database's own identifiers as arguments, and my code does not do any conversion between the two.

ADD REPLY • link 3.6 years ago by Debut ▴ 20

1

If I understand correctly, your starting point is this table: https://www.ncbi.nlm.nih.gov/genome/browse/#!/prokaryotes/815/ From here, you want to download all the genomes in GenBank format. And you need to use biopython for that.

Entrez.efetch, as you have noted, does not accept the GCF/GCA genome assembly identifiers. It only accepts individual nucleotide identifiers. From your starting table, which you can download as a TSV file from the web, you should be able to extract all of the nucleotide identifiers in the "Replicons" column. Based on the accession format specified here, regular expressions can pull the correct set of identifiers out of that column. At that point, you can use Entrez.efetch with the nucleotide database to download the data.
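As a rough sketch of the regular-expression step: the pattern below is an assumption that covers only common RefSeq-style (NC_016845.1) and GenBank-style (CP003200.1) accessions, not every NCBI accession format, and the example Replicons value is illustrative. The function name extract_accessions is hypothetical.

```python
import re

# Assumed patterns (a subset of NCBI accession formats):
#   RefSeq-style:  two letters, underscore, 6-9 digits, version  e.g. NC_016845.1
#   GenBank-style: one or two letters, 5-8 digits, version       e.g. CP003200.1
ACCESSION_RE = re.compile(r"\b(?:[A-Z]{2}_\d{6,9}|[A-Z]{1,2}\d{5,8})\.\d+\b")


def extract_accessions(replicons, refseq_only=False):
    """Return the accession.version strings found in one Replicons cell."""
    found = ACCESSION_RE.findall(replicons)
    if refseq_only:
        # RefSeq-style accessions are the ones containing an underscore.
        found = [acc for acc in found if "_" in acc]
    return found


# Illustrative cell content; the real column layout may differ.
example = "chromosome:NC_016845.1/CP003200.1; plasmid pKPHS1:NC_016838.1/CP003223.1"
print(extract_accessions(example))
print(extract_accessions(example, refseq_only=True))
```

When a replicon is listed under both a RefSeq and a GenBank accession, keeping only one kind (here via refseq_only) avoids downloading the same sequence twice.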

I must warn you that this is a very inefficient way of downloading large amounts of data from NCBI. Entrez.efetch is not designed for this sort of thing, and although it will work, it will take a long time to download >10k genomes. For genome-level and multiple-genome downloads, you are better off either using the FTP URLs in the GENOME_REPORTS file or, preferably, the NCBI Datasets tool.
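A sketch of the FTP route, under stated assumptions: that prokaryotes.txt is tab-separated with columns named "#Organism/Name" and "FTP Path" (check the header of your copy), and that each assembly directory contains a file named after the directory with the suffix _genomic.gbff.gz. The function names and the sample path in the comment are illustrative.

```python
import csv
import io


def gbff_url(ftp_path):
    """Build the URL of the *_genomic.gbff.gz file for one assembly directory."""
    base = ftp_path.rstrip("/")
    name = base.rsplit("/", 1)[-1]  # e.g. GCA_000240185.2_ASM24018v2
    # Swap the scheme so the file can be fetched over HTTPS.
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    return f"{base}/{name}_genomic.gbff.gz"


def klebsiella_urls(report_text, genus="Klebsiella",
                    name_col="#Organism/Name", path_col="FTP Path"):
    """Yield GenBank flat-file URLs for rows whose organism starts with `genus`."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        name = row.get(name_col) or ""
        path = (row.get(path_col) or "").strip()
        # Skip other organisms and rows without a usable path.
        if name.startswith(genus) and path not in ("", "-"):
            yield gbff_url(path)
```

Each yielded URL can then be saved with urllib.request.urlretrieve(url, filename) from the standard library, and the resulting .gbff.gz files can be opened with gzip.open(filename, "rt") and parsed with SeqIO.parse(handle, "genbank") if Biopython is still required downstream.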

ADD REPLY • link 3.6 years ago by vkkodali_ncbi ★ 3.8k