Get just GenBank record while downloading genome with Biopython
5.9 years ago
Shred ★ 1.5k

I wrote a script that downloads genomes in GenBank (gbk) format from NCBI by querying with specific keywords. What I want is the full annotated genome. Currently I'm querying the "nucleotide" database, and in my specific case I get two results: the RefSeq record and the GenBank one. I expected just one record, because there is only one reference genome for the queried organism. From what I've read on the NCBI website, in this case the RefSeq record is just a pointer to the GenBank one (source) and contains no sequence. So here's the question: is there a way to download only the GenBank record that contains the sequence, discarding the useless records? Here's my code:

from Bio import SeqIO
from Bio import Entrez

# NCBI asks for a contact email on E-utilities requests
Entrez.email = "mail@gmail.com"
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title]"

# Search the nucleotide database and collect the matching record IDs
handle = Entrez.esearch(db="nucleotide", term=search_term)
genome_ids = Entrez.read(handle)['IdList']
handle.close()

for genome_id in genome_ids:
    # Fetch each record as a GenBank flat file and save it to disk
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")
    filename = 'GenBank_Record_{}.gbk'.format(genome_id)
    print('Writing:{}'.format(filename))
    with open(filename, "w") as f:
        f.write(record.read())
    record.close()
print(genome_ids)
Biopython Entrez Genome • 1.8k views
5.9 years ago
vkkodali_ncbi ★ 3.8k

Change your search_term to include a GenBank or RefSeq filter, as shown below, to get only GenBank or only RefSeq sequences, respectively:

## GenBank sequences only
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title] AND genbank[filter]"
## RefSeq sequences only
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title] AND refseq[filter]"
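To switch between the two filters without editing the string by hand, something like the following could work. This is only a sketch: `build_search_term` and its `source` parameter are hypothetical names, not part of Biopython, and the term layout simply mirrors the strings above.

```python
def build_search_term(organism, source=None):
    """Build an Entrez search term, optionally restricted to one source database.

    `organism` is used as given (e.g. "Bifidobacterium+bifidum+PRL2010").
    `source` may be "genbank", "refseq", or None for no filter.
    """
    term = "{}[organism] AND complete+genome[title]".format(organism)
    if source is not None:
        if source not in ("genbank", "refseq"):
            raise ValueError("source must be 'genbank', 'refseq', or None")
        # Append the filter exactly as in the hand-written search terms
        term += " AND {}[filter]".format(source)
    return term


# Example: GenBank sequences only
search_term = build_search_term("Bifidobacterium+bifidum+PRL2010", source="genbank")
```

The resulting `search_term` can be passed to `Entrez.esearch` exactly as in the original script.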

If you are fetching a large number of sequences, consider using an E-utilities API key to avoid HTTP 429 (too many requests) errors.
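In Biopython, the key is set through the `Entrez.api_key` module attribute, alongside `Entrez.email`. A minimal sketch follows; `configure_entrez` is a hypothetical helper and `NCBI_API_KEY` is an assumed environment variable name, chosen so the key stays out of the script itself.

```python
import os


def configure_entrez(entrez, email, api_key=None):
    """Set the contact email and, if provided, the API key on an Entrez-like module.

    `entrez` is expected to be Bio.Entrez; it is passed in so this helper
    does not import Biopython itself.
    """
    entrez.email = email
    if api_key:
        # Without a key NCBI allows about 3 requests per second, with a key about 10
        entrez.api_key = api_key
    return entrez


# Usage (assumes Biopython is installed and NCBI_API_KEY is set in the environment):
#   from Bio import Entrez
#   configure_entrez(Entrez, "mail@gmail.com", os.environ.get("NCBI_API_KEY"))
```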


Thanks for the API recommendation. Adding the GenBank filter works, but in terms of annotation this could be a problem, because reference genomes are by default more accurate than standard GenBank submissions. I'm implementing a for loop to iterate over the downloaded records and discard sequence-free files. It's crazy how confusing submissions are in bioinformatics.
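One way such a check could look, as a rough sketch: in a GenBank flat file the sequence block starts with a line beginning with `ORIGIN`, so a record without that line carries no sequence. The name `has_sequence` is hypothetical, and the check assumes the text is a single plain-text GenBank record.

```python
def has_sequence(genbank_text):
    """Return True if a GenBank flat-file record contains a sequence block.

    The sequence section of a flat file begins with a line that starts with
    "ORIGIN" at column 0; records that only point to other records lack it.
    """
    return any(line.startswith("ORIGIN") for line in genbank_text.splitlines())


# Example with two tiny, made-up records
with_seq = "LOCUS       X\nORIGIN\n        1 acgt\n//\n"
without_seq = "LOCUS       X\nCONTIG      join(AB123.1:1..100)\n//\n"
print(has_sequence(with_seq), has_sequence(without_seq))
```

In the download loop, the text returned by `record.read()` could be passed through this check before writing the file.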


"I'm implementing a for loop to iterate over the downloaded records and discard sequence-free files."

Change your rettype to gbwithparts, and all RefSeq flat files will be downloaded with their contig sequences included.
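Applied to the original script, the only change is the `rettype` argument. The sketch below wraps the fetch in a hypothetical helper, `fetch_genbank_with_parts`, which takes the fetch function as a parameter; in practice that would be `Entrez.efetch`.

```python
def fetch_genbank_with_parts(genome_id, efetch):
    """Fetch one nucleotide record as a GenBank flat file with sequence included.

    `efetch` is expected to be Bio.Entrez.efetch (or anything with the same
    keyword interface that returns a readable, closable handle).
    """
    handle = efetch(db="nucleotide", id=genome_id,
                    rettype="gbwithparts", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


# Usage (assumes Biopython is installed and Entrez.email is set):
#   from Bio import Entrez
#   text = fetch_genbank_with_parts(genome_id, Entrez.efetch)
```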


Fine, that's what I've been looking for.
