Question

Get just GenBank record while downloading genome with Biopython

5.9 years ago

Shred ★ 1.5k

I wrote a script to download a genome in GenBank (gbk) format from NCBI by querying with specific keywords. What I want is the full annotated genome. Currently I'm querying the "nucleotide" database, and in my specific case I get two results: the RefSeq record and the GenBank one. I was expecting just one record, because there is only one reference genome for the organism queried. From what I've read on the NCBI website, in this case the RefSeq record just points to the GenBank one (source), with no sequence inside. So here's the question: is there a way to download only the GenBank record that contains the sequence, discarding the useless records? Here's my code:

from Bio import SeqIO
from Bio import Entrez

# NCBI asks for a contact address on every E-utilities request
Entrez.email = "mail@gmail.com"
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title]"

# Search the nucleotide database and collect the IDs of the matching records
handle = Entrez.esearch(db="nucleotide", term=search_term)
genome_ids = Entrez.read(handle)['IdList']
handle.close()

# Fetch each hit as a GenBank flat file and write it to disk
for genome_id in genome_ids:
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")
    filename = 'GenBank_Record_{}.gbk'.format(genome_id)
    print('Writing:{}'.format(filename))
    with open(filename, "w") as f:
        f.write(record.read())
    record.close()
print(genome_ids)

Biopython Entrez Genome

score 1 · Answer 1 · 2019-01-01

5.9 years ago

vkkodali_ncbi ★ 3.8k

Change your search_term to include a GenBank or RefSeq filter, as shown below, to restrict the results to GenBank or RefSeq sequences, respectively:

## GenBank sequences only
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title] AND genbank[filter]"
## RefSeq sequences only
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title] AND refseq[filter]"
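
As a rough sketch of how the filter could slot into the original script, the helper below appends the chosen filter to a base query before searching. The names `build_search_term` and `search_ids` are hypothetical and not part of Biopython; only `Entrez.esearch` and `Entrez.read` are library calls, and the e-mail address is the placeholder from the question.

```python
def build_search_term(base_term, source=None):
    """Append a genbank/refseq source filter to an Entrez query string."""
    if source is None:
        return base_term
    source = source.lower()
    if source not in ("genbank", "refseq"):
        raise ValueError("source must be 'genbank' or 'refseq'")
    return "{} AND {}[filter]".format(base_term, source)


def search_ids(base_term, source=None, email="mail@gmail.com"):
    """Run the (optionally filtered) search and return the matching IDs."""
    # Imported here so build_search_term stays usable without Biopython
    from Bio import Entrez

    Entrez.email = email
    term = build_search_term(base_term, source)
    handle = Entrez.esearch(db="nucleotide", term=term)
    try:
        return Entrez.read(handle)["IdList"]
    finally:
        handle.close()
```

Calling `search_ids(base_term, source="genbank")` with the query from the question should then return only the GenBank hits, while `source="refseq"` keeps only the RefSeq ones.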

If you are fetching a large number of sequences, consider using an E-utilities API key to avoid HTTP 429 (Too Many Requests) errors.
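
A minimal sketch of wiring an API key into Biopython, assuming the key is exported in an environment variable named `NCBI_API_KEY` (that variable name and the helper names are assumptions made for illustration). Recent Biopython releases read the key from the module-level `Entrez.api_key` setting.

```python
import os


def entrez_settings(env=None):
    """Collect Entrez settings, adding an API key only if one is exported."""
    env = os.environ if env is None else env
    settings = {"email": "mail@gmail.com"}  # placeholder address from the question
    key = env.get("NCBI_API_KEY")  # assumed variable name
    if key:
        settings["api_key"] = key
    return settings


def configure_entrez(env=None):
    """Apply the collected settings to Bio.Entrez and return the module."""
    from Bio import Entrez  # requires Biopython

    for name, value in entrez_settings(env).items():
        setattr(Entrez, name, value)
    return Entrez
```

Keeping the key in the environment rather than in the script avoids committing it by accident; without a key the requests still work, just at NCBI's lower rate limit.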

Thanks for the API key recommendation. Adding the GenBank filter works, but it could be a problem in terms of annotation, because reference genomes are by default more accurate than standard GenBank submissions. I'm implementing a for loop to iterate over the downloaded records and discard the sequence-free files. It's crazy how confusing submissions are in bioinformatics.

ADD REPLY • link 5.9 years ago by Shred ★ 1.5k
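
The filtering loop described in this reply could look roughly like the sketch below. It assumes that a record downloaded with `rettype="gb"` carries its bases in an `ORIGIN` block and that sequence-free records lack one; `has_sequence` is a hypothetical helper, not a Biopython function.

```python
def has_sequence(genbank_text):
    """Return True if any record in the text has bases in an ORIGIN block."""
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True       # sequence lines follow this keyword
        elif line.startswith("//"):
            in_origin = False      # end of the current record
        elif in_origin and line.strip():
            return True            # found at least one sequence line
    return False
```

In the download loop from the question, the fetched text can be checked before it is written, for example `text = record.read()` followed by `if has_sequence(text): f.write(text)`, so sequence-free files are never created in the first place.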

I'm implementing a for loop to iterate over the downloaded records and discard the sequence-free files.

Change your rettype to gbwithparts and all RefSeq flatfiles will be downloaded with contig sequences.
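
A short sketch of that change applied to the download step from the question. The helper names `efetch_params` and `download_record` are made up for illustration; `Entrez.efetch` is the Biopython call used in the original script, and `Entrez.email` is assumed to have been set beforehand.

```python
def efetch_params(genome_id, with_parts=True):
    """Build efetch arguments; gbwithparts requests the sequence as well."""
    return {
        "db": "nucleotide",
        "id": genome_id,
        "rettype": "gbwithparts" if with_parts else "gb",
        "retmode": "text",
    }


def download_record(genome_id, with_parts=True):
    """Fetch one record and write it to GenBank_Record_<id>.gbk."""
    from Bio import Entrez  # requires Biopython

    filename = "GenBank_Record_{}.gbk".format(genome_id)
    handle = Entrez.efetch(**efetch_params(genome_id, with_parts))
    try:
        with open(filename, "w") as out:
            out.write(handle.read())
    finally:
        handle.close()
    return filename
```

Combined with the `refseq[filter]` search term, this would keep the RefSeq annotation while still producing files that contain the sequence.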