Hi all, I wrote a script to download a genome in GenBank (.gbk) format from NCBI by querying with specific keywords. What I want is the full annotated genome. Currently I'm querying the "nucleotide" database, and in my specific case I get two results: the RefSeq record and the GenBank one. I was expecting just one record, because there is only one reference genome for the organism queried. From what I've read on the NCBI website, in this case the RefSeq record is just a pointer to the GenBank one (source), with no sequence inside. So here's the question: is there a way to download only the GenBank record that contains the sequence, and discard the useless records? Here's my code:
from Bio import SeqIO
from Bio import Entrez

# NCBI asks for a contact address with every E-utilities request
Entrez.email = "mail@gmail.com"
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title]"

# Search the nucleotide database and collect the matching record IDs
handle = Entrez.esearch(db="nucleotide", term=search_term)
genome_ids = Entrez.read(handle)['IdList']
handle.close()

# Fetch each record as a GenBank flatfile and write it to its own file
for genome_id in genome_ids:
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")
    filename = 'GenBank_Record_{}.gbk'.format(genome_id)
    print('Writing:{}'.format(filename))
    with open(filename, "w") as f:
        f.write(record.read())
    record.close()

print(genome_ids)
Thanks for the API recommendation. Adding a GenBank filter works, but in terms of annotation this could be a problem, because reference genomes are by default more accurate than standard GenBank submissions. I'm implementing a for loop that iterates over the downloaded records and discards the sequence-free files. It's crazy how confusing submissions are in bioinformatics.
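For anyone who wants the same filter, here is a minimal sketch of the check such a loop could use. It assumes that a flatfile carrying sequence has numbered base lines between the ORIGIN line and the closing "//", while a sequence-free record has no such lines. The function name `has_sequence` is made up for illustration and is not part of Biopython.

```python
import re

# A sequence line in a GenBank flatfile looks like:
#         1 gatcctccat atacaacggt atctccacct
_SEQ_LINE = re.compile(r"^\s*\d+\s+[A-Za-z ]+$")


def has_sequence(genbank_text):
    """Return True if the flatfile text has base lines after ORIGIN.

    Hypothetical helper: it only inspects the text layout and does not
    validate the record in any other way.
    """
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True       # sequence block starts here
        elif line.startswith("//"):
            in_origin = False      # end of the current record
        elif in_origin and _SEQ_LINE.match(line):
            return True
    return False
```

The downloaded files could then be read one by one and deleted whenever `has_sequence` returns False.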
Change your rettype to gbwithparts and all RefSeq flatfiles will be downloaded with contig sequences.

Fine, that's what I've been looking for.
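For reference, the suggested change could look roughly like the sketch below. The only difference from the download loop in the question is rettype="gbwithparts". The function name `download_with_parts` and its `efetch` and `out_dir` parameters are assumptions made for illustration: the fetch function is passed in so the loop can be tried without network access, and in real use it would be `Entrez.efetch` from Biopython, with `Entrez.email` set first.

```python
import os


def download_with_parts(genome_ids, efetch, out_dir="."):
    """Fetch each ID with rettype="gbwithparts" and write one .gbk file per ID.

    `efetch` is expected to behave like Bio.Entrez.efetch: it takes keyword
    arguments and returns a handle with read() and close().
    Returns the list of file paths written.
    """
    written = []
    for genome_id in genome_ids:
        handle = efetch(
            db="nucleotide",
            id=genome_id,
            rettype="gbwithparts",  # flatfile with the full sequence included
            retmode="text",
        )
        filename = os.path.join(out_dir, "GenBank_Record_{}.gbk".format(genome_id))
        try:
            with open(filename, "w") as f:
                f.write(handle.read())
        finally:
            handle.close()
        written.append(filename)
    return written
```

With Biopython installed, the call would be something like `download_with_parts(genome_ids, Entrez.efetch)`, using the `genome_ids` list from the esearch step.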