Database from .gbk files
4.5 years ago
graysonford ▴ 10

Hello all,

I would like to use a relatively old program called MultiGeneBlast (from 2015) to do some work. The database packaged with this software is also from 2015, and I believe many more bacterial genomes have been added to the NCBI website since then.

The software has a "create your own database" option (it downloads GenBank files from online and builds a database from them), but it no longer works, and the software only accepts .gbk and .embl files.

I have looked at other methods of downloading GenBank files, but on the NCBI website I can see the .gbk extension is now .gbff. Could I just download these and change the extension to .gbk or would I likely run into problems?

Furthermore, is there another way I could make this database? From the software's code, I think it writes the database files as either .pal or .tar.

Sorry if my questions are trivial; any help is much appreciated!

database gbk ncbi • 5.0k views
ADD COMMENT

Okay so after rerunning...

Incorporating genbank\GCA_001931635.1_ASM193163v1_genomic.gbk

Traceback (most recent call last):
  File "makedb.py", line 138, in <module>
  File "makedb.py", line 106, in main
  File "dblib\parse_gbk.pyc", line 721, in parse_gbk_embl
MemoryError

Same error, different accession number..

I didn't know how to monitor memory properly, so I just took multiple screenshots throughout (see below):
https://ibb.co/QF6VZ9v
https://ibb.co/x10drbm
https://ibb.co/kBL4HZx
https://ibb.co/y0zNn0L
https://ibb.co/42nzZLd
https://ibb.co/KDrVPRH
https://ibb.co/bLYVBN3

I still don't know if it is RAM or if the program is bugged however.

ADD REPLY

It is difficult to say but you have 16G of RAM. If this program is doing things in memory then it will not be enough for 160G of data. @Joe: Do you recall how many genomes you had used?

ADD REPLY

For myself 2,338 to be precise.

I will continue reading around.

Hopefully the developer(s) will respond to the query email I have sent them, which may shed some light on this.

ADD REPLY

Is this a Windows-only program? Perhaps you can move this to a proper Linux server?

ADD REPLY

Here is the link to the program: http://multigeneblast.sourceforge.net/

I am using Windows currently; I've never used Linux. My knowledge of computational things is limited :P

I will try to run this on my MacBook once it's repaired.

ADD REPLY

I use Linux and I experience the same problem

ADD REPLY

I tried to figure this out earlier Genomax, but I think I deleted the database/genomes I used a while back. I know I made a database for every member of the Enterobacteriaceae though, which should have been several hundred, and most likely several thousand, but I did do this on a server with >350GB RAM.

I think it's highly likely that you're going to struggle to make a database of that size without dedicated computing resources.

ADD REPLY

I tried the same again, and the error is ever so slightly different this time:

Traceback (most recent call last):
  File "makedb.py", line 138, in <module>
  File "makedb.py", line 106, in main
  File "dblib\parse_gbk.pyc", line 751, in parse_gbk_embl
  File "dblib\parse_gbk.pyc", line 900, in get_sequence
  File "dblib\utils.pyc", line 212, in get_accession
MemoryError

The developer also replied to my e-mail saying that 16GB RAM should be fine to do this, and to just make smaller databases if this continues.
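
One way to act on the "smaller databases" suggestion is to split the input folder into batches and build one database per batch. A minimal sketch, assuming all the .gbk files sit in one flat folder; the function name `split_into_batches`, the default batch size, and the `batch_001`-style folder names are illustrative choices, not anything defined by MultiGeneBlast:

```python
import shutil
from pathlib import Path


def split_into_batches(src_dir, dest_dir, batch_size=200):
    """Copy .gbk files from src_dir into numbered subfolders of dest_dir.

    Returns the list of batch folders created. batch_size is an arbitrary
    starting point (an assumption), not a limit documented by MultiGeneBlast.
    """
    files = sorted(Path(src_dir).glob("*.gbk"))
    batches = []
    for start in range(0, len(files), batch_size):
        # Folders are numbered from 1: batch_001, batch_002, ...
        batch_dir = Path(dest_dir) / f"batch_{start // batch_size + 1:03d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for f in files[start:start + batch_size]:
            shutil.copy2(f, batch_dir / f.name)
        batches.append(batch_dir)
    return batches
```

Each batch folder could then be passed to makedb as its own database, using the same `makedb dbname <folder>` form shown elsewhere in this thread. Copying doubles the disk space used; swapping `shutil.copy2` for `shutil.move` avoids that if the original layout does not need to be kept.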

ADD REPLY

It's quite possible there is a bug somewhere, as it's not the tidiest codebase I've ever seen, so there would no doubt be room to optimise the code.

I don't profess to know exactly what the database tool actually does at the stepwise level, or what format a database takes to be usable with the tool, but my gut feeling would be that 16GB RAM is still not enough.

I would start with some super small test cases and see if you can get it to complete at all (e.g. a dozen genomes).
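
A small test case like this can be set up by copying a handful of files into a separate folder. A minimal sketch, assuming the genomes are .gbk files in one folder; `make_test_subset` and its parameters are hypothetical names used only for illustration:

```python
import random
import shutil
from pathlib import Path


def make_test_subset(src_dir, dest_dir, n=12, seed=0):
    """Copy a reproducible random sample of n .gbk files into dest_dir.

    Returns the sorted names of the files that were copied.
    """
    files = sorted(Path(src_dir).glob("*.gbk"))
    rng = random.Random(seed)  # fixed seed so the same subset can be re-tested
    chosen = rng.sample(files, min(n, len(files)))
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for f in chosen:
        shutil.copy2(f, dest / f.name)
    return sorted(p.name for p in chosen)
```

If a dozen genomes build cleanly, the same function with a larger `n` gives a rough way to find where the build starts to fail.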

ADD REPLY

The developer also replied to my e-mail saying that 16GB RAM should be fine to do this.

Did you tell them exact size/number of files you are trying to use? I second @Joe's suggestion of trying this out with a smaller number of genomes first.

ADD REPLY

Yes I did.

I also encountered an error when I tried making a db from just the Bacillus genomes (765 items).

But using Streptomyces (~233 items) it worked, I believe, though I did see a lot of warnings (example below):

Warning: non-unique protein accession:tnpB
Warning: non-unique protein accession:tnpB
Warning: non-unique protein accession:tnpB
Warning: non-unique protein accession:tnpB
Warning: non-unique protein accession:tnpB

I think maybe I will try and see if anyone at my university knows how to make the db or has access to a better computer, because obviously I'm using my home desktop. I feel making a database (or databases) is the only major hurdle still.

ADD REPLY

750+ genomes is still quite a lot of data.

Those warnings are not a particular problem (which is why they're warnings and not errors).

If <250 worked, but >750 didn't, the only explanations I can see are that it's still a RAM problem, or that a subset of those genomes is troublesome for some reason, but I can't imagine what that would be.

You could perhaps try it with some/all of RefSeq instead. This will be less data, but the quality is significantly higher as they're curated.

ADD REPLY

I will try RefSeq next instead.

Thanks for all the advice / help you guys have given me so far!

ADD REPLY

No worries - sorry we can't be more directly useful. It's a pretty niche tool so you're just going to have to keep experimenting I think.

ADD REPLY

I downloaded all completed genomes from the RefSeq database, extracted all the files, and changed the extensions from .gbff to .gbk.

The makedb command managed to read all the files, got to the point of starting to build the database, and then crashed with the same error (screenshots below):
https://ibb.co/71ZcFB8
https://ibb.co/T0NxPLP

ADD REPLY

Further update, I have sent someone from my university the folder of 765 files that failed to make a database on my PC for them to try on the server.

Fingers crossed it works

ADD REPLY

If you can give me the command you were using to download the genomes and make the database, I can give it a go on our server too

ADD REPLY

That'd be awesome. I'll just summarize everything again. Ideally it'd be great to make a full database of completed bacterial genomes, but if that's not possible, it's okay.

ADD REPLY

The aim was to make a database of all completed bacterial genome sequences for multigeneblast:

Because Kai Blin's tool wasn't working at the time, I downloaded all completed genomes for bacteria in .gbff format from https://www.ncbi.nlm.nih.gov/assembly (~50GB compressed).

Now that the tool has been updated, I guess you could use this command:

ncbi-genome-download -F fasta -s genbank --retries 2 --parallel 4 --no-cache --verbose -o custom_db bacteria

I unzipped all the files and put all the text files into one folder (150GB once unzipped, ~18,000 files).

Afterwards I ran the Python script below so they'd have the correct extension for MGB to be able to read the files.

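import os

# Folder holding the unzipped GenBank flat files
os.chdir(r"E:\Custom_DB_bacillus\refseq\bacteria")

new_extension = '.gbk'

# Give every file in the folder the .gbk extension that MGB expects
for f in os.listdir():
    file_name, file_ext = os.path.splitext(f)
    os.rename(f, file_name + new_extension)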

Then, from the MultiGeneBlast directory, I ran this on the command line:

makedb dbname <folder name with input files>

Obviously, it didn't work, and it didn't work on my MacBook either.

I then wanted to slim down the number of files to see if it would work with fewer:

ncbi-genome-download -F genbank -s genbank -l complete --genera "Bacillus,Streptomyces,Streptococcus,Frankia,Actinobacteria,Paenibacillus,Longilinea,Propionibacterium,Clostridium,Actinosynnema,Chitinophaga,Pedobacter,Vibrio,Photorhabdus,Solibacter,Caldilinea,Oenococcus" --retries 2 --parallel 4 --no-cache --verbose -o E:\Custom_DB bacteria

This command retrieved approximately 2,300 files, and I encountered the same error making the db. Doing it for just Streptomyces worked (~230 files), but not for Bacillus (~700 files).

ADD REPLY
4.5 years ago
GenoMax 147k

Could I just download these and change the extension to .gbk or would I likely run into problems?

I would say you can try that, since gbff stands for GenBank flat file format. There are hundreds of bacterial sequences and your best bet is to use Kai Blin's ncbi-genome-download program. NCBI also has the new datasets option (LINK). You could also parse the assembly reports that NCBI makes available, though that is more work.

ADD COMMENT

Thank you!

I've been told about this before and was recommended to use a Mac to run the downloads.

ADD REPLY

You would want a Unix-like OS for sure. You will likely want to run on a server/cluster since the size of the data is going to be pretty big. A Mac may not be enough.

ADD REPLY

I am quite computer illiterate, not gonna lie :P

I will try and look into more and hopefully learn a thing or two :)

ADD REPLY

I second Genomax's suggestion of using Kai's tool.

I've used MGB quite a lot, and the database creation tools work OK. As far as I remember, the extensions shouldn't matter, so long as the file itself is in GenBank format, but don't quote me on this. If it does complain, you can change the extension to .gbk without any issues.
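
Before renaming anything, a quick check that a file really is a GenBank flat file can save some confusion. A minimal sketch, relying only on the fact that a GenBank record begins with a LOCUS line; `looks_like_genbank` is a hypothetical helper name, and this is a cheap sanity check rather than a validation of the whole format:

```python
def looks_like_genbank(path):
    """Return True if the first non-blank line of the file starts with 'LOCUS'."""
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.startswith("LOCUS")
    return False  # empty file or only blank lines
```

A file that passes this check and carries a .gbff extension should be safe to rename to .gbk, since only the name changes and not the content.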

ADD REPLY

Thanks for the response too Joe!

I'll have a go at it in the coming days and post an update later in the week :)

ADD REPLY

Hello again,

So, using Kai Blin's tool, I typed the following into the terminal:

ncbi-genome-download -F genbank -s genbank --retries 2 --parallel --no-cache --verbose -o /Volumes/GJF-500GB bacteria

This began downloading files and I left it overnight. This morning I found this error:

ERROR: Downloading from NCBI failed due to a connection error, retrying. Retries so far: 1
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/ncbi_genome_download/core.py", line 385, in downloadjob_creator_caller
    return create_downloadjob(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/ncbi_genome_download/core.py", line 397, in create_downloadjob
    checksums = grab_checksums_file(entry)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/ncbi_genome_download/core.py", line 465, in grab_checksums_file
    req = requests.get(full_url)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/requests/sessions.py", line 516, in request
    prep = self.prepare_request(req)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/requests/sessions.py", line 449, in prepare_request
    p.prepare(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/requests/models.py", line 314, in prepare
    self.prepare_url(url, params)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/requests/models.py", line 388, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'na/md5checksums.txt': No schema supplied. Perhaps you meant http://na/md5checksums.txt?

"""

I did everything from the command line and followed the info in the README. I believe this was also the problem I encountered when I tried it on my Windows PC.

Any solutions would be much appreciated.

Thank you

ADD REPLY

Looks, at a glance, like a network error has caused the code not to find a URL it was expecting.

Did it download any entries successfully? Sometimes the connection to NCBI can be flaky, so you may just need to try again.
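
The MissingSchema message shows the string 'na/md5checksums.txt' being requested, which suggests that 'na' was used where a directory URL was expected. Assuming that reading is right, a guard of the following kind would skip such entries instead of crashing. This is only an illustration of the check: `has_scheme` and `checksum_url` are hypothetical helpers, not functions from ncbi-genome-download.

```python
from urllib.parse import urlparse


def has_scheme(url):
    """True if the URL carries both a scheme and a host, e.g. 'https://host/path'."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


def checksum_url(base):
    """Build the md5checksums.txt URL for a directory URL.

    Returns None when base is not a usable URL, such as the literal 'na'
    seen in the traceback.
    """
    if not has_scheme(base):
        return None
    return base.rstrip("/") + "/md5checksums.txt"
```

With a check like this, an entry without a usable URL can be logged and skipped while the remaining downloads continue.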

ADD REPLY

Okay, I will try again; it may take a while!

I will update again tomorrow probably

ADD REPLY

If it is a persistent problem, it may need raising as a bug at the code repository.

ADD REPLY

It seems others are posting very similar usage errors on the github now:

This might be a genuine bug or NCBI has changed something and the code needs patching etc.

ADD REPLY

classic!

In the meantime I downloaded them all from NCBI and am currently using gunzip to extract all the files. Again, I will post updates!
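
For anyone without gunzip to hand (on Windows, for example), the same extraction can be done from Python and combined with the rename to .gbk. A minimal sketch, assuming the downloads are named `*.gbff.gz` and sit in one folder; `gunzip_to_gbk` is a hypothetical name:

```python
import gzip
import shutil
from pathlib import Path


def gunzip_to_gbk(src_dir, dest_dir):
    """Decompress every *.gbff.gz in src_dir to a .gbk file in dest_dir.

    Returns the list of files written. The compressed originals are kept.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(Path(src_dir).glob("*.gbff.gz")):
        # Swap the '.gbff.gz' suffix for '.gbk'
        out_path = dest / (gz_path.name[: -len(".gbff.gz")] + ".gbk")
        with gzip.open(gz_path, "rb") as fin, open(out_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # streams in chunks, low memory use
        written.append(out_path)
    return written
```

Because the data is streamed rather than read whole, this step should not be limited by RAM even for large genomes.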

ADD REPLY

Kai has just put a new release out which should fix the problem anyway :)

ADD REPLY

Hello again!

So I finally got all of the bacterial genome files in .gbk format and in one folder. Using the Windows command line I ran this:

E:\MultiGeneBlast>makedb Bacteria_2020 ncbi-genomes-2020-06-18

Then I believe it started running, and then I got this error:

Incorporating ncbi-genomes-2020-06-18\GCA_002795825.1_ASM279582v1_genomic.gbk

Incorporating ncbi-genomes-2020-06-18\GCA_000190435.1_ASM19043v1_genomic.gbk

Incorporating ncbi-genomes-2020-06-18\GCA_002101355.1_ASM210135v1_genomic.gbk

Incorporating ncbi-genomes-2020-06-18\GCA_000145235.1_ASM14523v1_genomic.gbk

Traceback (most recent call last):
  File "makedb.py", line 138, in <module>
  File "makedb.py", line 106, in main
  File "dblib\parse_gbk.pyc", line 724, in parse_gbk_embl
MemoryError

Is this because the folder with all the .gbk files is 150GB, or is it an issue with my PC?

thanks again

ADD REPLY

I think so. If you had >150G data then this would not work locally if the work is happening in memory.
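
A rough way to compare the input against the available memory is to add up the file sizes. A minimal sketch, assuming the inputs are .gbk files in one folder; `total_size_gb` and `exceeds_ram` are hypothetical names, and the 16 GB default is just the figure mentioned in this thread:

```python
from pathlib import Path


def total_size_gb(folder, pattern="*.gbk"):
    """Sum the on-disk size of matching files, in gigabytes (10**9 bytes)."""
    return sum(p.stat().st_size for p in Path(folder).glob(pattern)) / 1e9


def exceeds_ram(folder, ram_gb=16.0):
    """True if the raw input alone is larger than the given RAM figure.

    This is only a lower bound on what an in-memory build would need: how
    much memory makedb really uses per file is not known from this thread.
    """
    return total_size_gb(folder) > ram_gb
```

If this returns True, an in-memory build cannot fit; if it returns False, the build may still fail, because parsed data often takes more memory than the files do on disk.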

ADD REPLY

I've googled a bit, and some people said it could be a Python issue or a RAM issue, but I still haven't solved it.

ADD REPLY

I suspect it's a RAM issue, yes. From what I remember, the database creation tools are not massively efficient.

You could try using the webtool instead?

ADD REPLY

which webtool sorry?

ADD REPLY

You can run the tool online, if you can't make the DB yourself: http://multigeneblast.sourceforge.net/

ADD REPLY

Ah okay, I thought you were referring to this just wanted to make sure.

Yeah, I tried to use that tool where you specify the divisions you want, but that too no longer works.

I feel I should read a bit more literature, narrow down the number of organisms I want to look at, and then make a database from there. I just thought that if I made a database of all bacterial sequences I wouldn't miss anything.

ADD REPLY

Update: I've tried to make a smaller custom database (~19GB, ~2,300 .gbk files instead of 18,000!).

When running MultiGeneBlast from the command line, I have again run into this error:

Incorporating genbank\GCA_002082195.1_ASM208219v1_genomic.gbk

Traceback (most recent call last):
  File "makedb.py", line 138, in <module>
  File "makedb.py", line 106, in main
  File "dblib\parse_gbk.pyc", line 721, in parse_gbk_embl
MemoryError
  

It must be a bug with the program itself right?

ADD REPLY

How much memory do you have on this Windows machine? You can watch the task manager after starting the program to confirm that exhaustion of memory leads to this error.

ADD REPLY

Does it always fail to incorporate that specific accession?
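
To answer this across several runs, the last file named before the crash can be pulled out of the console output. A minimal sketch, assuming the output of each run has been saved as text and that the progress lines look like the "Incorporating ..." lines quoted in this thread; `last_incorporated` is a hypothetical helper:

```python
import re

# Progress lines as printed in the logs quoted in this thread
_INCORPORATING = re.compile(r"^Incorporating\s+(\S+)", re.MULTILINE)
# Assembly accessions such as GCA_002082195.1 or GCF_000005845.2
_ACCESSION = re.compile(r"(GC[AF]_\d+\.\d+)")


def last_incorporated(log_text):
    """Return the accession on the last 'Incorporating ...' line, or None.

    Falls back to the raw path if no assembly accession is found in it.
    """
    paths = _INCORPORATING.findall(log_text)
    if not paths:
        return None
    match = _ACCESSION.search(paths[-1])
    return match.group(1) if match else paths[-1]
```

If the accession differs from run to run, a single bad file is an unlikely cause and memory exhaustion becomes the more plausible explanation.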

ADD REPLY
