Hello all,
I would like to use a relatively old program called MultiGeneBlast (from 2015) to do some work. The database packaged with the software is also from 2015, and since then many more bacterial genomes have been added to the NCBI website.
The software has a "create your own database" feature (it downloads GenBank files from online and builds a database from them), but this no longer works, and it only accepts .gbk and .embl files.
I have looked at other ways of downloading GenBank files, but on the NCBI website I can see the .gbk extension is now .gbff. Could I just download these and change the extension to .gbk, or would I likely run into problems?
Furthermore, is there another way I could make this database? From what I can see in the software's code, it makes the database files as either .pal or .tar, I think.
Sorry if my questions are trivial; any help is much appreciated!
Okay, so after rerunning...
Same error, different accession number.
I didn't know how to monitor memory usage properly, so I just took multiple print screens throughout (see below): https://ibb.co/QF6VZ9v https://ibb.co/x10drbm https://ibb.co/kBL4HZx https://ibb.co/y0zNn0L https://ibb.co/42nzZLd https://ibb.co/KDrVPRH https://ibb.co/bLYVBN3
However, I still don't know whether it is a RAM issue or whether the program is bugged.
It is difficult to say, but you have 16 GB of RAM. If this program is doing things in memory, then that will not be enough for 160 GB of data. @Joe: Do you recall how many genomes you had used?
For me it was 2,338, to be precise.
I will continue reading around.
Hopefully the developer(s) respond to the query email I have sent them; it may shed some light on this.
Is this a Windows-only program? Perhaps you can move this to a proper server/Linux machine?
Here is the link to the program: http://multigeneblast.sourceforge.net/
I am using Windows currently; I've never used Linux. My knowledge of computational things is limited :P
I will try to run this on my MacBook once it's repaired.
I use Linux and I experience the same problem
I tried to figure this out earlier, Genomax, but I think I deleted the database/genomes I used a while back. I know I made a database for every member of the Enterobacteriaceae though, which should have been several hundred, and most likely several thousand, genomes, but I did do this on a server with >350 GB of RAM.
I think it's highly likely that you're going to struggle to make a database of that size without dedicated computing resources.
I tried the same again, and the error is ever so slightly different this time:
Traceback (most recent call last):
  File "makedb.py", line 138, in <module>
  File "makedb.py", line 106, in main
  File "dblib\parse_gbk.pyc", line 751, in parse_gbk_embl
  File "dblib\parse_gbk.pyc", line 900, in get_sequence
  File "dblib\utils.pyc", line 212, in get_accession
MemoryError
The developer also replied to my e-mail, saying that 16 GB of RAM should be fine for this, and to just make smaller databases if the problem continues.
It's quite possible there is a bug somewhere, as it's not the tidiest codebase I've ever seen, so there would no doubt be room to optimise the code.
I don't profess to know exactly what the database tool actually does at the stepwise level, or what format a database has to take to be usable with the tool, but my gut feeling is that 16 GB of RAM is still not enough.
I would start with some super small test cases (e.g. a dozen genomes) and see if you can get it to complete at all.
Did you tell them the exact size/number of files you are trying to use? I second @Joe's suggestion of trying this out with a smaller number of genomes first.
Yes I did.
I also encountered an error when I tried making a db from just the Bacillus genomes (765 items).
But using Streptomyces (~233 items) it worked, I believe, though I did see a lot of warnings (example below):
Warning: non-unique protein accession:tnpB
Warning: non-unique protein accession:tnpB
Warning: non-unique protein accession:tnpB
Warning: non-unique protein accession:tnpB
Warning: non-unique protein accession:tnpB
I think I will try to see whether anyone at my university knows how to make the db or has access to a better computer, because obviously I'm using my home desktop. I feel that making a database (or databases) is the only major hurdle left.
750+ genomes is still quite a lot of data.
Those warnings are not a particular problem (which is why they're warnings and not errors).
If <250 worked but >750 didn't, the only explanations I can see are that it's still a RAM problem, or that a subset of those genomes is troublesome for some reason, though I can't imagine what that would be.
You could perhaps try it with some/all of RefSeq instead. This will be less data, but the quality is significantly higher as they're curated.
I will try RefSeq next instead.
Thanks for all the advice / help you guys have given me so far!
No worries - sorry we can't be more directly useful. It's a pretty niche tool so you're just going to have to keep experimenting I think.
Downloaded all complete genomes from the RefSeq database, extracted all the files, and changed the extensions from .gbff to .gbk.
The makedb command managed to read all the files, got to the point of starting to build the database, and then crashed with the same error (print screens below): https://ibb.co/71ZcFB8 https://ibb.co/T0NxPLP
Further update: I have sent someone at my university the folder of 765 files that failed to make a database on my PC, for them to try on the server.
Fingers crossed it works
If you can give me the command you were using to download the genomes and make the database, I can give it a go on our server too
That'd be awesome. I'll just summarize everything again. Ideally it'd be great if it's possible to make a full database of completed bacterial genomes, but if not, that's okay.
The aim was to make a database of all completed bacterial genome sequences for MultiGeneBlast:
Since Kai Blin's tool didn't work at the time, I downloaded all completed genomes for bacteria in .gbff format from https://www.ncbi.nlm.nih.gov/assembly (~50 GB compressed).
Now that the tool has been updated, I guess you could use a command along these lines instead:
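Something like this should work (just a sketch from memory of the ncbi-genome-download options, so double-check the flags against its help output):

# download all RefSeq bacterial assemblies at "Complete Genome" level, as GenBank flat files
ncbi-genome-download --section refseq --formats genbank --assembly-levels complete bacteria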
I unzipped all the files and put all the text files into one folder (~150 GB once unzipped, ~18,000 files).
After that I ran a short Python script so the files would have the correct extension for MGB to be able to read them.
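Roughly, the rename step amounts to something like this (a minimal sketch; the folder path is a placeholder for wherever the unzipped .gbff files live):

import os

folder = r"C:\genomes\gbff_files"  # placeholder: folder containing the unzipped .gbff files
for name in os.listdir(folder):
    if name.endswith(".gbff"):
        src = os.path.join(folder, name)
        dst = os.path.join(folder, name[:-len(".gbff")] + ".gbk")
        os.rename(src, dst)  # same file, new extension, so MGB's makedb will pick it up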
Then, from the MultiGeneBlast directory, I ran the makedb command from the command line:
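(Roughly of this shape, with placeholders in angle brackets; I think the database name comes first and then the input .gbk files/folder, but check the MGB manual for the exact argument order.)

python makedb.py <database_name> <folder or list of .gbk files>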
Obviously, it didn’t work and didn’t work on my macbook either.
I then wanted to just slim the number of files down to see if it would work with less:
This command got approximately 2,300 files, and I encountered the same error making the db. Then doing it for just Streptomyces worked (~230 files), but not for Bacillus (~700 files).