Dear all,
I am trying to import the current TrEMBL database into BioSQL for further analysis. The current TrEMBL release is about 0.5 TB and has roughly 107M entries. So far I have downloaded the XML and split it into files of roughly 20,000 entries each (= 6,400 files à 90 MB). This went reasonably fast. Now I want to import these files into BioSQL. However, while reading one file with Biopython takes less than a minute, exporting it to SQL takes more than 17 minutes. With 6,400 files x 17 min, I am looking at around 75 days before the import is complete! Is it my hardware (standard desktop i5, 4 GB RAM), which is simply not capable of handling this kind of data, or am I doing something seriously wrong on the programming side?
Thanks for any performance tips! Best, Jan
Here is the little script I use for the import:
from Bio import SeqIO
from BioSQL import BioSeqDatabase
import sys
import time

wait_f_min = 6 * 60 * 0   # optional delayed start in minutes (the trailing *0 disables it)
for i in range(1, wait_f_min):
    sys.stdout.write("Sleeping for {:5} minutes...\r".format(wait_f_min - i))
    sys.stdout.flush()
    time.sleep(60)

print("Connect to server")
server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd="nytech",
                                      host="localhost", db="bioseqdb")
if "TrEMBL" in server:
    print("Deleting old database")
    del server["TrEMBL"]
    server.commit()
db = server.new_database("TrEMBL", description="all TrEMBL proteins Jan/2018")
server.commit()

print("Starting parsing files")
count = 0      # file counter, incremented before each file is opened
total = 6400   # total number of files
print("Start {}".format(time.strftime("%H:%M:%S")))
while True:
    loop_start = time.time()
    count += 1
    print("File # {} reading ".format(count), end='')
    sys.stdout.flush()
    try:
        handle = open(r'g:\workfile_' + str(count) + '.xml', "r")
    except IOError:
        print("Error opening file!")   # probably finished with all files
        break
    records = list(SeqIO.parse(handle, "uniprot-xml"))
    handle.close()
    print(" R{:5.2f} ".format((time.time() - loop_start) / 60), end='')   # read time in minutes
    sys.stdout.flush()
    write_start = time.time()
    db.load(records)       # write the records to the database
    server.commit()
    loop_time = time.time() - loop_start
    print(" W {:5.2f}".format((time.time() - write_start) / 60), end='')  # write time in minutes
    print(" T {:5.2f}".format(loop_time / 60), end='')                    # total time for this file
    to_go = loop_time * (total - count) / 3600
    print(" to go: {:3.2f}h ".format(to_go))
print("End {}".format(time.strftime("%H:%M:%S")))
This may be a long shot... Try moving the server.commit() outside the while loop, i.e. load everything first and then commit in bulk. By the way, it seems that most of the BioSQL code is 8-10 years old; maybe it is not tuned to cope with the amount of data you have.

This is a comment because it is not at all answering your question. Instead of building your own database, why not use our SPARQL.uniprot.org? It is likely to model the UniProt data better than BioSQL does.
Otherwise I suspect your hardware is not at the level that one would use for a database that size.
If you just need sequence and organism, use the tab download options on www.uniprot.org and MySQL's tab/comma-delimited file import functionality.
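As a rough sketch of that route (the table layout, file path, and column order below are assumptions and have to match whatever columns you pick for the tab download; the server also needs local_infile enabled):

# Sketch: bulk-load a tab-delimited UniProt download (e.g. accession,
# organism, sequence) straight into MySQL/MariaDB. Table layout and the
# file path 'g:/trembl.tab' are assumptions; adjust to the columns you
# actually select on www.uniprot.org.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="root", passwd="nytech",
                       db="bioseqdb", local_infile=1)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS trembl_flat (
        accession VARCHAR(32) PRIMARY KEY,
        organism  VARCHAR(255),
        sequence  LONGTEXT
    )
""")

# LOAD DATA runs server-side in one pass and is usually far faster than
# row-by-row INSERTs issued through an ORM-style layer.
cur.execute("""
    LOAD DATA LOCAL INFILE 'g:/trembl.tab'
    INTO TABLE trembl_flat
    FIELDS TERMINATED BY '\\t'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (accession, organism, sequence)
""")
conn.commit()
conn.close()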
My final goal is to select certain protein sequences (like "all mammalian proteins") and use a regular expression to search for certain motifs... This might take a while, so I am unsure whether a remote database will be able to handle that... am I wrong?
It would probably work OK on our endpoint. But I think we have just the tool you want, namely prosite. I would also have a look at not using a database at all in that case, as databases will not perform optimally for your use case.
Dear me, thanks for your reply! I know prosite, but after a quick glance over the "pattern language", it does not look like it is powerful enough for my search requests.
I am actually wondering whether I should just use a taxonomy tree (potentially stored in MySQL) and split all TrEMBL entries into "single species" files...
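Roughly what I have in mind is an untested sketch along these lines; it assumes the uniprot-xml parser puts the species name into record.annotations["organism"], and the paths are just my file layout:

# Untested sketch: stream the split XML files once and sort sequences into
# per-species FASTA files. Assumes record.annotations["organism"] holds the
# species name; opening the output file per record keeps the sketch simple,
# batching the writes would be faster.
import os
import re
from Bio import SeqIO

out_dir = r"g:\by_species"
os.makedirs(out_dir, exist_ok=True)

count = 0
while True:
    count += 1
    path = r"g:\workfile_{}.xml".format(count)
    if not os.path.exists(path):
        break
    for record in SeqIO.parse(path, "uniprot-xml"):
        organism = record.annotations.get("organism", "unknown")
        # make the species name safe to use as a file name
        fname = re.sub(r"[^A-Za-z0-9_]+", "_", organism) + ".fasta"
        with open(os.path.join(out_dir, fname), "a") as out:
            SeqIO.write(record, out, "fasta")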
Regards, Jan
Dear dariober,
thanks for your answer. Taking the commit out of the loop was my first iteration of this script. The only major difference is that whenever the script exits for any reason, the database does a full rollback, which can actually take longer than the original inserts. However, to put numbers on that, I added a timestamp before and after the commit. Here are the results for 20,000 entries:
File # 1 reading R 0.17 W 7.08 C 0.00 T 7.25 to go: 773.24h
On a different note, I noticed that, due to the structure of BioSQL, more than XXXX rows are created across the various tables for "only" 20,000 entries. Maybe it is just typical of MySQL (actually MariaDB in my case) that this amount of queries takes so long? Would another backend (PostgreSQL or anything else) be faster?
One further thing I noticed: my files should have 20,000 entries each, but the first one only has 10,000 (that's okay, I split the XML mostly by size). However, if I look at my database after committing just the first file, I have the following new rows:
So for roughly 10,000 entries there are nearly 1,000,000 rows in my database. Maybe this is really the problem, and I should NOT import all the data (in principle I only need the sequence and the organism...). I will try that and keep this thread updated.
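For anyone who wants to reproduce the comparison, a quick sketch like this counts the rows per table; the table names are the ones from the standard BioSQL MySQL schema and may differ on other installs:

# Sketch: count rows per table to see where the BioSQL blow-up happens.
# The table list below assumes the standard BioSQL MySQL schema; adjust it
# to whatever SHOW TABLES reports on your installation.
import MySQLdb

tables = ["bioentry", "biosequence", "bioentry_qualifier_value",
          "seqfeature", "seqfeature_qualifier_value",
          "dbxref", "bioentry_dbxref", "term", "taxon", "taxon_name"]

conn = MySQLdb.connect(host="localhost", user="root", passwd="nytech",
                       db="bioseqdb")
cur = conn.cursor()
for table in tables:
    cur.execute("SELECT COUNT(*) FROM {}".format(table))
    print("{:30} {:>12}".format(table, cur.fetchone()[0]))
conn.close()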
However, I am still interested in advice and further tips!
Best, Jan
Sorry, I can't be more specific as I'm not familiar with BioSQL and, of course, with your task. But...
I wanted to say this in my first comment... Maybe you should move away from BioSQL altogether and parse the XML files to extract what you need. I think having data in a database is useful if you regularly need to do complex queries. For a one-off data extraction step, just parse the XML.
I really would like to have the sequences organised in a taxonomy tree, so that I can ask for "all mammalian sequences" or "all nematode species" etc... This is difficult if you always have to parse the terabyte-scale XML file...
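That said, maybe a single streaming pass over the split files is already enough; here is an untested sketch of what I mean. It assumes the uniprot-xml parser exposes the lineage under record.annotations["taxonomy"], and the motif pattern is only a placeholder:

# Untested sketch: one streaming pass over the split XML files, keeping only
# mammalian entries and scanning them with a regex. Assumes the lineage is in
# record.annotations["taxonomy"]; the motif below is only a placeholder.
import glob
import re
from Bio import SeqIO

motif = re.compile(r"C..C.{10,20}C..C")   # placeholder pattern

for path in glob.glob(r"g:\workfile_*.xml"):
    for record in SeqIO.parse(path, "uniprot-xml"):
        lineage = record.annotations.get("taxonomy", [])
        if "Mammalia" not in lineage:
            continue
        for match in motif.finditer(str(record.seq)):
            print(record.id, record.annotations.get("organism", ""),
                  match.start(), match.group())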