Parallel Computing With Key/Value Stores (Multiple Writes)
13.6 years ago
Sequencegeek ▴ 740

Hi everyone,

I've been setting up a data management system using Berkeley DB and its Python bindings (bsddb3) to run parallel tasks for ~80 million RNA sequences that have properties similar to the "Target RNA" schema below.

Target RNA
length, sequence, expression, mismatchPositions, etc.

Problem:

Ten parallel processes that access separate records ("rows") and write to the database run very slowly, and each job seems to run slower than the previous one, perhaps indicating some sort of locking issue. For instance, to test updating the length of 10 million sequences, I'll run 10 separate jobs that each access a different range of records (0-1,000,000 for the first job, 1,000,001-2,000,000 for the second, etc.) like so:

# n is the job number; dbFN is the file location of the database
import bsddb3

db = bsddb3.db.DB()
db.set_cachesize(1, 0)  # 1 GB of cache
db.open(dbFN, None, bsddb3.db.DB_HASH, bsddb3.db.DB_CREATE)
# each job updates its own slice of 1,000,000 records
for i in range(n * 1000000, (n + 1) * 1000000):
    db[str(i)] = str(int(db[str(i)]) + 1)
db.close()

The runtimes (in seconds) for the jobs are something like 30, 50, 60, 70, etc., whereas running a single job to update all 10 million sequences takes ~120 s. The writes seem to slow down the whole cluster even though the database is only 200 MB in total. Running 100 read-only jobs works fine. I'd like to be able to run 30+ jobs concurrently, so not being able to run 10 worries me.

Questions:

  1. Each process is accessing different records, so why would the speed of the nth job depend on the previous jobs? Doesn't BDB lock on records as opposed to entire DBs? Would manually using transactions speed this up?
  2. In my previous post there was mention of using Redis to run 100 jobs in parallel, all accessing the same in-memory file. Is this something Berkeley DB is not capable of because its records aren't stored in RAM the way Redis's are?
  3. How can I tell which part of the system is causing the problem? Is it a BDB problem or a network/filesystem problem? I'm fairly ignorant about filesystems and networks and don't know how to troubleshoot this type of problem; any tips would be appreciated.
  4. Are there flags I can pass to allow easier concurrency, or DB parameters I can change, such as page sizes, to make it more efficient?
  5. If the problem is due to my network/filesystem setup, how is your system set up to allow for this type of concurrency?

System Setup/Stats
- BDB 4.8 with bsddb3
- 5 nodes (with Gigabit Ethernet connections) using NFS
- "dd" throughput: 60 MB/s
- cache: 1 GB for each DB connection
- each DB is a HASH type

If any more info is needed I'll update ASAP.

Thanks.

EDIT:

FYI, Berkeley DB does not support concurrent database access by multiple processes over NFS (as mentioned in the comments): http://www.oracle.com/technetwork/database/berkeleydb/db-faq-095848.html

Also, Tokyo Cabinet and Kyoto Cabinet do not support simultaneous database access by multiple processes at all; Tokyo Tyrant/Kyoto Tycoon have to be used instead, although I'm not sure whether even these are safe over NFS.

My experience is that using NFS with multiple nodes accessing the same information is a very bad idea. I don't know the details of this particular problem, but if it's just reading and writing simple data, you might be better off using MySQL. If the problem is more about distributing load across nodes, I'd recommend Hadoop.

Re: NFS, multiple nodes writing to the same file is bad news. I tried all kinds of locking schemes and never got them to work consistently, i.e. it would fail one in a million times. I finally gave up and just wrote to separate files and aggregated them later. Inelegant, but it works. A minimal sketch of this workaround is below.
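
For illustration, a minimal sketch of that workaround in Python; the file names and the compute_update() helper are hypothetical placeholders. Each job writes only to its own file (no shared writes over NFS), and a single process applies the results to the database afterwards:

import glob

def run_job(n):
    # Each job writes its slice of updates to a private file.
    with open("updates.%d.txt" % n, "w") as out:
        for i in range(n * 1000000, (n + 1) * 1000000):
            out.write("%d\t%s\n" % (i, compute_update(i)))  # compute_update() is a placeholder

def aggregate(db):
    # A single writer applies all per-job files to the database serially.
    for fn in sorted(glob.glob("updates.*.txt")):
        with open(fn) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                db[key] = value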

I think you will get some good answers on the BDB forum: http://forums.oracle.com/forums/forum.jspa?forumID=271

I'm copying the question there right now, but I figured I'd ask here as well because people might have experience with it.

Thanks for the info, everyone. Will a network interface be safe with NFS? Since only one process/server accesses each database (Tyrant, Tycoon, Redis), I'd venture that it would be?

@Aleksandr Levchuk: Redis looks amazing, but if I can get a solution that doesn't load all the databases into memory, that would be better. I'm currently trying out both Redis and Tyrant. PS: I've referenced your example about loading 100 processes using a single database thread multiple times since that post; thanks for the reply - it was helpful.

13.6 years ago
brentp 24k

It might help to show some more of your code. First, you are using processes, not threads, correct? Though it sounds like, if it's slowing down the entire cluster, it's I/O-limited anyway.

Next, for 80 million records you should probably use a BTree: in my experience you'll get a lot of collisions with a HASH type, and it will slow down quite a lot.
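
For reference, switching access methods is a one-line change when the database is created; a minimal sketch (note an existing HASH file can't simply be reopened as a BTree, you'd reload the records into a new file):

import bsddb3

db = bsddb3.db.DB()
# DB_BTREE instead of DB_HASH; the rest of the open call is unchanged.
db.open(dbFN, None, bsddb3.db.DB_BTREE, bsddb3.db.DB_CREATE)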

If concurrency is your main concern, you might want to look at Kyoto Cabinet, a DBM designed with that in mind.

Code updated. Yes, I'm using processes, one CPU for each process. I think it might have to do with I/O too, but I don't know how to confirm that. If it were a matter of throughput, wouldn't 100 read-only processes be slow as well? Only the write processes hang. Any suggestions on how to verify I/O problems?

When writes hang in MySQL it's usually because of indexes, but BDB is a different beast.

You could just check top on a node and see whether the load is high even though CPU usage is low; presumably that will be the case, since you're not doing anything computationally intensive, and it would point to I/O wait. Have you tried tuning the page and bucket sizes to match your data (see the sketch below)?
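
For example, a minimal sketch of that tuning with bsddb3; the specific values are assumptions to adjust for your data, and all of these must be called before open() since they only take effect when the database file is created:

import bsddb3

db = bsddb3.db.DB()
db.set_pagesize(4096)        # page size in bytes; try matching the filesystem block size
db.set_h_ffactor(32)         # desired fill factor: records per hash bucket (an estimate)
db.set_h_nelem(80000000)     # expected number of records, so buckets are preallocated
db.open(dbFN, None, bsddb3.db.DB_HASH, bsddb3.db.DB_CREATE)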

13.6 years ago

Can you devise a means of writing your inserts and updates to a file, to be executed all at once after the select queries are complete? That seems to be a common strategy when concurrency is not required.
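
A minimal sketch of that read-then-write split, using the same bsddb3 API as the question (collecting the updates in a dict is an assumption; they could equally be spooled to a file):

def read_phase(db, keys):
    # Compute every update first, touching the database read-only.
    updates = {}
    for k in keys:
        updates[k] = str(int(db[k]) + 1)
    return updates

def write_phase(db, updates):
    # Apply all writes in one pass once the reads are finished.
    for k, v in updates.items():
        db[k] = v
    db.sync()  # flush to disk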

Thanks for the reply. I can apply all the writes at once after manipulating the data (I think/hope) using transactions. The downside is that I have to store the data in RAM before writing it, and for some jobs that's exactly why I need random access. If it hangs because I do too many writes per second, then I will definitely change it to do a bulk update, but I don't know whether that is the reason and want to verify first.

13.6 years ago
lh3 33k

An unrelated answer. You must have good reasons to use a database, and to modify the entire table in a short wall-clock time. But when you want to modify the entire database, it is faster to use a single thread and write out a new table. For 10 million rows and 200 MB of data, Perl/Python can do most simple operations on a data stream in ~10 seconds.
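
For example, assuming the table is dumped as tab-separated key/value lines (an assumption about the dump format), the whole-table length update from the question becomes a single streaming pass:

import sys

# usage (hypothetical file names): python update.py < table.old.tsv > table.new.tsv
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    sys.stdout.write("%s\t%d\n" % (key, int(value) + 1))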

10.0 years ago

Each process is accessing different records, so why would the speed of the nth job depend on the previous jobs?

I don't know (yet).

Doesn't BDB lock on records as opposed to entire DBs?

No. With the BTree and Hash backends, bsddb does not lock individual records; it uses page locking. You can tune the page size of the database to reduce the number of lock conflicts and speed things up.

Will manually using transactions speed this up?

I'm not sure what you mean by manual transactions. Transactions will always be slower than no transactions. It's not clear from your question whether this is for batch loading of data or for an application-specific workload; those are different workloads and might be addressed differently.
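
For reference, using transactions explicitly in bsddb3 means opening the database through a transactional environment; a minimal sketch, where the environment directory and database file name are hypothetical:

from bsddb3 import db

env = db.DBEnv()
env.open("/local/disk/env",  # environment home; must be on a local filesystem, not NFS
         db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_INIT_LOCK |
         db.DB_INIT_LOG | db.DB_INIT_TXN)

d = db.DB(env)
d.open("targets.db", None, db.DB_HASH, db.DB_CREATE | db.DB_AUTO_COMMIT)

txn = env.txn_begin()
d.put(b"42", b"101", txn=txn)  # writes grouped into one atomic transaction
txn.commit()

d.close()
env.close()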

In my previous post there was mention of using redis to run 100 jobs running in parallel all accessing the same memory file, is this something berkeley db is not capable of doing because the records aren't stored in ram like redis?

Kind of, yes. You can configure bsddb to work fully in memory, with or without persistence of the data, so in the end it's much the same, except that loading bsddb into memory is quite a bit faster.
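
For reference, a minimal sketch of a fully in-memory bsddb3 database: opening with no filename keeps it out of the filesystem, and the contents are lost on close.

from bsddb3 import db

d = db.DB()
d.set_cachesize(1, 0)  # 1 GB of cache to hold the data
d.open(None, None, db.DB_HASH, db.DB_CREATE)  # no filename: in-memory only
d[b"1"] = b"42"
print(d[b"1"])
d.close()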

How can I tell which part of the system is causing the problem?

By exploration: try different tunings and change the configuration of bsddb.

Is it a BDB problem or a network/filesystem problem?

NFS is a no-go for anything database-related; you must use a local hard disk.

If the problem is due to my network/filesystem setup, how is your system set up to allow for this type of concurrency?

Check whether your jobs can conflict with each other; if not, just don't use transactions at all (but make a backup of the database).

Regarding BTREE vs HASH, bsddb documentation says:

Choosing between BTree and Hash

For small working datasets that fit entirely in memory, there is no difference between BTree and Hash. Both will perform just as well as the other. In this situation, you might just as well use BTree, if for no other reason than the majority of DB applications use BTree.

Note that the main concern here is your working dataset, not your entire dataset. Many applications maintain large amounts of information but only need to access some small portion of that data with any frequency. So what you want to consider is the data that you will routinely use, not the sum total of all the data managed by your application.

However, as your working dataset grows to the point where you cannot fit it all into memory, then you need to take more care when choosing your access method. Specifically, choose:

  • BTree if your keys have some locality of reference. That is, if they sort well and you can expect that a query for a given key will likely be followed by a query for one of its neighbors.
  • Hash if your dataset is extremely large. For any given access method, DB must maintain a certain amount of internal information. However, the amount of information that DB must maintain for BTree is much greater than for Hash. The result is that as your dataset grows, this internal information can dominate the cache to the point where there is relatively little space left for application data. As a result, BTree can be forced to perform disk I/O much more frequently than would Hash given the same amount of data.

Moreover, if your dataset becomes so large that DB will almost certainly have to perform disk I/O to satisfy a random request, then Hash will definitely outperform BTree because it has fewer internal records to search through than does BTree.

http://docs.oracle.com/cd/E17076_04/html/toc.htm

ADD COMMENT
0
Entering edit mode
6.2 years ago

I tested bsddb further in multi-process settings; it's a pain.

I now recommend FoundationDB, which has a similar API.

Good luck and happy hacking!
