Data Selection With/Without Databases (Large Data Sets, ORMs, And Speed)
13.7 years ago
Sequencegeek ▴ 740

Hi all!

I'd like to gather some opinions on the use of databases since I noticed that many people that answered questions (@Pierre) use them. Specifically, I'm trying to set up a sqlite ORM for some of my python scripts but I'm worried about speed and was wondering if anyone had tips for working with datasets with > 1 million rows (~2GB flatfile size).

So here we go:

1) Which types of tasks (examples) do you employ your databases for?
2) Are there tasks where you've found they do NOT work at all due to speed/memory?

It seems that it would be beneficial to use a database if possible, because it makes selecting the data needed for each step of a project very easy (not to mention that every property is already cast to the correct type). Instead of creating hash tables and cross-referencing them for every bit of data, a SELECT statement can be used. If you prefer not to use databases for selecting your data, how do you do it?

Thanks

EDIT:

Thanks everyone! This is exactly what I was hoping for: a survey of all the options and some tips on how I should proceed. I've spent a lot of time looking at Python bindings to BerkeleyDB and how to use it as a key/value store. At the end of the day I think I'll try to create two solutions: one based on BerkeleyDB and the other on sqlite (I need to learn it anyhow). A lot of the programs need to be able to run in parallel, so it might be a little tricky...

database python data • 7.3k views

Perhaps you could ask your additional edit as a second question? Otherwise everyone has to go back and correspondingly edit their answer.


Your targeted RNA data is actually not so large (assuming each entry is not the whole transcriptome). All operations will be so fast that the choice of database will not make a difference in performance.


+1 @Michael Barton: Agree.


Your targeted RNA data is actually not large at all (assuming each entry is not a mammalian transcriptome). All operations will be so fast that the choice of database will not make a difference in performance.

13.7 years ago
  1. Which types of tasks (examples) do you employ your databases for?

  • MySQL for mirroring the common public databases (UCSC, Ensembl) and for simple databases.
  • Sometimes a Java embedded database (Derby, HSQLDB) when I'm writing a tool but don't need/want a remote server. Using a SQL database allows me to quickly check/update/select the data by hand and to find a bug.
  • I fell in love with BerkeleyDB-JE, a key/value embedded datastore. It's powerful, simple, fast, and requires only one dependency (a jar). The drawback is that you cannot query the data with a simple 'select'; you'll need to write a program.
  • I wish I had found a good tutorial for HDF5. Despite my previous question, I haven't been able to store/query some data in HDF5.

  2. Are there tasks where you've found they do NOT work at all due to speed/memory?

Sometimes a simple command line (or a shell script) will give a faster answer. For example, I don't imagine people would store FASTQ sequences in a database.

  3. And I would add: "Are there tasks where you've found they work but a simpler solution exists?"

Yes: if your dataset is small, you can put everything in an XML file and query it with DOM or XPath (or JSON + JavaScript, or RDF + SPARQL, etc.).
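To illustrate the "small dataset in an XML file" idea, here is a minimal Python sketch using only the standard library. The element and attribute names (`snps`, `snp`, `chrom`) are made up for the example, and note that `xml.etree.ElementTree` only supports a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy dataset; element and attribute names are invented for illustration.
XML_DATA = """
<snps>
  <snp id="rs1" chrom="chr1" pos="100"/>
  <snp id="rs2" chrom="chr2" pos="250"/>
  <snp id="rs3" chrom="chr1" pos="900"/>
</snps>
"""

root = ET.fromstring(XML_DATA)

# ElementTree understands simple XPath expressions, including attribute predicates.
chr1_ids = [snp.get("id") for snp in root.findall("./snp[@chrom='chr1']")]
print(chr1_ids)
```

For anything beyond simple predicates a full XPath engine would be needed, which is outside what this sketch shows.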

EDIT: about the ORM. I've played with Hibernate. I found that ORMs are a problem for biological databases because most of the data are read-only and ORMs are not designed (?) to handle large datasets, e.g.:

List<Snp> mysnp= genome.getSnps(); //BAOUM !

The ORM just adds a layer of complexity. See also: http://stackoverflow.com/questions/398134/what-are-the-advantages-of-using-an-orm/398182

EDIT2: sqlite is now supported by berkeleyDB, so the size of your data shouldn't be a problem. http://www.oracle.com/us/corporate/press/063695


I've used Hibernate and found it rather bloated and heavy. I think this is possibly due to the restrictions on metaprogramming in Java. My experience of ORMs in Ruby is that they are light and a joy to use. Even for read-only data ORMs can turn scripts filled with concatenation-based generation of SQL queries into easier to read OO code.


Thanks for pointing this out, Michael.


@Pierre Lindenbaum: Thanks for all the suggestions, I really appreciate it. I've been doing a lot of research on bdb, redis, and tokyo cabinet. A lot of sources mention tokyo cabinet as the "successor" to bdb, do you have any experience with it?


I played with couchdb, mongodb but I didn't test redis & tokyo. See also Brad's post.

13.7 years ago
Michael Barton ★ 1.9k

As a general rule in software development (or perhaps life in general?) only introduce extra complexity when it becomes too painful to continue otherwise. In this case does the extra setup required for a database offset the pain of your current methods?

If you are working with a flat file with a million rows can you parse the large file iteratively? E.g. using Unix parallel?

 parallel -a /reinvented_blast_parsers.txt tiger_blood.py

For me databases are useful when I want to process a subset of data, for example selecting and using only the entries that match specific criteria. If your data is as simple as a set of key/values then you might like to take a look at NoSQL tools like Redis. You won't need to bother with an ORM and you can just use the database as a fast and persistent hash table.
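The "persistent hash table" idea can be tried without installing anything. The sketch below is not Redis: it swaps in Python's standard-library `dbm` module (a simple on-disk key/value store) purely to show the access pattern. The file name and keys are hypothetical.

```python
import dbm
import os
import tempfile

# Hypothetical file name and keys, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "lengths.db")

# "c" opens the store for reading and writing, creating it if needed.
with dbm.open(path, "c") as db:
    db["seq1"] = "1500"  # keys and values are stored as bytes
    db["seq2"] = "320"

# Reopen later (even from another script): the data persisted on disk.
with dbm.open(path, "r") as db:
    seq1_length = int(db["seq1"].decode())
    has_seq3 = b"seq3" in db

print(seq1_length, has_seq3)
```

Unlike a Redis server, a `dbm` file is not meant to be written by many processes at once, so this only stands in for the single-script case.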

If you are going to normalise your data, e.g. a table called 'Companies' with a foreign key relationship to a table named 'Employees', then you will probably need something such as mysql/postgres. You can use an SQL query to join the tables together and fetch the required entries. This can still be very slow depending on your database size and how it's normalised. Using the EXPLAIN tool in SQL and properly setting indexes will improve this.
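As a rough sketch of the join, index, and EXPLAIN steps described above, here is the Companies/Employees example in Python. It assumes SQLite (via the standard `sqlite3` module) rather than mysql/postgres so that it runs without a server; SQLite spells the command `EXPLAIN QUERY PLAN`, and the column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        company_id INTEGER REFERENCES companies(id)
    );
    -- Index the foreign key so the join does not have to scan every employee.
    CREATE INDEX idx_employees_company ON employees(company_id);
""")
conn.executemany("INSERT INTO companies VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ann", 1), (2, "Bob", 2), (3, "Cy", 1)],
)

query = """
    SELECT e.name
    FROM companies AS c
    JOIN employees AS e ON e.company_id = c.id
    WHERE c.name = ?
    ORDER BY e.name
"""
acme_staff = [row[0] for row in conn.execute(query, ("Acme",))]
print(acme_staff)

# Each returned row describes one step of the plan the engine chose.
for step in conn.execute("EXPLAIN QUERY PLAN " + query, ("Acme",)):
    print(step)
```

The plan output differs between SQLite versions, so it is best read by eye (does it mention the index or a full scan?) rather than parsed.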

I'm a big fan of Object Relational Mapping (ORMs) and prefer it over SQL. However if there's a particular bottleneck in your program, then dropping down to raw SQL can be very helpful while maintaining the rest of your code in the much easier to read ORM syntax.

13.7 years ago
lh3 33k

I can only talk about SQL databases as I have little experience with others such as HDF and Google BigTable.

  1. I have only used MySQL for a web-based database. I believe a database is also necessary when you develop a pipeline and need to keep lots of statistics.

  2. Most SQL databases are inefficient given huge data: >1 billion records or >500GB in size. I do not use SQL for genomic position/interval data (e.g. BED/GFF/VCF). There are better tools for this task. For huge FASTA files, there are also better strategies than a SQL database.

For a small file of a few GB in size, whether to use a database largely depends on the access pattern and the requirements. If instant random access is essential, of course you need a database. If not, I would rather load the keys and file offsets into memory. This is quite easy and fairly fast. If the file is just 2GB, I may frequently load the entire file into memory. As long as you do not split every field, doing this will not blow up your machine. Furthermore, for the majority of daily work, working on a stream is enough.
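A minimal sketch of the "keys and file offsets in memory" approach, assuming a tab-separated file whose first column is a unique key (the file layout and function names are made up for illustration):

```python
import os
import tempfile

def build_offset_index(path):
    """Map the first tab-separated field of each line to its byte offset."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0].decode()
            index[key] = offset
    return index

def fetch_record(path, index, key):
    """Seek straight to one record instead of scanning the whole file."""
    with open(path, "rb") as handle:
        handle.seek(index[key])
        return handle.readline().decode().rstrip("\n").split("\t")

# Hypothetical three-line file for demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "records.tsv")
with open(demo_path, "w", newline="\n") as out:
    out.write("id1\tACGT\t10\nid2\tGGCC\t20\nid3\tTTAA\t30\n")

offsets = build_offset_index(demo_path)
record = fetch_record(demo_path, offsets, "id2")
print(record)
```

Only the keys and one integer per record stay in memory; the fields are split only for the records you actually fetch.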

It would be good to give an example about what you want to store.

EDIT: Now that you have given a specific example: for this amount of data, I would definitely not use a SQL database, unless you want to get the information for each id instantly. You only need to load the smaller data set into memory and stream the larger data set. This is very easy, lightweight and very fast. Some processing can even be done on the command line with awk.
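The "load the smaller set, stream the larger one" pattern might look like this in Python. The two-column tab-separated layout and the ids are assumptions for the sake of the example; `io.StringIO` stands in for real file handles.

```python
import io

def stream_matches(small_handle, large_handle):
    """Yield rows of the large file whose id also appears in the small file."""
    # The smaller data set fits in memory as a dict: id -> annotation.
    annotations = {}
    for line in small_handle:
        ident, annotation = line.rstrip("\n").split("\t")
        annotations[ident] = annotation

    # The larger data set is read one line at a time and never held in full.
    for line in large_handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in annotations:
            yield fields + [annotations[fields[0]]]

# Hypothetical in-memory stand-ins for the two files.
small = io.StringIO("id2\tmiRNA\nid3\tsnoRNA\n")
large = io.StringIO("id1\t5\nid2\t7\nid3\t9\nid4\t1\n")
matches = list(stream_matches(small, large))
print(matches)
```

Memory use is bounded by the smaller file, no matter how large the streamed file grows.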

13.7 years ago

I once tried to implement an ORM interface for a database of genotype data (link). It was fun, but I definitely do not recommend it to you: it is a big effort and it is probably not worth the time.

  1. ORMs are slow for some large-scale operations. First of all, uploading data takes a lot of time, because the module has to create an object for each row of the table and then upload it. Most of the time you end up using raw SQL queries, which defeats the purpose of implementing an ORM.

  2. The programming languages that have good ORM libraries usually do not have good support for working with tables. For example, python has excellent libraries like SQLAlchemy and Elixir, which make creating an ORM very easy; however, python's support for table-oriented data analysis is terrible compared to R. There is no equivalent of the data.frame object in python, unless you use the scipy modules, which unfortunately are still under development. In any case: if you are working with data in tabular format, you need a programming language designed to work on data in tabular format.

  3. An ORM makes your scripts less usable by other members of your group. One of the advantages of using a database is that you and your colleagues can access the same data, and possibly share a common repository of scripts to handle it. However, if your colleagues do not use the programming language you used for the ORM, they won't be able to use it at all. It is a lot better to define some SQL stored procedures (example).

That said, even though I advise against an ORM, I strongly recommend that you use a database for your data. There are R libraries to upload a data frame to a table (RSQLite, RPGSQL) or to query data frames as if they were SQL tables (sqldf).
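Regarding point 1 (slow uploads through an ORM), the raw-SQL alternative can be sketched with Python's standard `sqlite3` module. The table layout and the row generator are hypothetical; the point is one prepared statement and one transaction, with no object built per row beyond a plain tuple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genotypes (sample TEXT, snp TEXT, genotype TEXT)")

def rows():
    """Hypothetical row source; in practice this would parse a flat file."""
    for i in range(1000):
        yield (f"sample{i % 10}", f"rs{i}", "AG")

# "with conn" wraps the whole load in a single transaction and commits at the end.
with conn:
    conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?)", rows())

n_rows = conn.execute("SELECT COUNT(*) FROM genotypes").fetchone()[0]
print(n_rows)
```

Because `executemany` consumes a generator, the input never has to be held in memory all at once.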

13.7 years ago

Relational databases (and ORMs) have use-cases where they are extraordinarily powerful but also have use-cases where they are terrible.

For example:

  1. Good: selecting one entry based on multiple conditions over many tables
  2. Bad: selecting a million entries based on multiple conditions over many tables

Most ORMs are not designed (or tested) to handle a very large number of return values, and if you need such returns you will almost certainly encounter hair-raising performance problems.


I wrote about improving some of my performance issues with ORMs a while ago - http://bit.ly/gneuLF


In Ruby's ActiveRecord return values can be iterated over in batches to prevent the problem of having many values in memory at once.
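The same batching idea works without an ORM. This is a sketch in Python using the standard DB-API `fetchmany` call on a `sqlite3` cursor; the table and the helper name `iter_batches` are made up for the example.

```python
import sqlite3

def iter_batches(cursor, batch_size):
    """Yield lists of at most batch_size rows until the cursor is exhausted."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (id INTEGER PRIMARY KEY, length INTEGER)")
conn.executemany("INSERT INTO reads (length) VALUES (?)", [(n,) for n in range(10)])

cursor = conn.execute("SELECT id, length FROM reads ORDER BY id")
batch_sizes = [len(batch) for batch in iter_batches(cursor, 4)]
print(batch_sizes)
```

Only one batch of rows is materialised at a time, which keeps memory flat even for very large result sets.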

13.7 years ago
Joachim ★ 2.9k

Databases can speed-up your data retrieval a lot, but only if you use them properly. So, you need to create indexes on the right columns and it is up to you to pick the quickest query method (for example, use SELECT...IN... instead of SELECT...OR...OR...OR...).
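A small sketch of the index plus `SELECT ... IN ...` advice, written in Python against SQLite so it runs without a server (the gene table is a made-up example). The placeholders are generated rather than concatenating values into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (symbol TEXT, chrom TEXT)")
# Index the column that the WHERE clause filters on.
conn.execute("CREATE INDEX idx_genes_symbol ON genes(symbol)")
conn.executemany(
    "INSERT INTO genes VALUES (?, ?)",
    [("TP53", "chr17"), ("BRCA1", "chr17"), ("MYC", "chr8"), ("EGFR", "chr7")],
)

wanted = ["TP53", "MYC"]
# One "?" per value keeps the query parameterized.
placeholders = ", ".join("?" for _ in wanted)
sql = f"SELECT symbol, chrom FROM genes WHERE symbol IN ({placeholders}) ORDER BY symbol"
hits = conn.execute(sql, wanted).fetchall()
print(hits)
```

Whether IN actually beats a chain of ORs depends on the engine and its optimiser, so it is worth timing both on your own data.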

I find databases suitable if you have to look up somewhat related data, but they become very slow if you have a lot of connected information that you want to retrieve. Can you split your data into several tables (think: Excel spreadsheets) and then get what you want without going back and forth between the tables? If the answer is 'yes', then databases are suitable for your cause. If the answer is 'no', then you are better off with a programmatic in-memory solution (use linked lists, hash tables, ...).

Does your data revolve around a "main" table? If you do have a main table and your other tables refer to the primary key of said main table by foreign key, then you have a "star schema". You can then use software like BioMart ( http://www.biomart.org ) to turn your star schema into a "reverse star", which means that your queries will return results much more quickly. The transformation is done fully automatically with BioMart; you also get a nice web interface to your data, a RESTful interface for querying your db via XML, and a rudimentary RESTful SPARQL interface (next BioMart release, release candidate 6).

Pros:

  • persistent storage
  • query optimisation by the database engine
  • standard query language to retrieve data (rather than using your own libs...)
  • data structure optimisations when using BioMart + nice programmatic interfaces

Cons:

  • slow queries if your data is too interconnected
  • uses more disk-space if you create many indexes (well, not a problem with your 2GB data)

Databases you might want to look at:

  • MySQL with a MyISAM storage engine is probably the easiest-to-use professional database, because it does not force you to use primary/foreign keys. You can switch to the InnoDB storage engine later for a performance increase (even though it is not quite clear whether it is really faster), but then you need to start defining primary keys.
  • PostgreSQL is nice too, because it does not let you get away with sloppy schema definitions. It is very similar to MySQL with an InnoDB storage engine.

Last, but not least, I personally do not believe in ORMs. These mappings can be the reason for bad performance, even though this does not have to be the case. Keeping it simple, I would always access the database with SQL instead.


Joachim

Here is an ORM that works with Sqlite https://www.kellermansoftware.com/p-47-net-data-access-layer.aspx

13.7 years ago
Melanie ▴ 660

The size you quote (1 million rows) should be no problem for a relational database in and of itself. The problem will come with the details of how the tables are designed. If you are going to use a relational database, do yourself a huge favor and spend a little bit of time reading up on relational database design. I'm too deep in to really be the right person to ask for a good overview book, but I think the O'Reilly book "Database in Depth" by C.J. Date (one of the giants in the relational field) is a good place to start.

Database design is not terribly complicated, but people do seem to make a lot of rookie mistakes.

13.7 years ago
Casbon ★ 3.3k

There is a lot of FUD about databases in bioinformatics. Sqlite can be faster than MySQL/Postgres and hand-rolled indexes, and it doesn't require a server - use it first.

Check out this thread for a discussion.

13.7 years ago

Sometimes what you need is not a database but something completely different.

The Filesystem

No locks. No logs. Operations are not atomic. Probably the fastest on disk storage management imaginable.

Redis

When working with large datasets in memory I had excellent success using a data structure server.

See Redis - http://redis.io

I was able to count all 7-mers that are observable in Uniprot (14 million sequences, 4.5 billion peptides). In a single R instance running for a few days I got all counts for all the 6-mers. The same approach would take months for the 7-mers. With Redis, I was able to do the 7-mers in one day by running 100 R scripts in parallel, all accessing the same memory. The memory was managed by one single-threaded process - a Redis server.
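For readers unfamiliar with the task, here is what the per-worker counting step looks like in plain Python. This is only a single-process sketch with a local `Counter`; it does not use Redis. In the setup described above, each parallel worker would presumably send its increments to the shared Redis server instead of a local dictionary. The example peptides are made up.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping k-mer across the given sequences."""
    counts = Counter()
    for seq in sequences:
        # A sequence shorter than k yields an empty range and is skipped.
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

# Hypothetical toy input; real input would be streamed from a FASTA file.
peptides = ["MKVLAA", "KVLAAG"]
kmer_counts = count_kmers(peptides, 3)
print(kmer_counts.most_common(3))
```

The local version hits a memory wall once the number of distinct k-mers outgrows one process, which is the problem the shared server addressed.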

NetCDF

For GWAS, at a Bioconductor workshop we used NetCDF to work on multidimensional arrays of SNPs and individuals.

13.7 years ago
Ketil 4.1k

I think relational databases work reasonably well when you access data relatively infrequently and unpredictably, and where data is regularly updated or modified. A typical example is a web server providing data access and updates. Also, the data itself should have a relatively simple structure, easily tabulated.

For large-scale analysis, you typically need read-only access to large data sets, and you often need to look at all the data, and often combining sources in ways that are hard to predict in advance (that is, your indexes are likely to be wrong). In that case, you are probably better off just parsing standard files.

13.7 years ago

Another thing to keep in mind: databases are really good at managing many transactions in parallel and keeping all operations atomic.

The following are present in SQLite, MySQL, and PostgreSQL and impose a significant speed overhead:

  • Lock management
  • Transaction logs
  • Ability to roll back at any moment

These features are completely useless for some computational tasks.
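If you do use SQLite for scratch computations, some of this overhead can be switched off. The sketch below uses two real SQLite pragmas (`journal_mode` and `synchronous`) from Python; the table and file name are hypothetical. It assumes the data can be rebuilt from scratch, because with these settings a crash during a write can corrupt the file.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "scratch.db")
conn = sqlite3.connect(db_path)

# Trade safety for speed: no rollback journal, no waiting for disk syncs.
# Only sensible for data you can regenerate.
journal_mode = conn.execute("PRAGMA journal_mode = OFF").fetchone()[0]
conn.execute("PRAGMA synchronous = OFF")

conn.execute("CREATE TABLE counts (kmer TEXT PRIMARY KEY, n INTEGER)")
with conn:  # a single transaction around the whole batch
    conn.executemany("INSERT INTO counts VALUES (?, ?)", [("AAA", 3), ("AAC", 1)])

total = conn.execute("SELECT SUM(n) FROM counts").fetchone()[0]
print(journal_mode, total)
conn.close()
```

Grouping many writes into one transaction, as the `with conn:` block does, usually matters at least as much as the pragmas.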


MySQL's MyISAM is a non-transactional storage engine and could be used if InnoDB seems to perform slowly due to transaction logs.

13.7 years ago
Gareth Palidwor ★ 1.6k
  • I use MySQL for back ends to web interfaces.
  • I use Berkeley DB backed perl tied hashes for caches in web apps and sometimes in scripts to speed things up.
  • I use Lucene for text searching. Most standard tools (Berkeley DB, Relational databases, etc) are not particularly good at handling large amounts of text. I've found Apache Lucene very effective for searching a very large (multi-GB) text corpus; for example near-instantaneous counts of term usage (http://www.ogic.ca/mltrends/) across the entire medline database.
11.8 years ago
kwete90 • 0

Sequencegeek

Here is an ORM that works with Sqlite https://www.kellermansoftware.com/p-47-net-data-access-layer.aspx
