Hi all!
I'd like to gather some opinions on the use of databases, since I noticed that many of the people who answer questions here (@Pierre) use them. Specifically, I'm trying to set up a SQLite ORM for some of my Python scripts, but I'm worried about speed and was wondering if anyone has tips for working with datasets of > 1 million rows (~2 GB flat-file size).
So here we go:
1) Which types of tasks do you employ your databases for (examples welcome)?
2) Are there tasks where you've found they do NOT work at all because of speed or memory?
It seems that it would be beneficial to use a database where possible, because it makes selecting the data needed for each step of a project very easy (not to mention that every field is already cast to the correct type). Instead of building hash tables and cross-referencing them for every bit of data, a SELECT statement does the work. If you prefer not to use databases for selecting your data, how do you do it instead?
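For concreteness, here's the kind of pattern I have in mind, using Python's built-in sqlite3 module. The table, columns, and file names are all made up for illustration:

```python
import sqlite3

# Hypothetical schema for illustration -- the table, columns, and file
# names are invented, not from a real project.
conn = sqlite3.connect("reads.db")
conn.execute("""CREATE TABLE IF NOT EXISTS reads
                (read_id TEXT, chrom TEXT, pos INTEGER, score REAL)""")

# Bulk-load inside one transaction: committing per row is what kills
# speed on million-row loads; executemany in a single transaction doesn't.
with open("reads.tsv") as fh, conn:
    rows = (line.rstrip("\n").split("\t") for line in fh)  # 4 columns assumed
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", rows)

# Index the columns you filter on *after* loading, so lookups stay fast.
with conn:
    conn.execute("CREATE INDEX IF NOT EXISTS idx_chrom_pos ON reads (chrom, pos)")

# The SELECT replaces the hand-rolled hash tables and cross-referencing.
for read_id, score in conn.execute(
        "SELECT read_id, score FROM reads WHERE chrom = ? AND pos BETWEEN ? AND ?",
        ("chr1", 10000, 20000)):
    print(read_id, score)

conn.close()
```

As far as I can tell, the single transaction around the bulk insert and the post-load index are the standard tricks for keeping million-row loads and lookups fast.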
Thanks
EDIT:
Thanks everyone! This is exactly what I was hoping for - a survey of all the options and some tips on how to proceed. I've spent a lot of time looking at Python bindings for BerkeleyDB and how to use it as a key/value store. At the end of the day I think I'll try to build two solutions: one based on BerkeleyDB and the other on SQLite (I need to learn it anyhow). A lot of the programs need to be able to run in parallel, so it might be a little tricky...
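For the key/value route, here's a rough sketch of what I'm picturing, using the stdlib dbm module (the BerkeleyDB Python bindings expose a similar dict-like interface); the gene IDs and fields are invented:

```python
import dbm
import json

# Keys and values must be bytes, so serialize records, e.g. as JSON.
with dbm.open("annotations.db", "c") as db:  # 'c' = create if missing
    db[b"ENSG00000139618"] = json.dumps(
        {"symbol": "BRCA2", "chrom": "13"}).encode()

# Reopen read-only and look a record back up by key.
with dbm.open("annotations.db", "r") as db:
    record = json.loads(db[b"ENSG00000139618"].decode())
    print(record["symbol"])
```

On the parallel point: as far as I understand, SQLite allows many concurrent readers but only one writer at a time, so if several processes need to write simultaneously a client/server database might be the safer choice.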
Perhaps you could ask your additional edit as a second question? Otherwise everyone has to go back and edit their answers accordingly.
Your targeted RNA data is actually not that large (assuming each entry is not a whole transcriptome). All operations will be so fast that the database you choose will not make a difference in performance.
+1 @Michael Barton: Agree.