This is a follow-up to one of those points, namely the data management issue. Could someone give me good arguments for why it is better to switch to database-based data management? What would it let me do that I could not do by just keeping everything in files?
Well, you will have to define more clearly what you would want to store in such a database. Generally speaking, databases are good for relational data.
Also, I don't think it makes sense to exclusively use either.
Perhaps not a strictly bioinformatics-related question, this, but it is so tightly connected to what we need to do in bioinformatics that I still consider it appropriate for this forum. Please let me know if I am wrong about this.
You posed and answered your question in the same sentence: "not a strictly bioinformatics-related question" yet "so tightly connected to what we need to do in bioinformatics". I consider activities connected to bioinformatics to be the subject of bioinformatics questions. So, I think this question is completely appropriate here.
As to the question itself, without a database, what method would you suggest for making queries across all your projects? A script that scans directories and reads standardized flat files? The "management" part implies the ability to gain and navigate some kind of overview. I employ both methods due to a generally un-directed and historically messy design process, but seldom hear about how to "manage" data overviews without a database.
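For what it's worth, the "script that scans directories" approach can be sketched roughly like this. It is only an illustration under made-up assumptions: every project folder is assumed to hold a `samples.csv` with `sample_id` and `tissue` columns, and the folder names `projA`/`projB` are hypothetical demo data, not anything from a real setup.

```python
import csv
import tempfile
from pathlib import Path


def scan_projects(root, filename="samples.csv"):
    """Yield one dict per sample row from every <root>/<project>/<filename>."""
    for path in sorted(Path(root).glob(f"*/{filename}")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                # Remember which project folder the row came from.
                row["project"] = path.parent.name
                yield row


def make_demo_tree(root):
    """Create two tiny hypothetical projects so the sketch runs on its own."""
    demo = {
        "projA": [("s1", "liver"), ("s2", "brain")],
        "projB": [("s3", "liver")],
    }
    for project, samples in demo.items():
        folder = Path(root) / project
        folder.mkdir(parents=True)
        with (folder / "samples.csv").open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["sample_id", "tissue"])
            writer.writerows(samples)


def liver_samples(root):
    """The 'query': all liver samples across all projects."""
    return [
        (row["project"], row["sample_id"])
        for row in scan_projects(root)
        if row["tissue"] == "liver"
    ]


with tempfile.TemporaryDirectory() as tmp:
    make_demo_tree(tmp)
    result = liver_samples(tmp)
```

This works only as long as every project really follows the same file layout and column names, which is the "standardized" part that tends to erode over time.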
A friend of mine made a suggestion to me just earlier today, that if I'm talking about a rare query that I'm interested in doing across projects, then it might be better to just stick with flat files. He was suggesting that maintaining a database with all its hassles might be a bit excessive for what I would need it for.
Pros:
Tools allow relatively easy construction of GUIs from database models.
Most programming languages have drivers for RDBMSs, so you have one data model and can query/report/update with R, Java, Python etc.
All data is in one place (compared to data spread over folders). This allows you to integrate data across experiments to check for systematic trends, e.g. for quality control.
Changes to the data model are applied consistently across all the data. This is a little more difficult than an ad hoc change within the current project, but the consistency pays off.
Cons:
Some data structures are more difficult to represent in a relational database, e.g. trees.
It needs some thought (and experience) at the beginning to implement well.
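To illustrate the point about trees: one common way to store a tree relationally is an adjacency list (each row points at its parent), queried with a recursive common table expression. The sketch below uses Python's built-in `sqlite3` module; the `taxon` table, its columns, and the example rows are hypothetical and only there to make it runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Adjacency list: every node stores the id of its parent (NULL for the root).
conn.execute(
    "CREATE TABLE taxon ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "parent_id INTEGER REFERENCES taxon(id))"
)
conn.executemany(
    "INSERT INTO taxon (id, name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "Mammalia", None),
        (2, "Primates", 1),
        (3, "Rodentia", 1),
        (4, "Homo sapiens", 2),
        (5, "Mus musculus", 3),
    ],
)


def descendants(conn, root_name):
    """Return the names of all nodes below root_name, sorted alphabetically."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id, name) AS (
            SELECT id, name FROM taxon WHERE name = ?
            UNION ALL
            SELECT t.id, t.name
            FROM taxon AS t
            JOIN subtree AS s ON t.parent_id = s.id
        )
        SELECT name FROM subtree WHERE name != ? ORDER BY name
        """,
        (root_name, root_name),
    ).fetchall()
    return [name for (name,) in rows]


primates = descendants(conn, "Primates")
mammals = descendants(conn, "Mammalia")
```

It is doable, but compared with walking a tree object in memory, the recursive query is noticeably less obvious, which is roughly what "more difficult" means here.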
For most of my analysis projects I also have a sample CSV table. But as the project grows I start to feel the pain (mostly data inconsistencies). We also have some RDBMSs for the real stuff of course, but the additional data that I have for the individual projects (additional sample annotation from the researcher) is not entered into the RDBMSs, because it has no place there. I do query the RDBMSs to check for consistency, though.
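A consistency check of that kind might look something like the following. This is only a sketch with invented names: an in-memory SQLite database stands in for the central RDBMS, and the `sample` table, the `organism` column, and the CSV contents are all assumptions for the sake of the example.

```python
import csv
import io
import sqlite3

# Stand-in for the central database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (sample_id TEXT PRIMARY KEY, organism TEXT)")
conn.executemany(
    "INSERT INTO sample (sample_id, organism) VALUES (?, ?)",
    [("s1", "human"), ("s2", "mouse")],
)

# Stand-in for the per-project sample CSV with extra annotation.
project_csv = io.StringIO(
    "sample_id,organism,treatment\n"
    "s1,human,control\n"
    "s2,human,drug\n"
    "s9,mouse,drug\n"
)


def check_annotations(conn, handle):
    """Compare CSV rows against the database and list every disagreement."""
    known = dict(conn.execute("SELECT sample_id, organism FROM sample"))
    problems = []
    for row in csv.DictReader(handle):
        sample_id = row["sample_id"]
        if sample_id not in known:
            problems.append((sample_id, "missing from database"))
        elif known[sample_id] != row["organism"]:
            problems.append((sample_id, "organism mismatch"))
    return problems


problems = check_annotations(conn, project_csv)
```

The project-specific columns (here `treatment`) stay in the CSV; only the fields shared with the database are cross-checked.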
Be careful not to think in terms of (false) choices. Storing data in databases does not preclude you from also keeping them around in flat files.
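As a small illustration of keeping both: a table can be dumped back to a flat file whenever one is needed. The sketch assumes a hypothetical `sample` table in SQLite and writes CSV to any file-like object; the table name is interpolated into the SQL, so it must come from trusted code, not from user input.

```python
import csv
import io
import sqlite3


def export_table_csv(conn, table, handle):
    """Write all rows of `table` to `handle` as CSV, header line first."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(handle)
    # cursor.description holds one tuple per column; item 0 is the column name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (sample_id TEXT PRIMARY KEY, tissue TEXT)")
conn.executemany(
    "INSERT INTO sample (sample_id, tissue) VALUES (?, ?)",
    [("s1", "liver"), ("s2", "brain")],
)

buffer = io.StringIO()
export_table_csv(conn, "sample", buffer)
exported_lines = buffer.getvalue().splitlines()
```

In practice `handle` would be an opened file (with `newline=""`) rather than a `StringIO` buffer.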
Databases are designed to represent and query information stored in a predetermined format. They work best when used in a specialized context and for solving a well-defined use case.
In fact you would probably need to create different databases for different use cases.
The more "unified" and "global" your database, the more difficult and untenable the task of creating and maintaining it becomes.
http://stackoverflow.com/questions/2356851 "database vs. flat files" ; http://stackoverflow.com/questions/6853482 "Flat file vs database - speed?"; etc...