Hello,
We want to implement a multigenomic database. Our goal is to store information from multiple Enterobacteriaceae (E. coli, Salmonella, Shigella, etc.). The information relates to their transcriptional regulation (genes, products, transcription factors, promoters, transcription units, ...), and the plan is to store it in MongoDB.
What would be a correct approach to fulfill this requirement?
So far, we have seen three different solutions:
1. The first is to store all organisms' data in a single database, with one collection per entity type: collectionGene, collectionProduct, collectionTranscriptionFactor...
With this solution, each collection holds one entity type for all organisms; for example, a single gene collection contains the genes of every organism.
2. The second is also to store all organisms' data in a single database, but with one collection per bacterium and entity type: ecoli.collectionGene, salmonella.collectionGene, shigella.collectionGene...
With this solution, all organisms share one database but are separated by collection: one gene collection for E. coli, one gene collection for Salmonella, and so on.
3. The third and last solution is to store one organism per database. For this example we would have three MongoDB databases: one for E. coli, one for Salmonella and one for Shigella. The collections would be based on the entity type, as in the first solution, but each would hold a single organism's information.
With this solution, each organism's information is isolated, but maintenance may require more resources.
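As a rough sketch of how the three layouts differ in practice, the helper below maps an organism and an entity type to the database and collection that would hold it. The database name "multigenomic" and the function name `locate` are made up for illustration; only the collection naming follows the examples above.

```python
from typing import NamedTuple


class Location(NamedTuple):
    database: str
    collection: str
    needs_organism_field: bool  # True if each document must say which organism it belongs to


def locate(layout: int, organism: str, entity: str) -> Location:
    """Return where an entity of a given organism would live under layout 1, 2 or 3."""
    if layout == 1:
        # One database, one collection per entity; organisms are told apart by a field.
        return Location("multigenomic", f"collection{entity}", True)
    if layout == 2:
        # One database, one collection per (organism, entity) pair.
        return Location("multigenomic", f"{organism}.collection{entity}", False)
    if layout == 3:
        # One database per organism, one collection per entity.
        return Location(organism, f"collection{entity}", False)
    raise ValueError(f"unknown layout: {layout}")


for layout in (1, 2, 3):
    print(layout, locate(layout, "ecoli", "Gene"))
```

Under the first layout every query for a single organism has to filter on an organism field, while the other two encode the organism in the collection or database name.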
Final thoughts
Perhaps the first or second solution would be better. So far we have E. coli's regulation data in a relational database, and it is no more than 500 MB. The databases seem small, but we will store every data release version, so each organism might end up taking more than 10 GB.
Of these solutions, the last one will require more resources to maintain but allows each database to be scaled independently. Then again, one database per organism may be overkill...
What are your thoughts on a scenario like this? I would like to read your opinions.
Thanks for your time...
Your data is already in a database. The size is negligible (by database standards) and you are already familiar with it. Why are you moving to MongoDB?
I forgot to mention that it will end up containing every released version; so far, for E. coli, we have more than 10 GB. We are moving to MongoDB because we need a schema-less database: we can't control the model of each entity, and there are cases where the model changes.
The benefit of MongoDB is that it's schema-less, but that is also its flaw. You can store anything in it, but you will eventually need to retrieve those contents. If you have a field in a relational database, you can assume certain things about it. For example, you may have seq_length, and it can't be NULL and it must be an integer. When you query for the average seq_length, you have some confidence that you are actually getting the average length across all records. With something like MongoDB, you can't guarantee that, so you will need to write additional code to perform those sanity checks.
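To make the point about sanity checks concrete, here is a minimal sketch of the extra code involved, working on plain Python dictionaries rather than a live MongoDB collection. The function name, the gene names and the lengths are all invented for illustration.

```python
from typing import Any, Iterable, Mapping, Optional, Tuple


def average_seq_length(docs: Iterable[Mapping[str, Any]]) -> Tuple[Optional[float], int]:
    """Average seq_length over documents where it is a non-negative integer.

    Returns (mean, skipped), where skipped counts documents whose seq_length
    is missing, None, or not an integer.
    """
    valid = []
    skipped = 0
    for doc in docs:
        value = doc.get("seq_length")
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
            valid.append(value)
        else:
            skipped += 1
    mean = sum(valid) / len(valid) if valid else None
    return mean, skipped


# Hypothetical documents: two well-formed, one with a string, one without the field.
genes = [
    {"name": "geneA", "seq_length": 900},
    {"name": "geneB", "seq_length": "3000"},
    {"name": "geneC"},
    {"name": "geneD", "seq_length": 700},
]

mean, skipped = average_seq_length(genes)
print(mean, skipped)  # 800.0 2
```

A relational column declared as a non-NULL integer would have rejected the second and third records at insert time, so the skipped count would never be needed.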
Another benefit of MongoDB is native partitioning/sharding. If your data size is on the order of GBs (can fit on a single machine), that's not really a concern.
But doesn't schema validation cover the requirement for a valid schema?
https://docs.mongodb.com/manual/core/schema-validation/
Perhaps I'm missing something, since I have only read about it and not applied it.
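For reference, a validator of the kind described on that page might look roughly like the following, written as a Python dictionary. The field names (organism, name, seq_length) are assumptions for illustration, not our actual model.

```python
# A $jsonSchema validator document, as it could be attached to a gene collection.
gene_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["organism", "name", "seq_length"],
        "properties": {
            "organism": {"bsonType": "string"},
            "name": {"bsonType": "string"},
            "seq_length": {"bsonType": "int", "minimum": 0},
        },
    }
}

# Assumption: with PyMongo, this dictionary would be supplied when creating the
# collection, for example:
#   db.create_collection("collectionGene", validator=gene_validator)
# That call needs a running MongoDB server, so it is left as a comment here.

print(gene_validator["$jsonSchema"]["required"])
```

Fields that are not listed under "properties" are still accepted unless "additionalProperties" is set to false, so the model could keep changing while a few core fields stay checked.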
Thanks for all your help
That's a relatively new feature. Thanks for pointing it out. I haven't been following MongoDB very closely.
It seems odd to have schema validation in a database that is schema-less.
My recommendation is to not over-optimize unless this is a personal project for learning purposes.
There are three main considerations here all related to scale:
How many concurrent users do you expect to be using this database?
Is this predominantly a read-only database where users query for info?
How large do you expect this database to get?
If you are expecting dozens of concurrent users, a predominantly read-only database, and a size under 10 GB, I wouldn't even bother with a database. Load it all as a Python dictionary and serve the data with Flask.
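A minimal sketch of the in-memory part of that idea, assuming the data can be exported as a JSON array of records with organism and name fields (both the field names and the sample records are hypothetical). The Flask layer is left out to keep the example dependency-free; a route would simply return the result of `lookup`.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

# Stand-in for the contents of an exported JSON file.
RAW = """
[
  {"organism": "ecoli", "name": "geneA", "seq_length": 900},
  {"organism": "salmonella", "name": "geneA", "seq_length": 880}
]
"""

Key = Tuple[str, str]


def build_index(records: List[Dict[str, Any]]) -> Dict[Key, Dict[str, Any]]:
    """Index records by (organism, name) so each lookup is one dictionary access."""
    return {(r["organism"], r["name"]): r for r in records}


def lookup(index: Dict[Key, Dict[str, Any]], organism: str, name: str) -> Optional[Dict[str, Any]]:
    """Return the matching record, or None if it is not present."""
    return index.get((organism, name))


INDEX = build_index(json.loads(RAW))
print(lookup(INDEX, "ecoli", "geneA"))
```

The index is built once at startup and only read afterwards, which suits a read-only workload.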
I have already edited the post, since I forgot to mention that it might end up being more than 10 GB. We also want to share the database so that users can load it into their local repository.
MongoDB is just a database engine. It stores binary-encoded JSON underneath. It integrates well with Express.js web applications and related middleware. You may or may not need JSON. You may or may not be writing a web app. If you do, or if you are, and if your data are naturally JSON structured, then you might look into MongoDB. If not, look into an indexed SQL database like PostgreSQL, SQLite or MariaDB (MySQL).
Thank you for your reply. Yes, our data is naturally JSON structured.