Question

Forum:Data Base management system for data lake: Cassandra vs MongoDB vs Apache HBase

1

Entering edit mode

2.9 years ago

ResearchR ▴ 120

We are currently setting up a private cloud in our research department. Our goal is to improve data handling, organization and analysis of our various data types and structures (in total approx. 350TB). Now I have to choose a data base management system, which is capable of handling a data lake efficiently. Since I am not an expert in data bases, I would really appreciate some constructive comments on my thoughts. Here we go:

Given that we deal with semi- or rather unstructured data, I will use a NoSQL data base
I would choose between MongoDB, Cassandra or Apache HBase. With MongoDB I can enjoy a JSON-based document structure, although MapReduce implementations still remain a slow process (deprecated from version 5 onwards!) and memory hogging is still an issue. MongoDB stores data in chuncks of 64MB. I am bit worried about the performance when feeding in data-points of 100GB to 150GB size. Just like HBase, replication is based on a Master-Slave principle, which could be error prone in an event of server failure. Cassandra on the other hand stores data in columns and rows with a SQL-like syntax and a Masterless-Ring replication. With HBase, I cloud rely on HDFS, Zookeeper and rest of the Apache-Crew.

In general I have a tendency to use MongoDB, but I am not 100% convinced. What would you say?

Thanks in advance for some critical comments!

Cheers guys!

HBase Cassandra MongoDB Data-Lake • 2.0k views

ADD COMMENT • link updated 2.9 years ago by Carambakaracho ★ 3.3k • written 2.9 years ago by ResearchR ▴ 120

1

Entering edit mode

For this forum, I guess you'd need to connect a bit more to the bioinformatics part. In general I'm not sure you target the best community here, though there's undoubtfully one or the other nosql database specialist here, you might be able to find many more elsewhere.

For example I have no more than some beginners experience with nosql databases (specifically MongoDB). However, I tend to argue that most data in bioinformatics is actually quite structured and fits relatively well in a SQL database (called variants, genome annotation like genes, transcripts and proteins, etc). The most unstructured part often is the metadata for sample specimens and even that can be structured with comparably low effort. In my opinion the only gain I get from a nosql database is to save me from schema updates.

What are the "datapoints of 100GB to 150GB size" you want to feed in? JSON or binary BAM files?

To protect for server failure, you can set up a sharded cluster with replica sets, so you don't have to worry too much about failures.

ADD REPLY • link 2.9 years ago by Carambakaracho ★ 3.3k

0

Entering edit mode

Thanks for the reply and your opinion. There reason why I am referring to a data lake is because of our data variety. We process genome files (BAM), MRT- as well as PET-scans, histological images and lab reports. That is also why I think noSQL is appropriate. In fact, I had genome files (BAM) in mind when I was writing about data points of 100GB to 150GB in size. Concerning your suggestion with shared cluster with replica: I also found this when studying MongoDB a bit more in detail. HBase with HDFS also creates replica sets while writing data in a shared system (if I got the correctly). I just wanted to hear an opinion, which seems more appropriate for my usecase.

ADD REPLY • link 2.9 years ago by ResearchR ▴ 120

0

Entering edit mode

Your task and location almost make me believe we’re working in the same company. As Istvan Albert stated below, I’d recommend to go with links to your data lake, too.

Also I’m pretty sure you’ll end up with performance issues when mangling big blobs and documents in the same data store, see for example this anecdotal SO question. Maybe the CephFS mentioned there could be useful to you, too.

ADD REPLY • link 2.9 years ago by Carambakaracho ★ 3.3k

1

Entering edit mode

Maybe the CephFS mentioned there could be useful to you, too.

I am already using CephFS for a different project and I am very happy with its performance. When it comes to distributed FS, MooseFS is also an option. This is actually a very good idea, I cloud try to couple CephFS/ MooseFS with my data base management of choice. Let's see how to implement this. Thanks!

ADD REPLY • link 2.9 years ago by ResearchR ▴ 120

score 1 · Answer 1 · 2022-01-10

1

Entering edit mode

2.9 years ago

Istvan Albert 101k

Every time I explored using MongoDB (or noSQL) for that matter, I have regretted it.

BAM files should remain BAM files, there is no point in putting them into a database. Same with images etc.

Metadata on the other hand, even if it is JSON could easily be stored in Postgres, then you have the option of factoring out any field into a real, queryable data structure later.

But if you are not storing the data, only the metadata, your data size is most certainly a tiny fraction of the 350TB

ADD COMMENT • link 2.9 years ago by Istvan Albert 101k

0

Entering edit mode

BAM files should remain BAM files, there is no point in putting them into a database. Same with images etc.

Valid point. I guess you suggest to store the file path in the data base and not the actual data. The only concern I have is: once we have to migrate data or run any other process affecting the file storage (permission rights, change in path or broken links etc), this could cause some major headache.

But if you are not storing the data, only the metadata, your data size is most certainly a tiny fraction of the 350TB

Absolutely correct. If I focus only on the meta data, the size is comparatively negligible.

ADD REPLY • link 2.9 years ago by ResearchR ▴ 120

0

Entering edit mode

yes, it is a valid point that having data represented in two different ways can lead to complexities

on the other hand, all analyses expect a BAM file as a physical file on the disk. Stuffing a BAM file into a database is of little use, you will have to extract the whole thing, moreover none of these databases were designed to shuffle gigantic files in and out. You may tag your BAM file, add more information to it etc.

Pushing databases into uses-cases they were not designed for will lead to far bigger headaches later down the road.

ADD REPLY • link 2.9 years ago by Istvan Albert 101k