Data Too Big To Be Stored In Memory: Common Options
12.0 years ago

I am programming something where I do not need any data storage more sophisticated than Python's dicts or sets. However, as my data is too big to be stored in memory, I have to use something else.

I tried using SQLite, but heard that it is slow for large datasets (> 10 GB) and that NoSQL would be better.

What options do you commonly use to work with data that is too large to fit in memory and why? Any standard tools in bioinformatics?

Edit: Perhaps answers could include a little about when the suggested approach is a good fit and when it isn't?

(PS: I know this isn't directly bioinformatics related, but I'm sure it is a problem many here have struggled with when working in bioinformatics.)

python • 27k views

I'll second Micans' comment here. If you tell us what you are trying to do, we might be able to help more. There is no general approach that fits all big data problems.


I just wanted a more general list of techniques. I hope such a thread is OK? Now I have learnt much and know of many ways of working around problems related to working with large datasets.


That is ok. But I still think you will learn more by asking a specific question. What is important is not to know a list of software/methods, but to know which to use in a specific case.


Your question is underpowered. A more specific description will enable more specific answers.


I don't know! Can you tell me more detail?

12.0 years ago
  1. Make sure you are using the correct tool for the job, e.g. don't use BLAST for mapping short reads.
  2. Avoid databases - they are an absolute last resort for analytical work.
  3. Pre-order your data and process it in chunks that will fit in memory.
  4. If 3. is a problem -> think harder :o)
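
A minimal sketch of point 3 in Python, assuming a tab-separated input that has already been sorted by its first column (for example with Unix `sort`). The two-column layout and the name `sum_by_key` are made up for illustration.

```python
import io
from itertools import groupby


def sum_by_key(lines):
    """Yield (key, total) pairs from lines that are sorted by key.

    Rows are consumed lazily, so only the current key's running total
    is held in memory, never the whole file.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        yield key, sum(int(row[1]) for row in group)


# Hypothetical, already sorted input; in practice this would be an open file.
sorted_input = io.StringIO("chr1\t5\nchr1\t7\nchr2\t1\nchr2\t2\nchr2\t3\n")
totals = list(sum_by_key(sorted_input))
```

Because the input is sorted, all rows for a key are adjacent, which is what makes a single pass with constant memory possible.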

I agree. I work with large data sets, but the only time I used a database was for TreeFam, a true database. There are frequently better solutions. If OP could provide more information, this discussion would be more to the point. Btw, on NoSQL, at least MongoDB is fast only if the data fits in memory. Its performance is much worse than SQL when your data is too large for that. I do not know about others, but I would be surprised if NoSQL in general had much better on-disk performance than SQL.

12.0 years ago
vaskin90 ▴ 290

If you continue using databases even after the answers above, my advice might be helpful.

Database performance depends very much on the problem that you're trying to solve. It is a common misconception that SQLite performs badly. It only looks slow out of the box because its default settings favor reliability. If you switch off journaling, use indexes, use transactions properly (try to do many operations in one transaction), and cache prepared queries, you will end up with performance like that shown here (http://www.sqlite.org/speed.html).
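
A rough sketch of those settings using Python's standard `sqlite3` module. The `reads` table, its columns, and the sample records are hypothetical, and turning off the journal and synchronous writes trades crash safety for speed, so it only makes sense for data you can rebuild.

```python
import sqlite3

# In-memory database for the sketch; a real run would pass a file path.
conn = sqlite3.connect(":memory:")

# Trade durability for speed: no rollback journal, no fsync on commit.
conn.execute("PRAGMA journal_mode = OFF")
conn.execute("PRAGMA synchronous = OFF")

conn.execute("CREATE TABLE reads (name TEXT, chrom TEXT, pos INTEGER)")

# Hypothetical records; a generator keeps memory use flat for large inputs.
records = (("read%d" % i, "chr1", i * 10) for i in range(10000))

# One transaction for the whole batch instead of one per INSERT.
with conn:
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?)", records)

# Index the columns used in WHERE clauses, built after the bulk load.
conn.execute("CREATE INDEX idx_reads_pos ON reads (chrom, pos)")

# Parameterized SQL lets sqlite3 reuse the compiled statement on repeat calls.
hits = conn.execute(
    "SELECT name FROM reads WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
    ("chr1", 100, 130),
).fetchall()
```

Building the index after the bulk insert is usually cheaper than maintaining it row by row during the load.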

We use SQLite in our bioinformatics project and were able to process more than 80 GB of NGS data in minutes with it.

NoSQL would be better (in a general case) when you need scalability and sophisticated queries. When you just need to store your data and do simple queries then SQLite is better.

From our experience, the most difficult case for SQLite optimization is when you have many small records (for instance SNPs) and you need to iterate over the set a lot. In that case it is difficult to cache or optimize anything. But if you have bigger records (like sequences), then SQLite with its indexes is perfect.


Can you provide more details on the way you store NGS data in SQLite db?


Sure,

here is the project: ugene.unipro.ru. It's open source.

Our goal was to develop our Assembly Browser. We tried different techniques for storing reads in an SQLite db: a naive approach with a simple index, and multiple tables. But we ended up with a tiling technique (as in Google Maps) and an R-tree index. Each tile (a rectangle) contains a number of reads, and when you want to navigate to a specific location in your NGS data, you use the 4-dimensional R-tree index and load only the tiles with the reads that you need.

This approach has one drawback: you need to import your BAM/SAM file into our internal format (with tiles and indexes); for an 80 GB file this takes ~40 minutes. But after that you get a full coverage graph and can instantly navigate to any part of your assembly. It is almost impossible to navigate big NGS data with other programs on the market, because they use a different approach that slows them down.


From the manual, it seems that you do not need the BAM once it is imported. If so, I am not sure of the purpose of importing an entire BAM. You can collect summary information, mainly read depth, and keep it in a database or in a binary format like IGV does. This is going to be a small db/file. Detailed alignments should still be stored in BAM. I guess random access with SQLite is slower than with BAM. An SQLite db is probably much larger than a BAM, too. Asking users to replicate data in a larger format only used by ugene is a significant disadvantage.

12.0 years ago
ff.cc.cc ★ 1.3k

I struggled with the same issue a few years ago (genotype datasets of ~2 GB).

I agree with the sensible answers and tips above (avoid relational dbs, split data, refactor and so on), but I know that sometimes you can't re-engineer the problem to work with sequential streams, or you need random access.

My best option was to use the HDF5 storage engine. As they state on their site: "HDF technologies address the problems of how to organize, store, discover, access, analyze, share, and preserve data in the face of enormous growth in size and complexity".

I had to build the libs from source under Windows, but on Linux precompiled packages are available in every distro. Then I customized a data format (a bunch of structured data tables) for storing SNPs and gene expression, working in C/C++. Accessing the data is possible i) visually, through 3rd party tools like HDFView or Intel Array Visualizer, or ii) programmatically, through API calls.

Performance is INCREDIBLE: epistasis tests (like plink --fast-epi) run as fast as with in-memory bed files, and genome-wide eQTL tests on 60 CEU samples run in less than 1 hour.

The core of the code is something like this...

    class h5TPED_I {

      protected:
        typedef struct {...} T_SNP;
        typedef struct {...} T_Sample;
        typedef struct {...} T_Gene;
        typedef struct {...} T_RefSeq;

        // file metadata
        ...
        // create a new file and its data structure
        virtual bool buildStruct() = 0;

        // dependent build methods
        virtual int doDataTable() = 0;
        virtual int doSampleTable() = 0;
        virtual int doSNPTable() = 0;
        virtual int doExpressionTable() = 0;

        // setters
        virtual void setData(const std::string &table, const int row, const int col, const T_ExpType val) = 0;
        virtual void setData(const std::string &table, const int row, const int col, const T_GType val) = 0;
        virtual void setData(const std::string &table, const int row, const T_Sample &val) = 0;
        virtual void setData(const std::string &table, const int row, const T_SNP &val) = 0;
        virtual void setData(const std::string &table, const int row, const T_RefSeq &val) = 0;
        //virtual void setData(const std::string &table, const int row, const long &val) = 0;

        // getters
        virtual void getData(const std::string &table, const int row, const int col, T_ExpType &val) const = 0;
        virtual void getData(const std::string &table, const int row, const int col, T_GType &val) const = 0;
        virtual void getData(const std::string &table, const int row, T_Sample &val) const = 0;
        virtual void getData(const std::string &table, const int row, T_SNP &val) const = 0;
        virtual void getData(const std::string &table, const int row, T_RefSeq &val) const = 0;
        //virtual void getData(const std::string &table, const int row, long &val) const = 0;

        ...
        // function to build indexes
        virtual bool buildIndex();

      public:

        // Empty constructor
        h5TPED_I();

        // Constructor from existing file
        h5TPED_I(const std::string &szFilename);

        // val points to the memory buffer in which the SNP is loaded
        virtual void getSnpPtr(const int row, T_GType *&val, const std::string &table = "/SNPDataTableInv") const = 0;
        virtual void getSnpSubsetMem(const int snpInd, T_GType *val, const size_t mask_sz, const hsize_t *mask, const std::string &table) const {}

        virtual void getSamplePtr(const int sampInd, T_GType *&val, const std::string &table = "/SNPDataTable") const = 0;

        virtual void getSampleMem(const int sampInd, T_GType *val, const std::string &table = "/SNPDataTable") const = 0;

        virtual void getGxpPtr(const int row, T_ExpType *&val, const std::string &table = "/ExpDataTable") const = 0;

        // General info
        std::string filename() const { return m_filename; }

        inline unsigned numSamples() const { return m_nSamples; }
        inline unsigned numSnps() const { return m_nSnp; }
        inline unsigned numChrs() const { return m_nChr; }
        inline unsigned numGenes() const { return m_nGenes; }

        // default value for NA data
        inline T_GType NA() const { return -1; }

        ...

It's hosted on Bitbucket and is still private, since I would like to do some code cleaning, but it works fine.

If anyone is interested in working on refinements, plugins, extensions or benchmarks, please let me know.


Please edit this to make it readable. Prefix lines of code with 4 spaces.


My problem with HDF5 is that it is very difficult to query when the data has to be accessed without an index (e.g. find-by-name). The code is highly verbose too.


Yes, in my case I build an in-memory map like <rsid, hdf_index>. A 1 MB map can index many thousands of SNPs. Sorry for the verbosity. I also encountered a few issues pasting readable code into the answer.
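
The idea of a small in-memory map from name to row number can be illustrated without HDF5. The sketch below deliberately swaps HDF5 for a flat binary file of fixed-size records written with the standard `struct` module; the record layout, the rsIDs and the `lookup` helper are all hypothetical.

```python
import os
import struct
import tempfile

# Hypothetical fixed-size record: position (uint32) and genotype (int8).
RECORD = struct.Struct("<Ib")

snps = [("rs123", 1000, 0), ("rs456", 2000, 2), ("rs789", 3000, -1)]

path = os.path.join(tempfile.mkdtemp(), "snps.bin")
index = {}  # rsid -> row number; this small map is all that stays in RAM
with open(path, "wb") as out:
    for row, (rsid, pos, genotype) in enumerate(snps):
        out.write(RECORD.pack(pos, genotype))
        index[rsid] = row


def lookup(rsid):
    """Find-by-name: map the name to a row, then seek straight to it."""
    with open(path, "rb") as handle:
        handle.seek(index[rsid] * RECORD.size)
        return RECORD.unpack(handle.read(RECORD.size))


result = lookup("rs456")
```

The same pattern carries over to an HDF5 table: the dictionary stays the same and the seek is replaced by a read of the indexed row.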


I meant verbose in general (in C, you have to open/close every step of the HDF5 workflow as far as I remember)

12.0 years ago
wdiwdi ▴ 380

If you only need keyword/data storage, as you wrote, and that is your bottleneck, I recommend that you look at the Tokyo Cabinet/Kyoto Cabinet storage engines. These are probably the fastest and most powerful options for this type of storage need.
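
To show what a disk-backed key/value store looks like from Python, here is a small sketch using the standard-library `dbm` module as a stand-in. It is not Tokyo Cabinet or Kyoto Cabinet, and it will not match their speed; the file name and keys are made up.

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "kmers")

# "c" opens the store for reading and writing, creating it if missing.
with dbm.open(path, "c") as db:
    db[b"ACGT"] = b"17"  # keys and values are stored as bytes
    db[b"TTGA"] = b"3"

# Reopen read-only: the data lives on disk, not in a Python dict.
with dbm.open(path, "r") as db:
    count = int(db[b"ACGT"])
    present = b"TTGA" in db
    absent = b"GGGG" in db
```

The access pattern is the same as for a dict, which is why this family of stores is a natural replacement when a dict no longer fits in memory.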

12.0 years ago

I use BerkeleyDB (either the C or the Java version). There is a binding for Python as far as I remember.

http://www.oracle.com/technetwork/products/berkeleydb/downloads/index.html

My only problem with it is that it can be very difficult to configure when you want to tune your application (cache, transactions...).

The C package also includes a version of sqlite that uses the library.


That was what my advisor recommended; however, the bsddb module has been deprecated since Python 2.6 (http://docs.python.org/2/library/bsddb.html). As I doubt 2.7 will go out of use soon, perhaps I should try it anyway.

PS: Bindings exist for Python 3, see http://pypi.python.org/pypi/bsddb3/

12.0 years ago
William ★ 5.3k

If possible, always stream through big datasets. Often it is not necessary to keep all the data in memory all the time.

Look, for instance, at the difference between DescriptiveStatistics and SummaryStatistics in Apache Commons Math. Both compute statistics, but only SummaryStatistics will work on a big data set, because it keeps only one record of the dataset in memory at any point in time. DescriptiveStatistics quickly crashes with an out-of-memory error on big data sets.

http://commons.apache.org/math/userguide/stat.html#a1.2_Descriptive_statistics

[quote] DescriptiveStatistics maintains the input data in memory and has the capability of producing "rolling" statistics computed from a "window" consisting of the most recently added values.

SummaryStatistics does not store the input data values in memory, so the statistics included in this aggregate are limited to those that can be computed in one pass through the data without access to the full array of values. [/quote]

The best thing to do (if possible) is to refactor your code so you only have one or a limited set of records in memory at a single point in time.
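
As a Python analogue of the one-pass idea (not the Apache Commons Math implementation), the sketch below keeps a running mean and variance using Welford's method. The class name `RunningStats` and the sample values are illustrative only.

```python
class RunningStats:
    """One-pass mean and variance (Welford's method); stores no data values."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
# The tuple stands in for any iterator, e.g. values parsed from a huge file.
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.add(value)
```

Memory use is three numbers regardless of input size. Statistics that need the full data, such as an exact median, cannot be computed this way.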

12.0 years ago
KCC ★ 4.1k

To what extent can you recode your data? For instance, you might be storing data as a number when it has fewer than 256 unique values and could be stored as a single character and decoded as needed. You might be storing some values as floating point when they only have one decimal place and could be multiplied by 10 and stored as integers. I am a little out of my depth here because you are using Python. In C or C++, one has a lot of control over the size of the things one is storing.
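
In Python, the standard `array` module gives some of that control over element size. A small sketch of both recodings, with made-up values and a hypothetical five-letter alphabet:

```python
from array import array

# Hypothetical measurements with one decimal place.
values = [0.1, 12.5, 99.9, 3.0]

# Scale by 10 and store as unsigned 16-bit integers ("H") instead of
# 8-byte doubles ("d"). This only works while value * 10 fits the type.
scaled = array("H", (round(v * 10) for v in values))
as_doubles = array("d", values)

decoded = [x / 10 for x in scaled]

# A column with fewer than 256 distinct values fits in one byte per entry.
alphabet = ["A", "C", "G", "T", "N"]
code_of = {base: i for i, base in enumerate(alphabet)}
codes = array("B", (code_of[b] for b in "ACGTNACG"))
restored = "".join(alphabet[c] for c in codes)
```

A plain Python list of floats or strings carries per-object overhead on top of the payload, so the saving from a typed array is usually larger than the raw byte counts suggest.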

Also, don't neglect the possibility of buying more RAM vs. how much time it might take you to recode all this. I recently bought more RAM and my big data programming tasks have become vastly simpler.

12.0 years ago

If your purpose is to deal with fastq or fasta files in Python, I would recommend the screed module. It parses and indexes your sequence files and writes them into a DB file. You can then read that file and use it as if it were a dict() loaded into memory.

12.0 years ago

Since you use Python, you can have a look at the libraries for HDF5.

HDF5 is a binary format used in physics and other fields where there is a need to store large datasets.

There are a couple of Python libraries that let you use it more or less like an array. One is called PyTables, and the other is HDF5 for Python (h5py). PyTables is a bit more advanced, but HDF5 for Python works well too. Have a look at their documentation; both libraries are well documented.

10.7 years ago
ole.tange ★ 4.5k

Depending on your data the data structures may compress very well in RAM. If that is the case zram (https://en.wikipedia.org/wiki/Zram) can give a huge boost: Instead of swapping to disk, it swaps to a compressed RAM disk. The compression to RAM is much faster than disk I/O.

12.0 years ago
Micans ▴ 270

There are good answers already; if you can, always stream or chunk the data, avoid databases. You've given us very little information about the actual problem at hand though. Quite often the particular problem leads to particular solutions. In network analysis one could work with pruned networks, where edges are removed based on absolute weight threshold or on weight ranking. Different software packages can have very different levels of overhead. If your problem has to do with read mapping there will be a host of other considerations. In general, look for ways to reduce problem size AND look for bigger hardware. Run your software on large (but not huge) samples to get an idea of where the bottleneck lies.

Edit: If the goal is deduplication then a simple approach is to do multiple passes, each pass only collecting data that has a particular trait. For reads this could be the first base or the first two bases. A read-specific approach that I've taken is to compress the reads in memory. It should be possible to achieve about 3.5-fold compression using the simplistic 2-bit encoding of 4 bases. I've used a length-encoding approach that is able to handle Ns as well. This just shows that it is possible to improve using particular characteristics of the problem at hand, but we do not know your problem.
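
A sketch of both ideas under simplifying assumptions: reads are plain strings, the multi-pass split is on the first base, and the 2-bit packing handles only A/C/G/T (it is not the length-encoding scheme mentioned above, and reads with Ns would need something else). All function names are hypothetical.

```python
def dedup_multipass(read_source, prefixes=("A", "C", "G", "T", "N")):
    """Yield each distinct read once, making one pass per leading base.

    `read_source` is a zero-argument callable returning a fresh iterator
    over the reads (for example, one that reopens the input file), so each
    pass keeps only the reads sharing one prefix in memory.
    """
    for prefix in prefixes:
        seen = set()
        for read in read_source():
            if read.startswith(prefix) and read not in seen:
                seen.add(read)
                yield read


CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_2bit(read):
    """Pack an A/C/G/T-only read into an int, 2 bits per base.

    A leading 1 bit marks the length so that leading 'A's (code 0) survive.
    """
    packed = 1
    for base in read:
        packed = (packed << 2) | CODE[base]
    return packed


def unpack_2bit(packed):
    """Invert pack_2bit by peeling off 2 bits at a time."""
    bases = []
    while packed > 1:
        bases.append("ACGT"[packed & 3])
        packed >>= 2
    return "".join(reversed(bases))


reads = ["ACGT", "TTGA", "ACGT", "CCCC", "TTGA"]
unique = list(dedup_multipass(lambda: iter(reads)))
```

Note that the output is grouped by prefix rather than in input order, and that a Python int has its own object overhead, so a real implementation would store the packed bits in a `bytes` or `array` buffer.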

12.0 years ago
kajendiran56 ▴ 120

Some good answers here already. I once wrote a large data file but arranged data in a way that simply using grep within my script and running simultaneously on multiple threads achieved what I wanted. There are better solutions above, just thought I would add this as it has helped me before when I just wanted to avoid spending too much time.

10.7 years ago
Amos ▴ 50

SciDB might be an alternative; it is designed for some fairly cool in-database computation. It has APIs for R and Python.
