I struggled with the same issue a few years ago (genotype datasets of ~2 GB).
I agree with the sensible answers and tips above (avoid relational databases, split the data, refactor, and so on), but I know that sometimes you can't re-engineer the problem to work with sequential streams, or you simply need random access.
My best option was the HDF5 storage engine. As they state on the site: "HDF technologies address the problems of how to organize, store, discover, access, analyze, share, and preserve data in the face of enormous growth in size and complexity".
I had to build the libraries from source under Windows, but on Linux precompiled packages are available in every distro. I then designed a custom data format (a bunch of structured data tables) for storing SNPs and gene expression values, working in C/C++. Data can be accessed (i) visually, through third-party tools like HDFView or the Intel Array Visualizer, or (ii) programmatically, through API calls.
Performance is incredible: epistasis tests (like plink --fast-epistasis) run as fast as with in-memory BED files, and genome-wide eQTL tests on 60 CEU samples complete in less than an hour.
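To give a feel for what programmatic access looks like, here is a minimal sketch using the plain HDF5 C API (not my project's code). The file name, the layout of "/SNPDataTable" (rows = SNPs, columns = samples), the int genotype encoding, and the row index are assumptions for illustration, and error checking is omitted:

#include <hdf5.h>
#include <iostream>
#include <vector>

int main() {
    // Open an existing file read-only and the genotype matrix inside it.
    hid_t file = H5Fopen("genotypes.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/SNPDataTable", H5P_DEFAULT);

    // Query the dataset extent: dims[0] = SNPs, dims[1] = samples.
    hid_t fspace = H5Dget_space(dset);
    hsize_t dims[2];
    H5Sget_simple_extent_dims(fspace, dims, NULL);

    // Select a single row (one SNP across all samples) as a hyperslab:
    // this is the random-access part, no need to scan the whole matrix.
    hsize_t start[2] = {42, 0};
    hsize_t count[2] = {1, dims[1]};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    // Read the selected row into a plain in-memory buffer.
    hid_t mspace = H5Screate_simple(1, &dims[1], NULL);
    std::vector<int> row(dims[1]);
    H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, row.data());

    std::cout << "genotype of SNP 42, sample 0: " << row[0] << "\n";

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}

The hyperslab selection is what gives you the random access: only the requested row is read from disk, no matter how large the matrix is.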
The core of the code is something like this...
class h5TPED_I {
protected:
typedef struct {...} T_SNP;
typedef struct {...} T_Sample;
typedef struct {...} T_Gene;
typedef struct {...} T_RefSeq;
// file metadata
...
// create a new file and its data structure
virtual bool buildStruct()=0;
// dependent build methods
virtual int doDataTable()=0;
virtual int doSampleTable()=0;
virtual int doSNPTable()=0;
virtual int doExpressionTable()=0;
// setters
virtual void setData(const std::string &table, const int row, const int col, const T_ExpType val)=0;
virtual void setData(const std::string &table, const int row, const int col, const T_GType val)=0;
virtual void setData(const std::string &table, const int row, const T_Sample &val) =0;
virtual void setData(const std::string &table, const int row, const T_SNP &val) =0;
virtual void setData(const std::string &table, const int row, const T_RefSeq &val) =0;
//virtual void setData(const std::string &table, const int row, const long &val)=0;
// getters
virtual void getData(const std::string &table, const int row, const int col, T_ExpType &val)const =0;
virtual void getData(const std::string &table, const int row, const int col, T_GType &val) const =0;
virtual void getData(const std::string &table, const int row, T_Sample &val)const =0;
virtual void getData(const std::string &table, const int row, T_SNP &val) const =0;
virtual void getData(const std::string &table, const int row, T_RefSeq &val)const =0;
//virtual void getData(const std::string &table, const int row, long &val) const =0;
...
// function to build indexes
virtual bool buildIndex() = 0;
public:
// Empty constructor
h5TPED_I();
// Constructor from existing file
h5TPED_I(const std::string &szFilename);
// val points to memory buffer in which SNP is loaded
virtual void getSnpPtr(const int row, T_GType *&val, const std::string &table = "/SNPDataTableInv") const = 0;
virtual void getSnpSubsetMem(const int snpInd, T_GType *val, const size_t mask_sz, const hsize_t *mask, const std::string &table) const {};
//
virtual void getSamplePtr(const int sampInd, T_GType *&val, const std::string &table = "/SNPDataTable") const = 0;
//
virtual void getSampleMem(const int sampInd, T_GType *val, const std::string &table = "/SNPDataTable") const = 0;
//
virtual void getGxpPtr(const int row, T_ExpType *&val, const std::string &table = "/ExpDataTable") const =0;
//
// General Info ------------------------------------------------------------------------------------------------------------------------
std::string filename() const { return m_filename; };
inline unsigned numSamples() const { return m_nSamples; };
inline unsigned numSnps() const { return m_nSnp; };
inline unsigned numChrs() const { return m_nChr; };
inline unsigned numGenes()const { return m_nGenes; };
// default value for NA data
inline T_GType NA() const { return -1; }
...
};
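And a hypothetical usage snippet: it assumes a concrete subclass (here called h5TPED) that implements the pure virtual methods above, and that the T_GType genotype typedef is publicly visible; the file name is made up.

#include <iostream>

int main() {
    h5TPED f("genotypes.h5");      // construct from an existing file
    std::cout << f.numSnps() << " SNPs x "
              << f.numSamples() << " samples\n";

    // getSnpPtr() points val at a buffer holding one SNP across all samples;
    // the table argument defaults to "/SNPDataTableInv".
    h5TPED::T_GType *snp = nullptr;
    f.getSnpPtr(0, snp);
    if (snp[0] == f.NA())          // NA() returns the missing-data code (-1)
        std::cout << "sample 0 has a missing genotype\n";
    return 0;
}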
It's hosted on Bitbucket and it's still private, since I would like to do some code cleaning first, but it's working fine.
If anyone is interested and would like to work on refinements, plugins, extensions, or benchmarks, please let me know.
I'll second Micans' comment here. If you tell us what you are trying to do, we might be able to help more. There is no general approach that fits all big-data problems.
I just wanted a more general list of techniques; I hope such a thread is OK? I have now learnt a lot and know many ways of working around the problems that come with large datasets.
That is OK. But I still think you would learn more by asking a specific question. What matters is not knowing a list of software/methods, but knowing which one to use in a specific case.
Your question is underpowered. A more specific description will enable more specific answers.
I don't know! Can you give me more detail?