Need Help Choosing A Database
4
1
Entering edit mode
13.2 years ago
Science_Robot ★ 1.1k

I have data that sort of looks like this:

SAMPLE_ID,DATA_1,DATA_i,...,DATA_n
1, ...

And I want to be able to query it (using any sort of lambda) like so

select from table where DATA_5834 >= .1

Or..

select from table where ((DATA_x > 0.2) and (DATA_y == "canada"))

I could use SQL however I do not know ahead of time what the columns will be. I don't want to define them ahead of time as there can be between 1 and 10^5 columns.

What DB most appropriately fits this task?

database • 2.5k views
ADD COMMENT
2
Entering edit mode

MySQL is only useful for tables that have a defined number of columns, I think the best idea is to draw up a database schematic, where your database is in 3rd normal form and decide from there.

ADD REPLY
1
Entering edit mode

"however I do not know ahead of time what the columns will be". Exactly, this is a typical scenario where a relational DBMS is not very suitable. This is rather a case for something that can store large matrices or multidimensional arrays of data, e.g. HDF5

ADD REPLY
1
Entering edit mode

You cannot plan, if you dont know what your planning. So I think the best bet is to wait until he/she receives the dataset

ADD REPLY
0
Entering edit mode

@harpalls I have the datasets. There are many. It'd be a waste of time to write schema.

ADD REPLY
2
Entering edit mode
13.2 years ago

For a SQL database and more up to 10E5 point you could use the following schema:

create table MyData
(
id int unsigned primary key auto_increment,
sample_id int unsigned not null indexed,
column_id int  unsigned not null indexed,
is_numeric enum('yes','no'),
content varchar(50) not null
);

when I store my genotypes using a NOSQL key/value engine (e.g. BerkeleyDB). The key is the (chrom/position) and the value is an array of N(genotypes).

as Michael suggested you could also store your data in a binary table, using HDF5 (complicated IMHO) or a simple C-based file: e.g. see my blog post: http://plindenbaum.blogspot.com/2010/04/short-post-plain-text-vs-binary-data.html

ADD COMMENT
2
Entering edit mode
13.2 years ago

It sounds like you have a bunch of microarray data? If so, the suggestion to use normalized "skinny" tables is a good one. Just to add to the discussion, though, take a look at CassandraDB which has an interesting data model that might meet your needs. If you are not so interested in "real-time" queries, hadoop/map reduce or HBASE might be useful, also.

ADD COMMENT
1
Entering edit mode
13.2 years ago
Melanie ▴ 660

I don't know enough about the nosql type solutions to help with those, and maybe the full data set is large enough to warrant going that direction.

But if you want to use MySQL or some other relational database, there are lots of examples of databases that store an arbitrary number of attributes about some entity. The structure for this sort of data is a "long, skinny" table- i.e., you have a table with three columns:

ID Data type Data value

Then you can easily run your query, although if you want to have both numeric and text values in the same column you'll have to use some conversion tricks. I'd probably try to design that away, perhaps doing something like this:

ID Data type Numeric value Text value

Or better: split the numeric and text data into two different tables.

You can get fancier from there- i.e., make data type into a code and control allowed values with another table, put in some foreign keys to indicate which entities these data apply to, etc.

ADD COMMENT
1
Entering edit mode
13.2 years ago

If your data fit in RAM, R software and subset() function could do the job.

Ex:

## my_data dataframe creation
data_1<-c("canada","canada","usa")
data_2<-c(5,2,5)
my_data<-data.frame(data_1,data_2)

## my_query
my_res<-subset(my_data,subset=((data_1=="canada") & (data_2<=3)))

## adding a new column to my_data
data_3<-c("red","green", "blue")
my_data<-data.frame(my_data,data_3)
ADD COMMENT

Login before adding your answer.

Traffic: 1775 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6