Question

Big Data: Storage And Analysis

10

14.1 years ago

ruphos ▴ 100

A question for those of you out there working on larger bioinformatics analysis:

What sort of platform are you using to store your data and how are you integrating that into your analysis?

My work is part of a larger network of researchers pooling their data. As a result, people need to be able to join their datasets with others' to perform combined analysis, from different parts of the country. Is anyone else in a similar situation? How are you addressing those needs?

For reference, we have genotype data on ~2000 individuals. A mix of 550k and OMNI 1M chips, depending on the run. We have numerous datasets of various phenotypes relating to our area of study for most of those individuals to do trait analysis. We've mostly been doing GWAS with PLINK, but will be doing more with IMPUTE, PennCNV, STRUCTURE and similar applications in the near future. We will also soon be handling whole exome data for several hundred subjects.

data • 5.5k views

ADD COMMENT • link updated 13.9 years ago by 1888 ▴ 80 • written 14.1 years ago by ruphos ▴ 100

0

I know your pain :-)

ADD REPLY • link 14.1 years ago by Pierre Lindenbaum 164k

0

Might find this question of use: Using HDF5 to store bio-data

ADD REPLY • link 14.1 years ago by Blunders ★ 1.1k

0

Might find this question of use, since it appears you're able to use Hadoop and HDF5 together... Using HDF5 to store bio-data.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 14.1 years ago by Blunders ★ 1.1k

score 4 · Answer 1 · 2010-11-03

4

14.1 years ago

Pierre Lindenbaum 164k

My two cents:

You should have a look at Deepak Singh's slides about the world of Big Data, Amazon, Hadoop, etc.: http://www.slideshare.net/mndoci/presentations

Galaxy can be installed on your server(s) and it allows your users to merge/join/etc the NGS data.

On my side, we are currently working on some exome data and I usually handle those data with BerkeleyDB-JE (instead of a classical RDBMS).
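To make the key-value idea concrete, here is a minimal sketch. It is not the BerkeleyDB-JE API: it swaps in Python's standard-library `dbm` module as a stand-in embedded key-value store, and the key layout (`chrom:pos:ref:alt`), the JSON value format, and the function names are all assumptions made for illustration.

```python
import dbm
import json
import os
import tempfile


def variant_key(chrom, pos, ref, alt):
    """Build a fixed-width key for one variant (layout is hypothetical)."""
    return f"{chrom}:{pos:010d}:{ref}:{alt}".encode("ascii")


def put_call(db, chrom, pos, ref, alt, sample, genotype):
    """Record one sample's genotype under the variant's key."""
    key = variant_key(chrom, pos, ref, alt)
    # Values are JSON-encoded {sample: genotype} maps stored as bytes.
    calls = json.loads(db[key]) if key in db else {}
    calls[sample] = genotype
    db[key] = json.dumps(calls).encode("ascii")


def get_calls(db, chrom, pos, ref, alt):
    """Return {sample: genotype} for a variant, or {} if it was never stored."""
    key = variant_key(chrom, pos, ref, alt)
    return json.loads(db[key]) if key in db else {}


def demo():
    # Throwaway database file in a temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "variants")
    with dbm.open(path, "c") as db:
        put_call(db, "chr1", 12345, "A", "G", "sample_01", "0/1")
        put_call(db, "chr1", 12345, "A", "G", "sample_02", "1/1")
        return get_calls(db, "chr1", 12345, "A", "G")
```

Calling `demo()` stores two genotype calls for the same variant and reads them back as one dictionary. The point is only that a lookup by variant needs no schema or joins; a real store would also need range scans and concurrent access, which this sketch does not attempt.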

As some physicians want to have a closer look at the data, I've created a Java Web Start application to allow them to view the data (just the VCFs, a few MB) via a graphical interface.

ADD COMMENT • link 14.1 years ago by Pierre Lindenbaum 164k

0

Galaxy looks neat, I'll definitely have to check into that one more. We were looking at key-value based systems, but decided it would take more development than we had resources for. I'm thinking it might still be something to consider as we get into exome data. Thanks!

ADD REPLY • link 14.1 years ago by ruphos ▴ 100

score 2 · Answer 2 · 2010-11-04

2

14.1 years ago

apfejes ▴ 160

This sounds exactly like one of my projects. I've developed a database for combining large datasets of SNVs and indels, with a Java (command line) API for querying against the dataset. It's meant to be used as part of a larger pipeline, but is often used as a standalone tool. I started it because I was unable to find any other tool that could be used to efficiently compare any number of data sets in a reasonable amount of time.

The trick is really in the implementation, though - looking at 2000 individuals at once and making sense of the trends isn't as easy as it sounds, but it can be done. (We're up to almost 1500 genome, exome and transcriptome libraries, so not quite the 2000 you've got, but at several million SNPs per library, it adds up fast.)

Anyhow, my database template and the API are open source, and we're just getting ready to submit an application note - I can give you more information if you want, but I'd hate to be spamming my own work here.
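For readers who want a feel for the general approach, the sketch below shows one way to pool variant calls from many libraries in a relational store and ask which variants recur. It is only an illustration, not the schema or API of the tool described above: the table names, columns, and functions are hypothetical, and it uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE variant (
            id INTEGER PRIMARY KEY,
            chrom TEXT NOT NULL,
            pos INTEGER NOT NULL,
            ref TEXT NOT NULL,
            alt TEXT NOT NULL,
            UNIQUE (chrom, pos, ref, alt)
        );
        CREATE TABLE observation (
            variant_id INTEGER NOT NULL REFERENCES variant(id),
            library TEXT NOT NULL,
            PRIMARY KEY (variant_id, library)
        );
    """)
    return conn


def add_observation(conn, library, chrom, pos, ref, alt):
    """Store each distinct variant once and link it to the observing library."""
    conn.execute(
        "INSERT OR IGNORE INTO variant (chrom, pos, ref, alt) VALUES (?, ?, ?, ?)",
        (chrom, pos, ref, alt),
    )
    (variant_id,) = conn.execute(
        "SELECT id FROM variant WHERE chrom=? AND pos=? AND ref=? AND alt=?",
        (chrom, pos, ref, alt),
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO observation (variant_id, library) VALUES (?, ?)",
        (variant_id, library),
    )


def shared_variants(conn, min_libraries):
    """Return variants seen in at least min_libraries libraries, with counts."""
    return conn.execute(
        """
        SELECT v.chrom, v.pos, v.ref, v.alt, COUNT(*) AS n
        FROM variant v JOIN observation o ON o.variant_id = v.id
        GROUP BY v.id
        HAVING COUNT(*) >= ?
        ORDER BY v.chrom, v.pos
        """,
        (min_libraries,),
    ).fetchall()


def demo():
    conn = build_db()
    add_observation(conn, "lib_A", "chr1", 100, "A", "G")
    add_observation(conn, "lib_B", "chr1", 100, "A", "G")
    add_observation(conn, "lib_B", "chr2", 200, "C", "CT")
    return shared_variants(conn, 2)
```

Storing each variant once and linking libraries to it keeps the table sizes manageable when the same SNP shows up in hundreds of libraries, and the recurrence question becomes a single grouped query. At millions of SNPs per library, indexing and bulk loading matter far more than this toy example suggests.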

ADD COMMENT • link 14.1 years ago by apfejes ▴ 160

3

Don't tease us -- definitely give a URL to your work if it's open source and you feel comfortable sharing it. There's nothing wrong with promoting good work you've done, especially when it answers the question.

ADD REPLY • link 14.1 years ago by Brad Chapman 9.7k

0

I'm looking into all sorts of options, just to get an idea of what's out there and what other people are doing if nothing else. I'd be more than happy to take a look at your project.

ADD REPLY • link 14.1 years ago by ruphos ▴ 100

0

As Brad puts it, if it is already open source, why not give out the link to it?

ADD REPLY • link 14.1 years ago by Istvan Albert 101k

0

I obviously don't visit enough, since I didn't know there were comments. My work is part of the "Vancouver Short Read Analysis Package" on Sourceforge.

Sorry for taking so long to reply.

ADD REPLY • link 14.0 years ago by apfejes ▴ 160

score 0 · Answer 3 · 2011-05-19

0

13.5 years ago

1888 ▴ 80

Hi all genetics/genomics researchers,

Just to continue on this thread... Are there any databases available to pool all the exome data that the different groups are generating? I know it is not easy to make this public, because sequencing is still so expensive, but I think it is time to do that to understand more about diseases... Don't you think?