A question for those of you out there working on larger bioinformatics analyses:
What sort of platform are you using to store your data and how are you integrating that into your analysis?
My work is part of a larger network of researchers pooling their data. As a result, people in different parts of the country need to be able to join their datasets with others' to perform combined analyses. Is anyone else in a similar situation? How are you addressing those needs?
For reference, we have genotype data on ~2000 individuals, typed on a mix of 550k and OMNI 1M chips, depending on the run. For most of those individuals we also have numerous phenotype datasets relating to our area of study, which we use for trait analysis. We've mostly been doing GWAS with PLINK, but will be doing more with IMPUTE, PennCNV, STRUCTURE and similar applications in the near future. We will also soon be handling whole exome data for several hundred subjects.
I know your pain :-)
You might find this question useful, since it appears you can use Hadoop and HDF5 together: "Using HDF5 to store bio-data".