I would like to hear from anyone and everyone. You don't have to have built a data warehouse: if you have worked on servers that were meticulously maintained and organised, and you can explain what you liked, what you disliked and which features you would have liked, then you are definitely qualified to comment on this post :-)
Problem:
We are currently building a small data warehouse and could use advice from others, particularly on how they organised their data. The warehouse provides dedicated nodes with decent memory and layers of security to comply with numerous regulations. It will store GWAS data, such as raw genotype data and QC'ed data, and function as a workspace.
Previously, our raw genotype data was stored on servers without any consistent order or structure. Besides ending up with several duplicate copies of raw and QC'ed data, we also had severe problems handing over information, for example when students finished or otherwise left. Usually the folder contained only the data; information about what was done to the data, how it was done, and where the data was obtained from was kept in someone's head or written down in a thesis. Some of this might be solvable just by adding a README file to each folder and requiring others to do the same (it would be a start). But in some cases it might also help to have other structures that cross-reference or help organise biomarkers, phenotype data and cohorts.
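To make the README idea concrete, here is a minimal sketch of how the rule could be checked automatically. It assumes a README is any file named `README`, `README.md` or `README.txt` (case-insensitive); that naming convention and the function name `folders_missing_readme` are illustrative, not an established standard.

```python
from pathlib import Path

# Assumed naming convention for README files (hypothetical, adjust as needed).
README_NAMES = {"readme", "readme.md", "readme.txt"}


def folders_missing_readme(root):
    """Return, sorted, every folder under `root` (including `root` itself)
    that contains at least one file but no README."""
    root = Path(root)
    folders = [root] + [p for p in root.rglob("*") if p.is_dir()]
    missing = []
    for folder in folders:
        files = [f for f in folder.iterdir() if f.is_file()]
        has_readme = any(f.name.lower() in README_NAMES for f in files)
        # Folders holding only sub-folders are skipped: nothing to document yet.
        if files and not has_readme:
            missing.append(folder)
    return sorted(missing)
```

Something like this could run on a schedule and report the offending folders to whoever maintains the warehouse, which is gentler than blocking writes but still keeps the rule visible.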
Thanks for your time and input
I've moved it to a Forum post, as this is more a discussion than a question with a finite number of "correct" answers. There are some papers on the subject:
Ten Simple Rules for Creating a Good Data Management Plan
Good enough practices in scientific computing
Great question! Also interested to hear from people.
From my personal experience as an end user, finding the correct data was troublesome because no master file existed that outlined where to find each file, what was done to it, when and how it was produced, and other metadata (e.g. what specimen it came from).
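As a rough illustration of what such a master file could look like, the sketch below walks a data directory and builds one row per file, with a checksum so that duplicate copies show up as well. The column names (`source`, `processing`) and all function names are assumptions made for this example; the free-text columns would still have to be filled in by a person.

```python
import csv
import hashlib
from collections import defaultdict
from pathlib import Path

# Hypothetical manifest columns; "source" and "processing" are filled in by hand.
FIELDS = ["path", "size_bytes", "sha256", "source", "processing"]


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large genotype files are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return one dict per file under `root`, sorted by path."""
    root = Path(root)
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rows.append({
            "path": path.relative_to(root).as_posix(),
            "size_bytes": path.stat().st_size,
            "sha256": sha256_of(path),
            "source": "",
            "processing": "",
        })
    return rows


def find_duplicates(rows):
    """Group paths by checksum and keep only checksums seen more than once."""
    by_hash = defaultdict(list)
    for row in rows:
        by_hash[row["sha256"]].append(row["path"])
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


def write_manifest(rows, out_file):
    """Write the manifest rows as a CSV file with a header line."""
    with open(out_file, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Writing the manifest to a location outside the scanned directory avoids it listing itself. Hashing every file can be slow on large genotype sets, so in practice one might only re-hash files whose size or modification time has changed.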
Thanks for the upvote. When I started my PhD it was in a new location, I had zero experience with the group, and everything was more or less handed down by word of mouth. It took me three months just to find out who I should ask to get access and what I could perhaps gain access to. What I wouldn't give for a TL;DR or a welcome package.
We have implemented a simple spec/tool for storing and managing our reference genome datasets, and you could probably use our approach or build on it. The tool is here: https://compbiocore.github.io/refchef/ This is what the browsable interface looks like: https://compbiocore.github.io/refchef-view/ It's pretty simple to implement, mainly using git and GitHub and a small Python codebase. Hope this is useful for your purposes.