Forum: How Do You Manage Your Files & Directories For Your Projects?
15
112
Entering edit mode
14.7 years ago

People in a laboratory, working on the same project, generate all kinds of files (FASTA, images, raw data, statistics, readme.txt, etc.) that end up in various directories. How do you manage the hierarchy of those directories?

  • there is no standard hierarchy and files are dropped anywhere; it all relies on common knowledge.
  • there is a clearly defined hierarchy (PROJECT_NAME/DATE/machine_user_image_result_but_this_is_the_second_run_because_1st_failed.txt...)
  • files are uploaded to a wiki (you wouldn't do that for large files)
  • there is a central file/wiki answering what/where a file is
  • there is a Readme.txt/describe.xml in each folder.
  • there is a tool (?) for managing this kind of information?
  • (...) ?

Thanks
Pierre

project-management file-management • 35k views
ADD COMMENT
5
Entering edit mode

currently, it's a mess :-)

ADD REPLY
0
Entering edit mode

Pierre: I am wondering how you are managing your files & directories?

ADD REPLY
0
Entering edit mode

Why do you need files anyway? Files are from the 70's, not going to scale these days.

ADD REPLY
66
Entering edit mode
14.7 years ago

On my local computer, I have:

  • a 'workspace' folder, in which each sub-folder corresponds to a separate project
  • a 'data' folder where I put all the data used by more than one project
  • an 'archive' folder with all finished projects

Within each project folder, I have (a scaffolding sketch in bash follows the list):

  • planning/ -> a folder containing all the files related to the early phase of the project. Usually this is the first folder I create, and here I store all the miscellaneous files (notes/objectives/initial drafts) that I collect in the first weeks of a project, when I am still not sure which programs to write.
  • bugs/ -> I used to use ditz to keep track of bugs and to-dos, but now I only use hand-written A7 papers
  • data/
    • folders containing the different data I need to use, soft-linked from ~/data
  • parameters/ -> ideally, I should have configuration files here, so if I want to run my analysis on another dataset I only have to change the parameters
  • src/ -> with all the code
    • a Makefile to re-run all the analyses I wish
    • scripts/ with all the scripts
    • lib/ possibly, if I am reusing code from other projects
    • pipelines/ with all .mk (makefile) files
  • results/
    • tables/ -> tabular-like results
    • plots/ -> plots
    • manuscript/ -> draft of the manuscript, final figures and data, etc.
      • figures/
      • tables/
      • references/
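
For reference, here is a minimal bash sketch that scaffolds this layout (the 'newproject' helper name is mine, not part of the original answer):

    # hypothetical helper: scaffold the project layout described above
    newproject() {
        local root="$HOME/workspace/$1"
        mkdir -p "$root"/{planning,bugs,data,parameters}
        mkdir -p "$root"/src/{scripts,lib,pipelines}
        mkdir -p "$root"/results/{tables,plots,manuscript/{figures,tables,references}}
        touch "$root/src/Makefile"
        # shared datasets get soft-linked from ~/data, e.g.:
        # ln -s ~/data/some_dataset "$root/data/some_dataset"
    }
    newproject my_new_project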

I use git for revision control, to get a log of all the changes I make to scripts and results. Lately I have been reading about sumatra, and am planning to give it a try (a slideshow for the curious here).

I still have to decide where best to put .RData files, as I am still a novice at R.

note: you probably already know the PLoS article "A Quick Guide to Organizing Computational Biology Projects"

ADD COMMENT
2
Entering edit mode

I put RData files with my "code" files (so I load them when I open the relevant code files).

ADD REPLY
1
Entering edit mode

I usually keep RData files in a "data" directory and use setwd() in my R script to point to it. Doesn't really matter how you do it, so long as source() finds the R script and the R script finds the data.

ADD REPLY
0
Entering edit mode

thanks!! That is more or less what I am doing now, but I am not sure whether I should create a separate 'RData' directory.

ADD REPLY
0
Entering edit mode

I only store useful or large objects (like huge matrices from aCGH) in .RData files, which then take less disk space. So, sometimes .RData files replace the original text files in my 'data' folder. This was just a remark...

ADD REPLY
0
Entering edit mode

I asked this question 8 months ago. It's time to validate the most-voted answer :-)

ADD REPLY
0
Entering edit mode

thank you very much :-)

ADD REPLY
18
Entering edit mode
14.7 years ago

one related tip is a handy bash script I got from:

http://dieter.plaetinck.be/per_directory_bash_history

which produces directory-specific bash histories (instead of one giant global history)

whenever I enter a directory I can easily access everything I ever did there, which is priceless when I am trying to remember what I actually did
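
For those who don't want to follow the link, here is a minimal sketch of the idea in bash (the original script's details may differ):

    # in ~/.bashrc -- per-directory history sketch. On each prompt, if the
    # working directory changed, flush the session so far to the old HISTFILE
    # and load the history stored in the new directory.
    _per_dir_history() {
        if [[ "$PWD" != "${_HIST_DIR:-}" ]]; then
            history -a                          # append new lines to the old HISTFILE
            export HISTFILE="$PWD/.bash_history"
            history -c                          # clear the in-memory history list
            history -r                          # read this directory's history, if any
            _HIST_DIR="$PWD"
        fi
    }
    PROMPT_COMMAND=_per_dir_history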

ADD COMMENT
0
Entering edit mode

a version that allows multiple users to view each other's histories: http://jermdemo.blogspot.com/2010/12/directory-based-bash-histories.html

ADD REPLY
13
Entering edit mode
14.2 years ago
Casbon ★ 3.3k

I follow this data plan

ADD COMMENT
0
Entering edit mode

Fantastic! Thanks for sharing!

ADD REPLY
0
Entering edit mode

Excellent - made me laugh! Or was that cry.

ADD REPLY
0
Entering edit mode

^^ yes, this is the default one. Btw the link is broken, here is a working one: http://ivory.idyll.org/blog/data-management.html

ADD REPLY
11
Entering edit mode
14.0 years ago
Rvosa ▴ 580

The following article discusses this exact question, and gives useful tips (which I now follow):

William Stafford Noble 2009. A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. http://dx.doi.org/10.1371/journal.pcbi.1000424

ADD COMMENT
10
Entering edit mode
14.7 years ago
Yuri ★ 1.7k

Great question, thanks!

In my opinion there are several layers of files, and a different approach should be applied at each level. Here is how it's organized in our lab.

  1. Raw data (microarrays, for example)
    • Files are named and stored according to a clearly defined schema
    • Regular backup is mandatory
    • Some files which we probably will not use further (like Affymetrix DAT files) are archived.
    • Access to the files is controlled
    • General information on experiments is stored in a LIMS (we are using Labmatrix, but it's commercial)
    • We also store some preprocessed data (normalized, for example) if the procedure is clearly defined as an SOP.
  2. Temporary data (ongoing analysis)
    • Basically everybody is on their own here. The files are usually stored locally and everyone is responsible for their own backup. I can access the data I need remotely (from home, for example).
    • I do keep some hierarchy based on projects, data type and analysis, but it's not strict and is project-dependent.
    • I find Total Commander very useful for file management. For example, I can write a small comment for every file (Ctrl-Z); it's stored in a text file, and if I copy or move a file, the description goes with it.
    • Files shared within the project team are kept on a network drive with regular backups.
  3. Results to share (documents, figures, tables, ...)
    • We are using Backpack from 37signals. Like a wiki, but a little easier for non-tech users. Together with Basecamp for project management it's quite good; however, it's again commercial and may not suit everybody.
ADD COMMENT
8
Entering edit mode
14.7 years ago
  • For analyzed results, documentation and presentations (pdf, ppt, doc, xls) we are using eRoom, provided by EMC Corporation
  • For experimental results it is a mix of your bullets 1 and 2: there is a clearly defined hierarchy, but after a while it all relies on common knowledge to retrieve information when you need it urgently.
  • Some groups are experimenting with the ELN provided by CambridgeSoft
  • We are also trying to create small social databases (i.e. an antibody database where people can share/retrieve their Western blot experiments, to avoid different people testing the same antibodies. They are able to "rate" the antibodies tested.)
ADD COMMENT
0
Entering edit mode

can EMC be accessed programmatically or via the command line?

ADD REPLY
0
Entering edit mode

I don't know. Up to now I always used a web browser.

ADD REPLY
8
Entering edit mode
14.0 years ago
M.Eckart ▴ 90

We're trying to set up the bioinformatics tool Epos right now. The tool is free and was made by Thasso Griebel at our university, with some help from us. They describe it as a modular software framework for phylogenetic analysis and visualization.

The MySQL version isn't finished yet, but the program is cool and exactly what we needed to sort our stuff.

http://bio.informatik.uni-jena.de/epos/

ADD COMMENT
0
Entering edit mode

It looks like Epos is a system for managing phylogenetic data and analyses only. I think Pierre is asking a general question about how to manage different types of bioinformatics projects. Is Epos extensible to other use cases besides phylogenetics?

ADD REPLY
0
Entering edit mode

It was primarily written for phylogenetic analyses; for other types of projects there are powerful interfaces to other programs too. And of course you can write your own, but unfortunately that is mostly not what a normal user wants to do. I think it's worth taking a look at because it's new and not yet well known. It also comes with cool features like its own script editor and management of cluster analyses. We were impressed, but as you mentioned, only for our phylogenetic analyses... So thanks for your comment.

ADD REPLY
7
Entering edit mode
14.3 years ago
Niek De Klein ★ 2.6k

Dropbox (http://www.dropbox.com/) is a nice way of keeping all files synchronized. You can invite computers to it and if you drop a file in it, it will automatically be updated on all invited computers.

ADD COMMENT
1
Entering edit mode

...as well as being a great way of sharing files with another person

ADD REPLY
0
Entering edit mode

Take a look at SparkleShare, http://sparkleshare.org/documentation.html , an open source alternative that lets you use your space on GitHub or Gitorious to share files, and also uses git to do the versioning of the files.

ADD REPLY
0
Entering edit mode

sorry, the correct link is http://sparkleshare.org/

ADD REPLY
6
Entering edit mode
14.7 years ago

I have the following major directories:

/work - this is where I keep my project directories

/data - raw, unprocessed data

/software - 3rd party software required for the various work flows

/code - general code repo

I structure the individual work directories as follows:

/work
   └── project1
          └── sub-directories based on analysis

For example: code/, analysis/, results/, etc.

Depending upon the repetitive nature of the analysis, I create date-based directories to track files generated at different time points. I also keep a README file in the directories of each project to make it easier to check the contents at a later stage. I'm a big fan of "tree" whenever I need to check the contents of a directory. Irrespective of the various data categories I deal with, this format has worked for me.
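
A small bash sketch of that convention (the 'newrun' helper and the sub-directory names are illustrative, not the author's actual setup):

    # hypothetical helper: start a dated analysis directory inside a project,
    # with a README stub recording who ran what and when
    newrun() {
        local project="$1" label="$2"
        local dir="/work/${project}/$(date +%Y-%m-%d)_${label}"
        mkdir -p "${dir}"/{code,analysis,results}
        printf 'Date: %s\nUser: %s\nAnalysis: %s\n' \
            "$(date)" "$USER" "${label}" > "${dir}/README"
        echo "${dir}"
    }
    newrun project1 variant_calling   # -> /work/project1/<YYYY-MM-DD>_variant_calling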

ADD COMMENT
5
Entering edit mode
14.2 years ago
Ketil 4.1k

We used to have a slightly ad-hoc hierarchy, split by organism and then data type. This causes conflicts and generally doesn't scale (I guess most people working on this are familiar with http://www.shirky.com/writings/ontology_overrated.html ?)

My goal is to have defined datasets (i.e. collections of related files), with each dataset residing in its own subdirectory. In addition to the data files, there will be a metadata file containing the relevant metadata: the list of files, their types, checksums, the program or process that generated them, the person responsible, and the relationships between datasets.
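
As an illustration, a bash sketch of how such a metadata file could be generated (the field names and the METADATA filename are assumptions; the answer doesn't fix a schema):

    # hypothetical sketch: write a metadata stub for a dataset directory,
    # listing each file with its sha256 checksum
    dataset_meta() {
        local dir="$1"
        {
            echo "dataset: $(basename "$dir")"
            echo "owner: $USER"
            echo "created: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
            echo "files:"
            find "$dir" -maxdepth 1 -type f ! -name METADATA -print0 |
                while IFS= read -r -d '' f; do
                    echo "  - name: $(basename "$f")"
                    echo "    sha256: $(sha256sum "$f" | cut -d' ' -f1)"
                done
        } > "$dir/METADATA"
    }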

This way, applications like our BLAST server will be able to trawl the datasets, identify those containing FASTA files, and add them to the database with the correct type and meta-information.

ADD COMMENT
0
Entering edit mode

Some more details here.

ADD REPLY
5
Entering edit mode
13.8 years ago

A lot of the answers given here are very useful, but basically I would advocate another approach. Central to any wet-lab study is the study description itself. We want to capture that design, including the sample and assay descriptions, using a Generic Study Capture Framework (GSCF) that is part of the systems biology database dbNP. In this we follow the same philosophy as the generic study description standard ISA-Tab (reading and writing ISA-Tab will be added to dbNP soon): the study description links to the actual raw, cleaned, statistically evaluated and biologically interpreted data. This way you don't really have to structure where the files are, since you can just find them from GSCF. GSCF is currently under development as part of the open source project dbnp.

Two papers about dbNP were published here and here.

Of course the file location storage is just a small aspect of GSCF. It is mainly about ontology-based study capture using NCBO ontologies, and queries based on that. It should also facilitate data submission to the EBI and NCBI repositories.

ADD COMMENT
3
Entering edit mode
13.5 years ago

I am personally using biocoders.net: I created a private group where I can upload my documents, papers, snippets and code, and I use my group calendar to schedule my daily plans, etc.

ADD COMMENT
3
Entering edit mode
13.5 years ago
Ying W ★ 4.3k

Since someone revived this thread I figure I should add this in.

When organizing your files it is also important to keep reproducibility in mind. For R there is a package called Sweave that is useful for this; alternatives also exist for other languages.

Doing this might be useful for organizing the results/ and src/ directories

ADD COMMENT
3
Entering edit mode
13.2 years ago
Faheemmitha ▴ 210

This project, currently called bixfile, is designed as a web-based file management system and is (I think) at least tangentially related to your question. The application lets the user upload files via a web interface. The locations of all files and folders are stored in a database, and annotation is possible.

ADD COMMENT
7
Entering edit mode
6.8 years ago

7.8 years later:

Leon Eyrich Jessen: How to organize a project

" The most important talk you never heard!"

Hackinars in Bioinformatics

February 8th 2018

https://github.com/leonjessen/talks/blob/master/presentations/20180208_hackinar_project_organisation.pdf

ADD COMMENT
