Forum: How Do You Manage Your Files & Directories For Your Projects?
15 • 14.7 years ago

People in a laboratory, working on the same project, generate all kinds of files (FASTA, images, raw data, statistics, readme.txt, etc.) that end up scattered across directories. How do you manage the hierarchy of those directories?

  • there is no standard hierarchy and the files are dropped anywhere; it all relies on common knowledge.
  • there is a clearly defined hierarchy (PROJECT_NAME/DATE/machine_user_image_result_but_this_is_the_second_run_because_1st_failed.txt...)
  • files are uploaded to a wiki (you wouldn't do that for large files)
  • there is a central file/wiki recording what each file is and where it lives
  • there is a Readme.txt/describe.xml in each folder.
  • there is a tool (?) for managing this kind of information?
  • (...) ?

Thanks
Pierre

project-management file-management • 35k views
5

currently, it's a mess :-)

0

Pierre: I am wondering how you are managing your files & directories?

0

Why do you need files anyway? Files are from the '70s; they're not going to scale these days.

66 • 14.7 years ago

On my local computer, I have:

  • a 'workspace' folder, in which each sub-folder corresponds to a separate project
  • a 'data' folder where I put all the data used by more than one project
  • an 'archive' folder with all finished projects

Within each project folder, I have the following (a shell sketch of this skeleton appears after the list):

  • planning/ -> a folder containing all the files related to the early phase of the project. Usually this is the first folder I create, and here I store all the miscellaneous files (notes/objectives/initial drafts) that I collect in the first weeks of a project, when I am still not sure which programs to write.
  • bugs/ -> I used to use ditz to keep track of bugs and to-dos, but now I use only hand-written A7 slips of paper
  • data/
    • folders containing the different data I need to use, soft-linked from ~/data
  • parameters/ -> ideally, configuration files, so that if I want to run my analysis on another dataset, I only have to change the parameters here
  • src/ -> with all code
    • a Makefile to re-run all the analysis I wish
    • scripts/ with all the scripts
    • lib/ (if I am reusing code from other projects)
    • pipelines/ with all .mk (makefile) files
  • results/
    • tables/ -> tabular-like results
    • plots/ -> plots
    • manuscript/ -> draft for the manuscript, final figures and data, etc.
      • figures/
      • tables/
      • references/
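A minimal shell sketch of this skeleton (the project name 'myproject' and the linked dataset name are placeholders; the directory names follow the list above):

    #!/bin/bash
    # Sketch: create the project layout described above.
    project=~/workspace/myproject

    # planning, bugs, parameters and the results tree
    mkdir -p "$project"/{planning,bugs,parameters}
    mkdir -p "$project"/results/{tables,plots,manuscript/{figures,tables,references}}

    # src/ with the Makefile, scripts, optional lib and .mk pipelines
    mkdir -p "$project"/src/{scripts,lib,pipelines}
    touch "$project"/src/Makefile

    # data/ holds soft links to shared datasets kept in ~/data
    mkdir -p "$project"/data
    ln -s ~/data/some_shared_dataset "$project"/data/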

I use git for revision control, to keep a log of all the changes I make to scripts and results. Lately I have been reading about sumatra and am planning to give it a try (a slideshow for the curious here).
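For the git side, a minimal pattern might look like this (a sketch assuming the layout above; whether to version the data is a judgment call, here it is ignored):

    cd ~/workspace/myproject
    git init
    printf 'data/\n*.RData\n' > .gitignore   # keep large/linked data out of the repo
    git add .
    git commit -m "initial project skeleton"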

I still have to decide exactly where to put .RData files, as I am still a novice to R.

Note: you probably already know the PLoS article "A Quick Guide to Organizing Computational Biology Projects".

2

I put .RData files with my "code" files (so I load them when I open the relevant code files).

1

I usually keep .RData files in a "data" directory and use setwd() in my R script to point to it. It doesn't really matter how you do it, so long as source() finds the R script and the R script finds the data.

0

Thanks!! That is more or less what I am doing now, but I am not sure whether I should create a separate 'RData' directory.

0

I only store useful or large objects (like huge matrices from aCGH) in .RData files, which then take less disk space. So sometimes .RData files replace the original text files in my 'data' folder. This was just a remark...

0

I asked this question 8 months ago. It's time to validate the most-voted answer :-)

0

thank you very much :-)

18 • 14.7 years ago

one related tip is a handy bash script I got from:

http://dieter.plaetinck.be/per_directory_bash_history

which produces directory-specific bash histories (instead of one giant global history)

Whenever I enter a directory I can easily access everything I ever did there, which is priceless when I am trying to remember what I actually did.
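The gist of the trick, as a minimal sketch (this is not the script from the link, just an illustration of the PROMPT_COMMAND approach it is based on):

    # Add to ~/.bashrc: whenever the prompt is drawn in a new directory,
    # flush the session history to the old file, then switch HISTFILE
    # to a .bash_history inside the current directory.
    _per_dir_history() {
        if [ "$PWD" != "${_HIST_DIR:-}" ]; then
            history -a                         # append to the old history file
            export HISTFILE="$PWD/.bash_history"
            history -c                         # clear the in-memory history
            history -r                         # load this directory's history
            _HIST_DIR="$PWD"
        fi
    }
    PROMPT_COMMAND="_per_dir_history${PROMPT_COMMAND:+;$PROMPT_COMMAND}"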

0

A version that allows multiple users to view each other's histories: http://jermdemo.blogspot.com/2010/12/directory-based-bash-histories.html

13 • 14.2 years ago • Casbon ★ 3.3k

I follow this data plan

0

Fantastic! Thanks for sharing!

0

Excellent - made me laugh! Or was that cry.

0

^^ yes, this is the one by default. Btw, the link is broken; here is a working one: http://ivory.idyll.org/blog/data-management.html

11 • 14.0 years ago • Rvosa ▴ 580

The following article discusses this exact question, and gives useful tips (which I now follow):

William Stafford Noble 2009. A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. http://dx.doi.org/10.1371/journal.pcbi.1000424

10 • 14.7 years ago • Yuri ★ 1.7k

Great question, thanks!

In my opinion there are several layers of files, and a different approach should be applied at each level. Here is how it's organized in our lab.

  1. Raw data (microarrays, for example)
    • Files are named and stored in clearly defined schema
    • Regular backup is mandatory
    • Some files which we probably will not use further (like Affymetrix DAT files) are archived.
    • Access to the files is controlled
    • General information on experiments is stored in LIMS (we are using Labmatrix, but it's commercial)
    • We also store some preprocessed data (normalized data, for example) if the procedure is clearly defined as an SOP.
  2. Temporary data (ongoing analysis)
    • Basically everybody is on their own here. The files are usually stored locally and everyone is responsible for their own backup. I can access the data I need remotely (from home, for example).
    • I do keep some hierarchy based on projects, data type and analysis, but it's not strict and project-dependent.
    • I have found Total Commander to be very useful for file management. For example, I can write a small comment for every file (Ctrl-Z); it's stored in a text file, and if I copy or move a file, the description goes with it.
    • Files shared with the project team are kept on a network drive with regular backups.
  3. Results to share (documents, figures, tables, ...)
    • We are using Backpack from 37signals. It's like a wiki, but a little easier for non-tech users. Together with Basecamp for project management it's quite good; however, it's again commercial and may not suit everybody.

8 • 14.7 years ago
  • For analyzed results, documentation and presentations (pdf, ppt, doc, xls) we are using eRoom, provided by EMC Corporation.
  • For experimental results it is a mix of your bullets 1 and 2: there is a clearly defined hierarchy, but after a while it all relies on common knowledge to retrieve information when you need it urgently.
  • Some groups are experimenting with an ELN provided by CambridgeSoft.
  • We are also trying to create small social databases (e.g. an antibody database where people can share/retrieve their Western blot experiments, to avoid different people testing the same antibodies - they are able to "rate" the antibodies tested).
0

Can eRoom be accessed programmatically or via the command line?

0

I don't know. Up to now I always used a web browser.

8 • 14.0 years ago • M.Eckart ▴ 90

We're trying to set up the bioinformatics tool Epos right now. The tool is free and was made by Thasso Griebel at our university, with some help from us. It describes itself as a modular software framework for phylogenetic analysis and visualization.

The MySQL version isn't finished yet, but the program is cool and exactly what we needed to sort our stuff.

http://bio.informatik.uni-jena.de/epos/

0

It looks like Epos is a system for managing phylogenetic data and analyses only. I think Pierre is asking a general question about how to manage different types of bioinformatics projects. Is Epos extensible to other use cases besides phylogenetics?

0

It was first written for phylogenetic analyses, but for different types of projects there are powerful interfaces to other programs too. And of course you can write your own, though unfortunately that is mostly not what a normal user wants to do. I think it's worth taking a look at because it's new and not yet well known. It also comes with cool stuff like its own script editor and management of cluster analyses. We were impressed, but, as you mentioned, only for our phylogenetic analyses... So thanks for your comment.

7 • 14.3 years ago • Niek De Klein ★ 2.6k

Dropbox (http://www.dropbox.com/) is a nice way of keeping all your files synchronized. You can link computers to it, and if you drop a file in, it will automatically be updated on all linked computers.

1

...as well as being a great way of sharing files with another person

0

Take a look at SparkleShare (http://sparkleshare.org/documentation.html), an open-source alternative that lets you use your space on GitHub or Gitorious to share files, and that also uses git to version the files.

0

sorry, the correct link is http://sparkleshare.org/

6 • 14.7 years ago

I have the following major directories:

/work - this is where I keep my project directories

/data - raw, unprocessed data

/software - third-party software required for the various workflows

/code - general code repo

I organize the individual work directories as follows:

/work
   |
   /work/project1
          |
          sub-directories based on the analysis, e.g. code, analysis, results

Depending on how repetitive the analysis is, I create date-based directories to track files generated at different time points. I also keep README files in the directories of each project to make it easier to check the contents at a later stage. I'm a big fan of "tree" whenever I need to check the contents of directories. Irrespective of the various data categories I deal with, this format has worked for me. A small sketch of the date-based layout follows.
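As an illustration only (the helper name and its arguments are made up; the layout follows the description above), starting a new dated analysis directory could look like this:

    # Sketch: create a dated analysis directory with a README stub.
    new_analysis() {
        local project="$1" analysis="$2"
        local dir="/work/$project/$analysis/$(date +%Y-%m-%d)"
        mkdir -p "$dir"
        printf 'Analysis: %s\nStarted:  %s\nNotes:\n' \
            "$analysis" "$(date)" > "$dir/README"
        echo "$dir"
    }

    # Usage: new_analysis project1 variant_calling
    # creates e.g. /work/project1/variant_calling/<today's date>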

5 • 14.2 years ago • Ketil 4.1k

We used to have a slightly ad-hoc hierarchy, split by organism and then by data type. This causes conflicts and generally doesn't scale (I guess most people working on this are familiar with http://www.shirky.com/writings/ontology_overrated.html ?)

My goal is to have defined datasets (i.e. collections of related files), with each dataset residing in its own subdirectory. In addition to the data files, there will be a metadata file containing the relevant metadata (sketched below): the list of files, their types, checksums, the program or process that generated them, the person responsible, and the relationships between datasets.

This way, applications like our BLAST server will be able to trawl the datasets, identify those containing FASTA files, and add them to the database with the correct type and meta-information.
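A minimal sketch of generating such a metadata file (the manifest layout and the name METADATA.yaml are invented for illustration; the answer does not specify a schema):

    # Sketch: write a per-dataset manifest listing each file with its
    # type (via `file`) and an MD5 checksum.
    dataset_manifest() {
        local ds="$1"
        {
            echo "dataset: $(basename "$ds")"
            echo "created: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
            echo "files:"
            for f in "$ds"/*; do
                [ -f "$f" ] || continue
                [ "$(basename "$f")" = "METADATA.yaml" ] && continue  # skip the manifest itself
                printf -- '- name: %s\n  type: %s\n  md5: %s\n' \
                    "$(basename "$f")" "$(file -b "$f")" \
                    "$(md5sum "$f" | cut -d' ' -f1)"
            done
        } > "$ds/METADATA.yaml"
    }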

0

Some more details here.

5 • 13.8 years ago

A lot of the answers given here are very useful, but basically I would advocate another approach. Central to any wet-lab study is the study description itself. We want to capture that design, including the sample and assay descriptions, using a Generic Study Capture Framework (GSCF) that is part of the systems biology database dbNP. In this we follow the same philosophy as the generic study description standard ISA-Tab (reading and writing ISA-Tab will be added to dbNP soon): the study description links to the actual raw, cleaned, statistically evaluated and biologically interpreted data. This way you don't really have to structure where the files are, since you can just find them from GSCF. GSCF is currently under development as part of the open-source project dbNP.

Two papers about dbNP were published here and here.

Of course, file location storage is just a small aspect of GSCF. It is mainly about ontology-based study capture using NCBO ontologies, and queries based on that. It should also facilitate data submission to the EBI and NCBI repositories.

3 • 13.5 years ago

I am personally using biocoders.net: I create a private group where I can upload my documents, papers, snippets and code, use my group calendar to schedule my daily plans, etc.

3 • 13.5 years ago • Ying W ★ 4.3k

Since someone revived this thread I figure I should add this in.

When organizing your files it is also important to keep reproducibility in mind. For R there is a package called Sweave that is useful for this; alternatives also exist for other languages.

Doing this might be useful for organizing the results/ and src/ directories.
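For instance (a minimal sketch; 'analysis.Rnw' is a hypothetical file name), a Sweave document kept under src/ can be woven and compiled into a report from the shell:

    # Run the R chunks in analysis.Rnw, producing analysis.tex,
    # then compile the report; figures and tables are regenerated
    # from the data on every run.
    R CMD Sweave analysis.Rnw
    pdflatex analysis.tex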

3 • 13.2 years ago • Faheemmitha ▴ 210

This project, currently called bixfile, is designed as a web-based file management system and is (I think) at least tangentially related to your question. The application lets the user upload files via a web interface; the locations of all files and folders are stored in a database, and annotation is possible.

7 • 6.9 years ago

7.8 years later:

Leon Eyrich Jessen: How to organize a project

"The most important talk you never heard!"

Hackinars in Bioinformatics

February 8th 2018

https://github.com/leonjessen/talks/blob/master/presentations/20180208_hackinar_project_organisation.pdf
