Forum: File and directory management best practices
4.3 years ago • nkinney06 ▴ 140

We've all been there: after a few years and many bioinformatics projects, your home directory is cluttered with data, scripts, README files, Dockerfiles, backed-up databases, and so on. You've probably got a home-cooked system for keeping track of it all that is barely sufficient. GitHub is great for some projects, but it's still a challenge keeping track of everything. So, a serious question about this dilemma: is there a book, chapter, or set of best practices someone can point me to that I can adopt before it's too late? Thanks!

file-management • 2.7k views
Comment:

Hopefully you back all of this up on something reliable, because without a reliable backup none of this organization will be worth anything should there be a storage failure.

Answer • 4.3 years ago

I use a little WordPress blog to write tiny summaries of my work. Basically, WordPress is a ready-made product that puts a database behind what I do. Each post gets its own set of tags, which I can use to quickly filter down to things I've worked on in the past.

GitHub allows tagging projects, but in my experience its search engine is not really usable for anything other than searching for lines of code. Anything beyond that, and I get a swamp of results for projects I'm not even part of.

Comment:

This seems like a very clever and flexible way to keep notes.

Could I please ask where you store the contents of the Wordpress blog? Is it private only to you on your own machine? Have you found a way to share some content with collaborators?

Comment:

I have an account with a web hosting provider (Dreamhost) that offers a one-click WordPress installation via a web-based management console. I make my site public, but it is easy to password-protect a site if you want to limit access to collaborators.

Dreamhost is just one of many web hosting providers, so take a look around and see what works for you if that option sounds workable.

If you need something fancier, like directory management (to manage multiple user accounts), you would probably want to set things up yourself instead of using a hosting provider's "one-click" or automated option.

Answer • 4.3 years ago

At the risk of sounding stodgy, I'd say that there is no substitute for deliberate and thoughtful organization right from the beginning. Search tools sometimes work, but they are often more trouble than they're worth. Most university web sites are an example: keyword searches return hundreds of results, and the ones of interest to you are buried among the larger number that you don't want.

The other point to make about a central blog or list is that you have to be diligent about updating it every time you create or delete files. It's much simpler to keep directories self-documenting, and organized in such a way that you can drill down to anything in a few mouse clicks.

A few hints:

  • keep directory names short but meaningful
  • never, ever include blank spaces in directory or file names; use dashes, dots, or underscores instead
  • use consistent and meaningful file extensions. Where a series of steps has been applied to data, document that with serial file extensions (see the sketch after this list). For example, given a file of chitinase proteins called chitinase.fsa, the HTML output of a blastp search of the non-redundant protein database might be saved as chitinase.blastp.nr.html, and a tblastn search of the non-redundant nucleotide database as chitinase.tblastn.nt.html.
  • keep a README file within each directory, documenting what is in that directory. That way, README files aren't part of the problem, they're part of the solution.
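
A minimal sketch of the serial-extension convention in Python (the helper below is hypothetical, just to illustrate how the extensions chain together):

    from pathlib import Path

    def step_output(input_path, *steps, suffix=None):
        """Build an output name by appending one extension per analysis step,
        e.g. chitinase.fsa -> chitinase.blastp.nr.html."""
        path = Path(input_path)
        parts = [path.stem, *steps]          # drop the original extension (.fsa)
        if suffix:
            parts.append(suffix)
        return path.with_name(".".join(parts))

    print(step_output("chitinase.fsa", "blastp", "nr", suffix="html"))
    # chitinase.blastp.nr.html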

I sometimes spend several minutes at the start of a new task, thinking through how to structure a directory hierarchy and how to name files. In the long run, this saves tremendous amounts of frustration when you're trying to find things, or understand the origin of a particular set of files.

Now, if you want something REALLY difficult, why not try to tackle the lab freezer problem :-)

Comment:

README is a good idea.

Answer • 4.3 years ago

I'm naturally very tidy, so an organised file structure also comes naturally. I can quickly find code that I wrote years ago merely by consulting my 'mental map'. If I don't know the exact location of a particular piece of code, then I'll at least know a higher-level directory in which it's located; finding it then becomes a matter of following the directory structure in line with the strengthened neuronal connections in my brain.

As a general guide:

  • 1 directory per organisation, numbered based on when I started working with them (a scaffolding sketch appears at the end of this answer)
      • an 'Admin' directory (contracts, invoices, other misc stuff)
      • a directory per project
          • 'code' directory
          • 'input' data directory
          • 'output' results directory
          • 'library' directory (re-usable functions, misc GTFs, metadata, etc.)
      • 'Publications' directory
      • 'Presentations' directory

So, my 'root' work directory looks something like:

1_DublinInstitute2000_4
2_Carlow2005_9
3_MarineInstitute2009
...
...
12ClinBio2015_
...
19UCL2020

Then, there would be individual project directories inside each, or, in the case of 12ClinBio2015_, dozens more directories relating to more organisations.

Outside of this, I also have:

  • 'programs' (SAMtools, etc.)
  • 'ReferenceMaterial' (genome builds, genome indices, GTFs, etc.)
  • 'Developer', for my 4 Bioconductor packages and other ongoing work that is in development or on GitHub
  • 'Scripts', a 'free-for-all' where I have indeed dumped a whole bunch of messy scripts. I rarely touch this anymore.
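
A small sketch of scaffolding a new project under a layout like this (the helper and the example names are illustrative assumptions, not prescribed by this answer):

    from pathlib import Path

    # Per-project sub-directories described above.
    PROJECT_SUBDIRS = ["code", "input", "output", "library"]

    def new_project(work_root, organisation, project):
        """Create <work_root>/<organisation>/<project>/{code,input,output,library}."""
        project_dir = Path(work_root) / organisation / project
        for sub in PROJECT_SUBDIRS:
            (project_dir / sub).mkdir(parents=True, exist_ok=True)
        return project_dir

    # Example: a new project under the organisation directory "19UCL2020".
    new_project(Path.home() / "work", "19UCL2020", "some_new_project")
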
Answer • 4.3 years ago • h.mon 35k

The paper A Quick Guide to Organizing Computational Biology Projects is a good place to start. I don't follow the guidelines to the letter, but they inspired me to create a system adapted to my needs and my tastes.

In short, I have ~/bin/, ~/db/, ~/src/ and ~/projects/ folders. Each project gets assigned a code, with a corresponding folder (e.g. ~/projects/proj1, ~/projects/proj2, and so on). All the analysis scripts used in a project go under ~/projects/proj1/scripts/, with the results stored somewhat arbitrarily under ~/projects/proj1/toolX/ or ~/projects/proj1/pipelineY/. Tools are installed in versioned folders, such as ~/bin/bwa/0.7.17/, and likewise for databases when possible. I always wanted to keep a README and/or some other kind of metadata, but so far I am doing a poor job in this regard.

I generally deal with small- or medium-sized projects, and the above setup is good enough for my needs. It could certainly be improved, but as it is now, it suits my needs well. I can easily grep and retrieve the information I need, and the hierarchy of folders helps reduce clutter and makes things easy to find.
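
A tiny sketch of the versioned-install idea in Python (the helper name and layout are assumptions based on the description above):

    from pathlib import Path

    def tool_path(name, version, bin_root=Path.home() / "bin"):
        """Return the install directory for a specific tool version,
        e.g. ~/bin/bwa/0.7.17/, raising if it is not installed."""
        path = Path(bin_root) / name / version
        if not path.is_dir():
            raise FileNotFoundError(f"{name} {version} not found under {bin_root}")
        return path

    # Example: pin an analysis to bwa 0.7.17 rather than whatever is on $PATH.
    # bwa_dir = tool_path("bwa", "0.7.17")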

Answer • 4.3 years ago • Ram 44k

I follow a directory structure similar to h.mon's, except it is nowhere near as stringent. My top-level directory is ~/Documents/Projects; I don't add a directory directly under ~ because I hate clutter there.

Within this, each major project category (avenue of research, etc.; this is team-dependent) gets a sub-directory. Under each of those, there might be multiple levels of sub-directories for sub-aspects, depending on how broad the category is. The ultimate "leaf" directory has a date stamp in the format YYYYMMDD as a prefix and a hyphen-separated project descriptor as the name, for example 20200101-batch6-rnaseq.
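
A short sketch of that naming convention (the helper is hypothetical, just to illustrate the YYYYMMDD prefix):

    from datetime import date
    from pathlib import Path

    def leaf_dir(root, descriptor, when=None):
        """Return a date-stamped leaf directory such as 20200101-batch6-rnaseq."""
        when = when or date.today()
        return Path(root) / f"{when:%Y%m%d}-{descriptor}"

    print(leaf_dir("~/Documents/Projects/rnaseq", "batch6-rnaseq", date(2020, 1, 1)))
    # ~/Documents/Projects/rnaseq/20200101-batch6-rnaseq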

Within each of these, I create an R project. The project has at least one R script/notebook, where I document the reason for the project and how each file, starting from the raw data, was generated and saved. That way, I can double-click an R project file to open up a research task, which makes it easier to switch contexts.

Ideally, I'd also connect each of these local directories to the high-throughput/high-volume data processing directories I create in a similar structure on our cluster, so one could trace where each file comes from. However, I don't do this much because I don't have the free time, and it's not worth the effort when everything is organized, timestamped, and named with the longest file names I can allow myself (~75-100 characters).

I also have a lab notebook going on in Markdown that serves as a log as well as a to-do list, and I create weekly reports of tasks done and tasks scheduled for the following week. It can get overwhelming sometimes :-)
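
For illustration, a minimal sketch of generating such a weekly report stub in Markdown (the template and filename scheme are my own assumptions, not this answer's):

    from datetime import date, timedelta
    from pathlib import Path

    def weekly_report(notebook_dir, today=None):
        """Create (if missing) a Markdown stub for this week's report and return its path."""
        today = today or date.today()
        monday = today - timedelta(days=today.weekday())
        path = Path(notebook_dir) / f"{monday:%Y%m%d}-weekly-report.md"
        if not path.exists():
            path.write_text(
                f"# Week of {monday.isoformat()}\n\n"
                "## Done\n\n"
                "## Scheduled for next week\n"
            )
        return path

    # weekly_report(Path.home() / "Documents" / "lab-notebook")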

Answer • 4.3 years ago

I don't have much in the way of links, but the way I work is:

  1. The number one rule is to NEVER store any data on a hard disk that is inside a desktop or laptop machine. All data must be stored on a networked file system where someone's job depends on it being regularly backed up. Normally this means the university file store, which can be mounted on both the desktop and the cluster.
  2. All work is done either in a notebook or in a pipeline. No one shall ever type anything at a Linux prompt or in a Python or R REPL.
  3. Each project has a folder. That folder looks like:

    |- projXXXXX
       |- src/
       |- notebooks/
       |- raw_data/
       |- documents/
       |- web/
       |- figures/
       |- pipeline1/
       |    |- README
       |    |- pipeline_config.yaml
       |    |- pipeline_database.db
       |    |- exported files
       |    |- links to input files
       |- pipeline2/
       |- pipeline3/
       |- README
       |- .Rproj
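
A short sketch of scaffolding that tree in Python (the folder names are taken from the layout above; the function itself is just an illustration):

    from pathlib import Path

    TOP_LEVEL = ["src", "notebooks", "raw_data", "documents", "web", "figures"]

    def scaffold_project(root, proj_id, pipelines=("pipeline1",)):
        """Create the project skeleton shown above and touch its README files."""
        proj = Path(root) / proj_id
        for name in TOP_LEVEL:
            (proj / name).mkdir(parents=True, exist_ok=True)
        for pipe in pipelines:
            pipe_dir = proj / pipe
            pipe_dir.mkdir(parents=True, exist_ok=True)
            (pipe_dir / "README").touch()
        (proj / "README").touch()
        return proj

    # scaffold_project("/mnt/networked_filestore", "projXXXXX", ["pipeline1", "pipeline2"])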
    
Answer • 4.3 years ago

Here is my take on file-based data management (initially written with microscopy image and derived data in mind but should be generally applicable).
