I'm curious to know what conventions people adopt for archiving their working analyses, especially when someone leaves the organization. I'm not talking about archiving for publication; there are solutions for that. I'm talking about the equivalent of a lab notebook. Most labs keep their scientific notebooks as references for future projects, protocols, etc. But for computational types, analysts with directories in the file system, their "notebooks" are typically their directories. Sometimes those directories are full of large files. BAM files, in particular, usually record the command line that was used to generate them in the header, which can be viewed with samtools, e.g.:
samtools view -H accepted_hits.bam
[...]
@PG ID:TopHat VN:2.0.10 CL:/n/local/bin/tophat --GTF mm10.Ens_73.cuff.gtf -p 3 -g 1 -o s_7_1_AGTCAA.tophat bowtie-index/mm10 C337LACXXa/s_7_1_AGTCAA.fastq.gz
Once a researcher leaves, an organization has to make a choice about what to do with their directories. Scientific notebooks are essentially kept forever. For analysts, I don't see why we shouldn't do the same with their directories, except that keeping large files may not be necessary if they have served their purpose and the commands that generated them are known. Thus, for archiving purposes, if the directory would otherwise just be removed, is there any reason not to simply replace a BAM file with the command used to generate it, i.e. accepted_hits.bam -> accepted_hits.bam.cl (assuming the BAM file is not actively being used in any project)?
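As a rough sketch of the replacement idea above, the recorded command lines could be pulled out of the header before the BAM file is removed. The helper name extract_pg_commands is hypothetical, and the sketch assumes the header uses tab-separated fields (as the SAM format specifies) and that the tool actually wrote a CL: field in its @PG record, which not every tool does.

```shell
#!/bin/sh
# Hypothetical helper: read a SAM/BAM text header on stdin and print one line
# per @PG record that has a CL: field, in the form "<program ID><TAB><command line>".
extract_pg_commands() {
    awk -F '\t' '
        $1 == "@PG" {
            id = ""; cl = ""
            # Header fields are TAG:VALUE pairs; pick out ID and CL.
            for (i = 2; i <= NF; i++) {
                if (substr($i, 1, 3) == "ID:") id = substr($i, 4)
                if (substr($i, 1, 3) == "CL:") cl = substr($i, 4)
            }
            # Skip @PG records that carry no command line.
            if (cl != "") printf "%s\t%s\n", id, cl
        }'
}

# Intended use (assumes samtools is on PATH):
#   samtools view -H accepted_hits.bam | extract_pg_commands > accepted_hits.bam.cl
```

The .cl file only records what was run. Whether the BAM file can really be regenerated still depends on the inputs, the reference, and the same software version being available.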
What do people do with /home/user or other directories belonging to people who leave? Are there guidelines in place in your organization for dealing with large files that are not primary data?
You can regenerate FASTQ from a BAM file, but going the other way, from FASTQ back to the BAM file, requires rerunning the alignment. To be able to do that, you should also archive the software (with its dependencies) and the reference sequence.