Hello all :)
I use Extended Attributes extensively on my data, to keep track of which reference genome the data was mapped to, how it was mapped, the MD5 checksum of the file, bin size, etc etc, and I find it one of those really useful things that doesn't get the attention it deserves.
If you are not familiar with Extended Attributes, they are simply key/value pairs you can attach to your files which, hopefully, move with the data - http://en.wikipedia.org/wiki/Extended_file_attributes
For example, on Mac OSX:
xattr -w mapping mm9 ./mybam.bam
would store the key 'mapping' with a value of 'mm9' on the file ./mybam.bam
It can be read back with
xattr -p mapping ./mybam.bam
BAM files usually have the reference in the header, but for BigWig/BED/etc. data this is very convenient. Another very practical application in my work has been to store the MD5 checksum in the metadata, because our filenames/paths are always changing (!!), and it also lets you detect accidental filtering/truncation of the data after it is created. For example, after adding the following two functions to your ~/.bashrc on OSX:
writehash() { for file in "$@"; do xattr -w filehash "$(md5 -q "$file")" "$file"; done; }
readhash() { for file in "$@"; do echo -n "$file : "; xattr -p filehash "$file"; done; }
It's easy to attach the MD5 hash to the file(s) once and then recall it instantly, without having to re-hash the whole multi-gigabyte file(s), so you/your databases don't have to rely on file paths.
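For example, re-using the ./mybam.bam file from above:
writehash ./mybam.bam
readhash ./mybam.bam
The second call prints './mybam.bam : ' followed by the stored hash, read straight from the attribute rather than by re-hashing the file.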
I'm sure others can think of some much more creative uses for metadata, and I'd very much like to hear them!
But, before I am really comfortable releasing code that makes use of metadata, I'm curious to know how many of the filesystems in production use in bioinformatics actually support it. The compute servers where I work do not, mainly because the file system is NFS, which has to have Extended Attributes manually enabled when the file system is formatted.
Thus, I would be very grateful if people could comment with a yes or no, so we can get an idea of how prevalent support is. Note that xattr is a Mac binary - check the Wikipedia page above for your distro's equivalent; typically something like this should work on Linux:
touch somefile
setfattr -n "user.demo" -v "test" somefile
getfattr -n "user.demo" somefile
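If you want to check whether a particular mount actually supports them before relying on it, a little throwaway test along these lines should tell you (just a sketch - the function name and the user.test attribute are arbitrary):
testxattr() {
  # try to write and read back a user attribute on a scratch file in the given directory
  local f="$1/.xattrtest.$$"
  touch "$f" &&
    setfattr -n user.test -v yes "$f" 2>/dev/null &&
    getfattr -n user.test "$f" >/dev/null 2>&1 &&
    echo "xattrs work on $1" ||
    echo "xattrs do NOT work on $1"
  rm -f "$f"
}
testxattr /path/to/your/nfs/mount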
Thank you!!! :)
Really cool concept. Even if it only worked on a Mac it would be useful to a lot of people. Large compute nodes run all kinds of filesystems, AFS etc.
Just a note that one can add metadata to (compressed) BED with
starch --note "foo bar baz..."
and retrieve it with
unstarch --note
which has the nice feature of being independent of the file system. You can put a lot of data in there, such as a structured (query-able) and human-readable JSON string.

Yes, the concept is nice. In my case I sometimes use it (on Linux) to tag databank files (SRA, for instance) with the URL they came from. I once made a little app that computed statistics on the reads (things like min/max length) and tagged the reads file with them; one can then quickly look up information about the bank from these tags without having to parse it again.
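Something along those lines with plain extended attributes might look like this on Linux (just a sketch - the attribute names, the bank.fastq.gz file and the URL are made up for illustration):
# remember where the file was downloaded from
setfattr -n user.source_url -v "http://example.org/banks/bank.fastq.gz" bank.fastq.gz
# compute min/max read length once and store them next to the data
zcat bank.fastq.gz \
  | awk 'NR%4==2 {l=length($0); if (min=="" || l<min) min=l; if (l>max) max=l} END {print min, max}' \
  | { read min max; setfattr -n user.read_minlen -v "$min" bank.fastq.gz; setfattr -n user.read_maxlen -v "$max" bank.fastq.gz; }
# later: recall everything without re-reading the file
getfattr -d bank.fastq.gz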
Actually, even though such tags are not "inside" the file itself, I like to compare them to MP3 tags :)
Wow, I like it! I thought MD5s took a long time to compute, but statistics like pileup frequencies, coverage, total signal, etc. take orders of magnitude longer - and they are frequently re-used in normalization steps and the like. A stats-appending tool for common bioinformatics file types would be very useful (see the sketch below) :)
(but only if people can actually use Extended Metadata)
Maybe I'm thinking about this wrong - maybe the 'if you build it, they will come' philosophy would be better suited here.
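For what it's worth, a first cut of such a stats-appending helper could be as small as this (a sketch only - it assumes samtools is on the PATH, a filesystem with user xattrs enabled, and the attribute names are invented):
appendstats() {
  # cache a couple of expensive-to-recompute numbers as extended attributes
  for file in "$@"; do
    setfattr -n user.total_reads  -v "$(samtools view -c "$file")"      "$file"
    setfattr -n user.mapped_reads -v "$(samtools view -c -F 4 "$file")" "$file"
  done
}
# recall them later without touching the multi-gigabyte BAM again
getfattr -d mybam.bam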