Re FASTQ: I quite like fastq. It is very concise and efficient to parse. The biggest problem with fastq is that we are unable to store meta information. A proper format may be worthwhile. A few centers have replaced fastq with BAM, but even so, I think fastq will live on for a long time.
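To illustrate the "efficient to parse" point: a reader fits in a few lines. Here is a minimal sketch in Python, assuming strict four-line records (classic FASTQ technically permits wrapped sequence lines, which this ignores). Note there is nowhere obvious to put run-level meta information except by abusing the name line:

    def read_fastq(path):
        """Yield (name, sequence, quality) from a strict 4-line-per-record FASTQ file."""
        with open(path) as fh:
            while True:
                name = fh.readline()
                if not name:
                    break  # end of file
                seq = fh.readline().rstrip("\n")
                plus = fh.readline()  # separator line, usually just "+"
                qual = fh.readline().rstrip("\n")
                if not name.startswith("@") or not plus.startswith("+"):
                    raise ValueError("malformed FASTQ record")
                yield name[1:].rstrip("\n"), seq, qual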
Re SAM/BAM and BioHDF: There is an effort to store SAM in BioHDF. I know the conversion tools and the indexer were working two years ago, but it is still in alpha as of now. I tried the latest BioHDF briefly. It produced larger files and invoked far more read/lseek system calls. This may be worrying if we access alignments from 1000 individuals simultaneously.
Re VCF: Unlike FASTQ and arguably SAM, VCF holds structured data. I can see the point of improving VCF, but I am not convinced that we can do much better with NoSQL/HDF unless someone shows me the right way.
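To make the "structured data" point concrete, here is a small Python sketch of how one VCF data line decomposes into fixed columns plus an INFO dictionary (the example line is the one from the VCF spec; real files add FORMAT and per-sample columns after INFO, which this ignores):

    def parse_vcf_line(line):
        # Fixed site-level columns, tab-separated; INFO is a ;-separated key[=value] list.
        chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
        info_dict = {}
        for field in info.split(";"):
            key, _, value = field.partition("=")
            info_dict[key] = value if value else True  # flag fields carry no value
        return {"CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref,
                "ALT": alt.split(","), "QUAL": qual, "FILTER": flt, "INFO": info_dict}

    # parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5")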
Re specialized formats vs. HDF/NoSQL/SQL: A generic database engine can hardly beat a specialized binary format. When file size or access speed is really critical, specialized formats such as SRA/BAM almost always win by a large margin. On the other hand, coming up with an efficient and flexible binary format is non-trivial and takes a long time. If the data are not frequently accessed by all end users (e.g. traces/intensities), a format built upon HDF/NoSQL is faster to develop and more convenient to access, and thus better.
Re HDF vs. NoSQL: HDF (not BioHDF) has wider adoption in biology. PacBio and NanoPore have both adopted HDF to some extent. Personally, I like HDF's hierarchical model better. Berkeley DB is too simple, and most recent NoSQL engines are too young; it remains to be seen how they evolve.
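A quick sketch of what I mean by the hierarchical model, using Python/h5py; the group layout and attribute names here are invented for illustration, not PacBio's or NanoPore's actual schema. Run-level meta information (the thing FASTQ cannot hold) attaches naturally as attributes on a group:

    import h5py

    with h5py.File("reads.h5", "w") as f:
        run = f.create_group("run_001")                 # one group per run
        run.attrs["instrument"] = "hypothetical-model"  # meta information lives on the group
        run.attrs["flowcell"] = "FC0001"
        reads = run.create_group("reads")
        dt = h5py.string_dtype(encoding="ascii")
        reads.create_dataset("name", data=["r1", "r2"], dtype=dt)
        reads.create_dataset("sequence", data=["ACGT", "TTGA"], dtype=dt)
        reads.create_dataset("quality", data=["IIII", "FFFF"], dtype=dt)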
My general view is that in NGS, HDF may be a good format for organizing internal data. I think PacBio and NanoPore are taking the right path (I wish Illumina had done the same at the beginning). However, it is not worth exploring NoSQL solutions for the existing end-user data. These solutions are very likely to make the data even bigger, slower to process and harder to access, especially for biologists. I am not sure how big your exome data are. The 1000 Genomes Project has 50TB of alignments in BAM, and the system works so far. I do not think a generic NoSQL engine can work as well in this particular application.
I believe anything that lets us abandon the terrible FASTQ 'format' and allows for a more efficient and structured representation will be beneficial.
Amen to that for VCF. But for FASTQ, what kind of structured information are you thinking about, outside of { name, sequence, qual }?
I wouldn't think about any additional information. I primarily think about something that really is a format in the first place, which requires a stringent format definition. Its syntax should be formalized in a way that allows parsing, e.g. with EBNF, an XML schema, whatever. (Actually, the first thing I would do is prescribe that each fastq record consists of exactly four lines.) To increase efficiency, a binary format (which could e.g. be defined in HDF5, making it platform independent) or even one designed from scratch, à la BAM, would do much better, and such a binary format is well defined by design.
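As a sketch of what such a stringent definition might look like (my own wording, expressed as a Python regular expression rather than EBNF): each record is exactly four lines, the separator line is a bare '+' or repeats the name, and the quality string must match the sequence length:

    import re

    # One strict FASTQ record: '@' + name, sequence over an explicit alphabet,
    # '+' optionally repeating the name, and a printable-ASCII quality string.
    # The input text must be the four lines of one record, including the final newline.
    FASTQ_RECORD = re.compile(
        r"@(?P<name>[^\n]+)\n"
        r"(?P<seq>[ACGTN]+)\n"
        r"\+(?P=name)?\n"
        r"(?P<qual>[!-~]+)\n"
    )

    def is_valid_record(text):
        m = FASTQ_RECORD.fullmatch(text)
        return m is not None and len(m.group("seq")) == len(m.group("qual"))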
Perhaps I should have described the work environment first, http://www.hpcinthecloud.com/hpccloud/2012-02-29/cloud_computing_helps_fight_pediatric_cancer.html, before asking whether NoSQL might excel over HDF. Quoted from that URL: "Before, they'd have to ship hard drives to each other to have that degree of collaboration, and now the data is always accessible through the cloud platform.
"We expect to change the way that the clinical medicine is delivered to pediatric cancer patients, and none of this could be done without the cloud," Coffin says emphatically. "With 12 cancer centers collabor