I've written an application that biologists can use to perform analysis on microarray data. Often the hardest part of using someone else's tool is getting the data into it in the first place. I've defined my own simple data format, but that may be a barrier to entry for non-technical users. I'm writing an "Import Dataset" function, designed to pull in expression data, sample data, and probe descriptions from various formats. My question:
What well-defined formats should my software know how to read?
Assume the expression data are normalized already (e.g. an NCBI GEO Matrix file) rather than CEL files or some other raw data. Text-based formats popularized by well-established tools that encapsulate both sample and probe data would be most helpful.
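For reference, here is a rough sketch of what reading a GEO Series Matrix file involves. It assumes the usual layout: tab-delimited text, metadata lines that start with `!`, and an expression table sitting between `!series_matrix_table_begin` and `!series_matrix_table_end` with an `ID_REF` header row. The function name `read_series_matrix` and the `DEMO` text are made up for illustration, and real files may need more careful handling of missing values.

```python
import csv
import io


def to_float(value):
    """Convert a table cell to float; empty or non-numeric cells become None."""
    try:
        return float(value)
    except ValueError:
        return None


def read_series_matrix(handle):
    """Split a Series Matrix file into metadata, sample IDs and expression rows.

    Hypothetical helper: assumes tab-delimited text with double-quoted strings.
    """
    metadata = {}    # "Series_title" -> list of value rows (keys can repeat)
    sample_ids = []  # column order of the expression table
    expression = {}  # probe ID -> list of floats (None where missing)
    in_table = False
    for row in csv.reader(handle, delimiter="\t", quotechar='"'):
        if not row or not row[0]:
            continue  # skip blank lines
        key = row[0]
        if key == "!series_matrix_table_begin":
            in_table = True
        elif key == "!series_matrix_table_end":
            in_table = False
        elif in_table:
            if key == "ID_REF":
                sample_ids = row[1:]
            else:
                expression[key] = [to_float(v) for v in row[1:]]
        elif key.startswith("!"):
            metadata.setdefault(key[1:], []).append(row[1:])
    return metadata, sample_ids, expression


# Tiny invented example, not real GEO data.
DEMO = (
    '!Series_title\t"Demo series"\n'
    '!Series_platform_id\t"GPL570"\n'
    '!Sample_title\t"WT rep1"\t"Mutant rep1"\n'
    '\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"1007_s_at"\t7.5\t\n'
    '!series_matrix_table_end\n'
)

metadata, sample_ids, expression = read_series_matrix(io.StringIO(DEMO))
```

Metadata values are kept as a list of rows per key because some keys can appear on more than one line.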
+1 for starting with non-normalized data because of better-defined file format standards
The software was written as a stand-alone executable and currently doesn't require R, so it doesn't have native access to Bioconductor. While I generally prefer to start from raw data, it's not always available, and sometimes for a quick look I am okay with using Matrix files from GEO. Mostly I need to describe the samples (e.g. "Mutant vs. WT", "Treated vs. Untreated") and know what platform was used.
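To make "describe the samples and know the platform" concrete, here is a minimal sketch that reads only the header lines of a Matrix file and turns the per-sample rows into one record per sample. It assumes the header carries `!Series_platform_id`, `!Sample_geo_accession`, `!Sample_title` and (possibly repeated) `!Sample_characteristics_ch1` lines; the function name `describe_samples` and the `HEADER` text are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def describe_samples(handle):
    """Return (platform IDs, list of per-sample dicts) from the header lines.

    Hypothetical helper: each dict maps a field name such as "title" or
    "characteristics_ch1" to the list of values found for that sample.
    """
    platforms = []
    columns = defaultdict(list)  # field name -> list of rows, one cell per sample
    for row in csv.reader(handle, delimiter="\t", quotechar='"'):
        if not row:
            continue
        key = row[0]
        if key == "!series_matrix_table_begin":
            break  # metadata ends where the expression table starts
        if key == "!Series_platform_id":
            platforms.extend(row[1:])
        elif key.startswith("!Sample_"):
            columns[key[len("!Sample_"):]].append(row[1:])

    # Use the accession row to decide how many samples there are.
    accessions = columns.get("geo_accession", [[]])[0]
    samples = []
    for i in range(len(accessions)):
        samples.append(
            {field: [r[i] for r in rows if i < len(r)]
             for field, rows in columns.items()}
        )
    return platforms, samples


# Tiny invented header, not real GEO data.
HEADER = (
    '!Series_platform_id\t"GPL570"\n'
    '!Sample_title\t"WT rep1"\t"Mutant rep1"\n'
    '!Sample_geo_accession\t"GSM1"\t"GSM2"\n'
    '!Sample_characteristics_ch1\t"genotype: WT"\t"genotype: mutant"\n'
    '!Sample_characteristics_ch1\t"treatment: untreated"\t"treatment: untreated"\n'
    '!series_matrix_table_begin\n'
)

platforms, samples = describe_samples(io.StringIO(HEADER))
```

The `characteristics_ch1` values are free text, so grouping samples into something like "Mutant vs. WT" would likely still need a manual mapping step on top of this.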
Fair enough, sounds like quite an endeavour anyway! You don't always need to capture much more than treatment, timepoint and replicate information, at least not in my experience. I still don't think there's a standard format for this, but I like the way GeneSpring uses 'conditions' and 'interpretations' to capture and condense this data. I still end up typing it all in, though.