Microarray Dataset Formats: What Should My Program Know How To Read?
14.0 years ago

I've written an application that biologists can use to analyze microarray data. Often the hardest part of using someone else's tool is getting the data into it in the first place. I've defined my own simple data format, but that may be a barrier to entry for non-technical users. I'm writing an "Import Dataset" function designed to read in expression data, sample data, and probe descriptions from various formats. My question:

What well-defined formats should my software know how to read?

Assume the expression data are normalized already (e.g. an NCBI GEO Matrix file) rather than CEL files or some other raw data. Text-based formats popularized by well-established tools that encapsulate both sample and probe data would be most helpful.

microarray format software • 3.5k views
14.0 years ago
User 59 13k

Hmm. If the data is normalised already, chances are it could have been generated by any one of a number of packages, for any one of a number of platforms. I'd argue that this is a harder case to cover than just taking in the original data and processing it yourself, especially for something like an Affy chip. Parsers for the probe-level and gene-level outputs from the Illumina platform should be straightforward, and essential. GEO and ArrayExpress parsers too - already implemented in Bioconductor etc. anyway. I don't know whether you're going to want, or need, converters for the various ID types. To be honest, I quite often get data with very little annotation information in it at all. Most people would like to be able to attach more in this case, and you can't rely on the underlying data source to have it.

I don't really know of any standard formats for describing the setup of the experiment, other than the phenoData-style tab-delimited descriptions used for Bioconductor packages.
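For illustration, a tab-delimited sample description of that kind could be read along these lines. This is only a minimal sketch in Python, assuming one row per sample with the sample name in the first column; the function name `read_phenodata`, the column names, and the example values are all made up here, not taken from any particular package.

```python
import csv
import io


def read_phenodata(handle):
    """Read a tab-delimited sample sheet (assumed layout: header row,
    then one row per sample, sample name in the first column)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = {}
    for row in reader:
        # Skip blank lines, which often creep into hand-edited files.
        if not row or not any(cell.strip() for cell in row):
            continue
        samples[row[0]] = dict(zip(header[1:], row[1:]))
    return header[1:], samples


# Hypothetical example content, for demonstration only.
example = (
    "sample\ttreatment\ttimepoint\treplicate\n"
    "S1\tuntreated\t0h\t1\n"
    "S2\ttreated\t6h\t1\n"
)
columns, samples = read_phenodata(io.StringIO(example))
```

Keeping each sample's annotations as a plain dictionary makes it easy to let the user add columns later, since the source file may carry very little annotation.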

Let's face it: if you're dealing with biologists and their data, you'd need to write an import function for Excel files that could also read their minds as to what the contents of said file might be ;)


+1 for starting with non-normalized data because of better-defined file format standards


The software was written as a stand-alone executable and currently doesn't require R, so it doesn't have native access to Bioconductor. While I generally prefer to start from raw data, it's not always available, and sometimes for a quick look I am okay with using Matrix files from GEO. Mostly I need to describe the samples (e.g. "Mutant vs. WT", "Treated vs. Untreated") and know what platform was used.
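As a rough idea of what reading a GEO Series Matrix file without R might involve, here is a minimal Python sketch. It assumes the usual layout: tab-separated metadata lines starting with `!Series_` or `!Sample_`, followed by the expression table between `!series_matrix_table_begin` and `!series_matrix_table_end`. The function names and the example values are made up for illustration, and real files may need more careful handling of missing values.

```python
import csv
import io


def to_float(text):
    """Convert a cell to float; anything unparseable (e.g. empty) becomes None."""
    try:
        return float(text)
    except ValueError:
        return None


def read_series_matrix(handle):
    meta = {}      # metadata key -> list of value rows (keys may repeat)
    header = None  # sample accessions from the "ID_REF" line
    rows = {}      # probe ID -> list of expression values
    in_table = False
    for fields in csv.reader(handle, delimiter="\t", quotechar='"'):
        if not fields or not fields[0]:
            continue
        key = fields[0]
        if key == "!series_matrix_table_begin":
            in_table = True
        elif key == "!series_matrix_table_end":
            in_table = False
        elif in_table:
            if header is None:
                header = fields[1:]
            else:
                rows[key] = [to_float(v) for v in fields[1:]]
        elif key.startswith("!"):
            meta.setdefault(key[1:], []).append(fields[1:])
    return meta, header, rows


# Tiny made-up example in the Series Matrix layout.
example = "\n".join([
    '!Series_title\t"Example series"',
    '!Series_platform_id\t"GPL570"',
    '!Sample_title\t"Mutant rep1"\t"WT rep1"',
    '!Sample_geo_accession\t"GSM1"\t"GSM2"',
    '!series_matrix_table_begin',
    '"ID_REF"\t"GSM1"\t"GSM2"',
    '"1007_s_at"\t7.25\t',
    '"1053_at"\t5.5\t6.0',
    '!series_matrix_table_end',
]) + "\n"
meta, header, rows = read_series_matrix(io.StringIO(example))
```

In this sketch the platform would come from `meta["Series_platform_id"]` and the sample descriptions from the `Sample_title` and similar `Sample_` lines, which covers the "what platform, which samples" part. Probe descriptions would still have to come from somewhere else.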


Fair enough, sounds like quite an endeavour anyway! You don't always need to capture much more than treatment, timepoint and replicate information - at least not in my experience. I still don't think there's a standard format for this, but I like the way GeneSpring uses 'conditions' and 'interpretations' to capture and condense this data. I still end up typing it all in, though.
