Normalize a Large Number of CEL Files
14.1 years ago
Mike Dewar ★ 1.6k

I'm trying to normalize a set of 508 CEL files, available on GEO:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15907

I've been trying to use the R package aroma.affymetrix:

http://www.aroma-project.org/

to achieve this, but it's randomly crashing with strange errors. The benefit of the aroma package is that it is supposed to work within finite memory. With this package not working, I'm at a bit of a loss as to how to proceed.

Am I crazy trying to normalise 508 arrays all at once? Or is this a trivial amount compared to the large scale studies? Any advice would be greatly appreciated!

r bioconductor • 11k views
Feel lucky you're not looking to normalize this set :) http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE2109

To echo others, just get more RAM and/or explore the justXYZ() normalization functions.

14.1 years ago
User 59 13k

In Bioconductor there are RMA and GCRMA functions that operate with much lower memory overheads than other functions. Have a look at justRMA(), justRMALite(), or their GCRMA equivalents.

If you can bear to step outside of R, there is also RMAExpress, which should scale well enough for you to get expression values out of a large number of arrays.

By the way, the link in your post doesn't lead to a GEO entry; I'd be interested to see what experiment has 508 arrays in it. Whether normalising them all together makes sense is a question we could only answer if we knew the experimental design and what comparisons you wanted to make. You will undoubtedly want to assess potential batch effects in a dataset this large.

I had this very same problem last week (870 samples), and solved it by using justRMA() after asking around about the different options I had. Aroma would have taken too much time to learn for what I needed, so it was my third option.

oops - I've updated the link. The group of arrays is from the Immunological Genome Project

14.1 years ago

You're not crazy. Henrik (the Aroma developer) is very actively developing this project, so try posting a specific bug report to the mailing list after you check to see if there is already an existing report for your error listing. Aroma is a nice package, though there are numerous things you have to do to get everything set up correctly.

I'd be surprised if a decent 64-bit Linux box (e.g., 16+ GB RAM) couldn't run GCRMA on the whole swadge of files all at once.

Seems like the consensus is "get a lot of RAM"! I shall have a go at abusing the computational resources open to me... Thanks for the help

14.1 years ago
Will 4.6k

I was always a big fan of RefPlus: http://bioinformatics.oxfordjournals.org/content/23/18/2493.full

It's also designed to work in finite memory and is very useful if you need to add more samples later. I've also never had a problem with it crashing.
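To illustrate the general idea behind reference-based normalization, here is a rough sketch in Python. It is not RefPlus's actual code or API; the function names and toy values are made up for illustration. A reference distribution is built once from a set of reference arrays, and each later array is mapped onto it by rank, so only one array plus the reference has to be in memory at a time. The sketch assumes every array has the same number of probes and breaks ties arbitrarily.

```python
def build_reference(reference_arrays):
    """Average the sorted intensities of the reference arrays, rank by rank."""
    n = len(reference_arrays)
    sorted_arrays = [sorted(a) for a in reference_arrays]
    return [sum(vals) / n for vals in zip(*sorted_arrays)]


def normalize_to_reference(array, reference):
    """Replace each intensity with the reference value of the same rank."""
    # Indices of the array ordered from smallest to largest intensity.
    order = sorted(range(len(array)), key=lambda i: array[i])
    out = [0.0] * len(array)
    for rank, idx in enumerate(order):
        out[idx] = reference[rank]
    return out


# Toy data: two reference "arrays" with three probes each (hypothetical values).
reference = build_reference([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])

# A sample added later is normalized against the stored reference only.
new_sample = normalize_to_reference([10.0, 30.0, 20.0], reference)
print(reference)   # [1.5, 3.5, 5.5]
print(new_sample)  # [1.5, 5.5, 3.5]
```

Because the reference is fixed, adding samples later does not change the values already computed for earlier samples.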

14.1 years ago
Neilfws 49k

You're not crazy: we routinely do normalization on several hundred arrays. However, we use RMA in the Bioconductor packages affy or simpleaffy. Also, we use a machine with 128 GB memory :-) but I think you could get away with less than that; perhaps 32 GB minimum.

This page has some simulations to analyse RMA memory use.
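A rough back-of-envelope estimate shows why RAM becomes the limiting factor. The sketch below only counts one raw intensity matrix stored as 8-byte doubles. The figure of 1050 x 1050 cells per array is an assumption about the array layout, and the function name is made up for illustration. Real RMA runs can hold several copies or intermediate matrices, so actual peak usage may be a multiple of this.

```python
def intensity_matrix_bytes(n_arrays, cells_per_array, bytes_per_value=8):
    """Size of one dense matrix holding every cell intensity of every array."""
    return n_arrays * cells_per_array * bytes_per_value


# Assumption: 1050 x 1050 cells per array; adjust for your chip type.
CELLS_PER_ARRAY = 1050 * 1050

total = intensity_matrix_bytes(508, CELLS_PER_ARRAY)
print(f"{total / 2**30:.1f} GiB")  # 4.2 GiB for a single copy of the raw data
```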

14.1 years ago
Pascal ▴ 130

Since you want to analyze Affymetrix Mouse Gene 1.0 ST arrays, just use the Affymetrix Power Tools. These command-line tools can do the normalization with less than 2 GB of RAM.

14.1 years ago
Puthier ▴ 250

You could also have a look at the xps Bioconductor package.

"The package handles pre-processing, normalization, filtering and analysis of Affymetrix GeneChip expression arrays, including exon arrays (Exon 1.0 ST: core, extended, full probesets), gene arrays (Gene 1.0 ST) and plate arrays on computers with 1 GB RAM only."

13.2 years ago

The possible solutions are, in order of elegance:

  1. Use the aroma package
  2. Use the "justRMA"-like functions from the affy package
  3. Switch to a single-array normalization method, like Affymetrix MAS5, normalize the files one by one, then merge the resulting exprs(eset) matrices.
  4. (Not recommended) As partially pointed out in this paper, you may divide your dataset into subsets of more than 100 arrays each, run a multi-array normalization method like RMA or GCRMA (or PLIER or FARMS) on each subset, and finally merge the output. The result won't be identical to a full-batch normalization of all 500 arrays, but it will be close (differences on the order of <0.01 in normalized log expression per probeset). Use this only in combination with point 5, as crazily-behaving samples are the real issue when comparing groups normalized in separate runs.
  5. You can always filter out low-quality samples. A fast method is the deleted residuals approach (see here) which checks if any sample in the dataset significantly diverges from the average expression behavior using the KS test.
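As a loose illustration of the screening idea in point 5, the Python sketch below compares each sample's intensity distribution against the pooled values of all other samples using the two-sample Kolmogorov-Smirnov statistic. This is a simplification and not the deleted residuals method itself. The function names, the toy data, and the cutoff of 0.6 are all made up for illustration; a real analysis would use a proper significance test rather than a fixed cutoff.

```python
def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both pointers past every value <= v, then compare the CDFs.
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def flag_outliers(samples, threshold):
    """Return names of samples whose KS distance to the rest exceeds threshold."""
    flagged = []
    for name, values in samples.items():
        pooled = [v for other, vals in samples.items() if other != name for v in vals]
        if ks_statistic(values, pooled) > threshold:
            flagged.append(name)
    return flagged


# Toy log-expression values (hypothetical); sample "d" is shifted far upward.
samples = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.5, 2.5, 3.5, 4.5],
    "c": [2.0, 3.0, 4.0, 5.0],
    "d": [11.0, 12.0, 13.0, 14.0],
}
print(flag_outliers(samples, threshold=0.6))  # ['d']
```

Samples flagged this way would be inspected or dropped before the separately normalized subsets are merged.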

Good luck! :-) And no, you are not crazy: I had the same issue with 3700 microarrays once.

13.2 years ago

If you have access to a computer cluster you can try the package "affyPara" (http://www.bioconductor.org/packages/2.8/bioc/html/affyPara.html). It distributes your data across different machines and supports many functions from the affy package, which should solve your memory problems and speed up the calculation. How far it scales depends on your cluster; I was able to normalize 12,000 arrays with rma.
