Question

Normalize Large Number Of CEL Files

14.0 years ago

Mike Dewar ★ 1.6k

I'm trying to normalize a set of 508 CEL files, available on GEO:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15907

I've been trying to use the R package aroma.affymetrix:

http://www.aroma-project.org/

to achieve this, but it's randomly crashing with strange errors. The benefit of the aroma package is that it is supposed to work within finite memory. With this package not working, I'm at a bit of a loss how to proceed.

Am I crazy trying to normalise 508 arrays all at once? Or is this a trivial amount compared to the large scale studies? Any advice would be greatly appreciated!

r bioconductor • 11k views

Feel lucky you're not looking to normalize this set :) http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE2109

To echo others, just get more RAM and/or explore the justXYZ() normalization functions.

ADD REPLY • link 13.9 years ago by Geoffjentry ▴ 320

score 9 · Answer 1 · 2010-11-30

14.0 years ago

User 59 13k

In Bioconductor there are RMA and GCRMA functions that operate with much lower memory overheads than other functions. Have a look at justRMA(), justRMALite(), or their GCRMA equivalents.

If you can bear to step outside of R, there is also RMAExpress, which should scale well enough for you to get expression values out of a large number of arrays.

By the way, the link in your post doesn't lead to a GEO entry; I'd be interested to see what experiment has 508 arrays in it. Whether to normalise them all together is a question we could only answer if we knew the experimental design and the comparisons you want to make. You will undoubtedly want to assess potential batch effects in a dataset this large.
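For context on why "together or not" matters: the normalization step in RMA is quantile normalization, which maps every array onto one reference distribution computed from all arrays in the run. Below is a minimal pure-Python sketch of that single step, not the affy implementation. The function name is made up for illustration, ties are not averaged, and background correction and median-polish summarization are left out.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of probe intensities.

    Illustrative only: ties are broken by position rather than averaged.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across all arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_probes), key=a.__getitem__)
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Because the reference depends on every array in the run, adding, dropping, or splitting arrays changes everyone's normalized values, which is why batch composition deserves some thought.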

I had this very same problem last week (870 samples), and solved it by using justRMA() after asking about the different options I had. Aroma would have taken too much time to learn for what I needed, so it was my third option.

ADD REPLY • link 13.9 years ago by Ajene ▴ 90

oops - I've updated the link. The group of arrays is from the Immunological Genome Project

ADD REPLY • link 14.0 years ago by Mike Dewar ★ 1.6k

score 6 · Answer 2 · 2010-11-30

14.0 years ago

David Quigley 11k

You're not crazy. Henrik (the Aroma developer) is very actively developing this project, so try posting a specific bug report to the mailing list after you check to see if there is already an existing report for your error listing. Aroma is a nice package, though there are numerous things you have to do to get everything set up correctly.

I'd be surprised if a decent 64-bit Linux box (e.g., 16+ GB RAM) couldn't run GCRMA on the whole swadge of files all at once.

Seems like the consensus is "get a lot of RAM"! I shall have a go at abusing the computational resources open to me... Thanks for the help

ADD REPLY • link 14.0 years ago by Mike Dewar ★ 1.6k

score 5 · Answer 3 · 2010-11-30

14.0 years ago

Will 4.6k

I was always a big fan of RefPlus: http://bioinformatics.oxfordjournals.org/content/23/18/2493.full

It's also designed to work in finite memory and is very useful if you need to add more samples later. I've also never had a problem with it crashing.

score 2 · Answer 4 · 2010-12-01

You're not crazy: we routinely do normalization on several hundred arrays. However, we use RMA in the Bioconductor packages affy or simpleaffy. Also, we use a machine with 128 GB memory :-) but I think you could get away with less than that; perhaps 32 GB minimum.

This page has some simulations to analyse RMA memory use.
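As a rough sanity check on those figures, here is a back-of-the-envelope estimate of what the raw intensities alone occupy when held as one dense matrix of doubles. The 1050 x 1050 cell count for the Mouse Gene 1.0 ST chip is an assumption on my part, and the helper name is made up. Real peak usage will be a multiple of this, since intermediate copies are typically made during processing.

```python
def raw_intensity_gib(n_arrays, cells_per_array, bytes_per_value=8):
    """Size in GiB of a dense n_arrays x cells_per_array matrix of doubles."""
    return n_arrays * cells_per_array * bytes_per_value / 2**30


# Assumption: 1050 x 1050 cells per array (Mouse Gene 1.0 ST layout).
CELLS_PER_ARRAY = 1050 * 1050

estimate = raw_intensity_gib(508, CELLS_PER_ARRAY)
print(f"508 arrays: about {estimate:.1f} GiB for a single copy of the raw data")
```

Under that assumption a single copy comes to roughly 4 GiB, so 16 to 32 GB of RAM leaves room for a few working copies.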

score 1 · Answer 5 · 2010-12-02

14.0 years ago

Pascal ▴ 130

Since you want to analyze Affymetrix Mouse Gene 1.0 ST Arrays, just use the Affymetrix Power Tools. These command-line tools can do normalization with < 2 GB RAM.

score 0 · Answer 6 · 2010-12-02

You could also have a look at the xps Bioconductor package.

"The package handles pre-processing, normalization, filtering and analysis of Affymetrix GeneChip expression arrays, including exon arrays (Exon 1.0 ST: core, extended, full probesets), gene arrays (Gene 1.0 ST) and plate arrays on computers with 1 GB RAM only. "

score 0 · Answer 7 · 2011-10-21

The possible solutions are, in order of elegance:

1. Use the aroma package.
2. Use the "justRMA"-like functions from the affy package.
3. Switch to a single-array normalization method, like Affymetrix MAS5, normalize the files one by one, then merge the resulting exprs(eset) matrices.
4. (Not recommended) As partially pointed out in this paper, you may divide your dataset into subsets of more than 100 arrays each, then use a multi-array normalization method like RMA or GCRMA (or PLIER or FARMS), and finally merge the output. The result won't be identical to a full-batch normalization of all 500 arrays, but it will be close (differences of less than 0.01 in normalized log expression per probeset). Use this only in combination with point 5, as crazily-behaving samples are the real issue when comparing groups normalized in separate runs.
5. You can always filter out low-quality samples. A fast method is the deleted residuals approach (see here), which uses the KS test to check whether any sample in the dataset diverges significantly from the average expression behavior.
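The last two options can be sketched in a few lines of pure Python. To be clear, this is not the deleted residuals method from the linked reference. It is a much simpler stand-in that splits the file list into balanced subsets of at least 100 arrays, then flags any sample whose intensity distribution sits far (by the two-sample KS statistic) from the pooled distribution of all other samples. The function names and the 0.3 cut-off are my own assumptions; a real screen would use a proper p-value.

```python
from bisect import bisect_right


def split_into_subsets(files, min_size=100):
    """Split files into balanced consecutive subsets of at least min_size each."""
    n_subsets = max(1, len(files) // min_size)
    base, extra = divmod(len(files), n_subsets)
    subsets, start = [], 0
    for i in range(n_subsets):
        size = base + (1 if i < extra else 0)
        subsets.append(files[start:start + size])
        start += size
    return subsets


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def flag_outliers(samples, threshold=0.3):
    """Return names of samples whose KS distance to all other samples exceeds threshold.

    samples maps a sample name to its list of expression values.
    The threshold is an arbitrary illustrative cut-off.
    """
    flagged = []
    for name, values in samples.items():
        pooled = [v for other, vals in samples.items() if other != name for v in vals]
        if ks_statistic(values, pooled) > threshold:
            flagged.append(name)
    return flagged


subset_sizes = [len(s) for s in split_into_subsets(list(range(508)))]
print(subset_sizes)  # [102, 102, 102, 101, 101]
```

Run the outlier screen first, drop what it flags, and only then normalize subset by subset, so that one misbehaving array cannot skew the reference of its subset. Pooling every other sample is fine for a toy example but slow on real data, where a random subsample of probes would be the practical choice.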

Good luck! :-) And no, you are not crazy: I had the same issue with 3700 microarrays once.

score 0 · Answer 8 · 2011-10-23

If you have access to a computer cluster you can try the package "affyPara" (http://www.bioconductor.org/packages/2.8/bioc/html/affyPara.html). It will distribute your data to different machines and supports many functions from the affy package. It will solve your memory problems and, depending on your computer cluster, accelerate your calculation. I was able to normalize 12,000 arrays with rma.