Normalization for microarrays >1000+ samples?
14 months ago
evmae • 0

Hi all, I am trying to normalize microarray data starting from CEL files for 1000+ patients, on the Affymetrix HTA 2.0 array. On my cluster it currently only works in batches of about 350 patients, but the normalization doesn't look good because it is done per batch. I consistently get a segmentation fault when I try to run the oligo::rma() step on all files at once. I've tried many different things; a few examples are listed below. I'm open to any suggestions on how to fix this, or to other packages that might do the job. I'm using R/4.2.2. Also: the data can't be read in with the affy package. Thank you!

#########################################################################################
###Trial 1: default

library(oligoClasses)
library(oligo)
library(pd.hta.2.0)

setwd(cel.dir)                      # cel.dir: directory containing the CEL files
celFile  <- list.files(".")
celFile  <- celFile[1:500]          # subsetting; batches of ~350 work, the full set does not
rawbatch <- read.celfiles(celFile)
RMA      <- oligo::rma(rawbatch, target = "core")

#Error: Segmentation fault

#########################################################################################
###Trial 2: running each step separately

back_corrected   <- backgroundCorrect(rawbatch)
oligo_normalised <- normalize(back_corrected, method = "quantile", which = "pm")
oligo_summarised <- rma(oligo_normalised, background = FALSE, normalize = FALSE, target = "core")

#Error: Segmentation fault

#########################################################################################
###Trial 3: trying the ff package

#Section 7.1 support for large datasets: https://www.bioconductor.org/packages/release/bioc/vignettes/oligo/inst/doc/oug.pdf

library(ff)          # load ff before oligo so intensities are stored on disk
library(foreach)
library(doMC)
registerDoMC(2)      # two parallel workers
library(oligo)

rawbatch <- read.celfiles(celFile)

# Error in if (length < 0 || length > .Machine$integer.max) stop("length must be between 0 and .Machine$integer.max") : 
#   missing value where TRUE/FALSE needed
# In addition: Warning message:
# In ff(initdata = initdata, vmode = vmode, dim = dim, pattern = file.path(ldPath(),  :
#   NAs introduced by coercion to integer range
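
# Back-of-the-envelope check of this error (the probe count is an assumption;
# HTA 2.0 is on the order of 6-7 million probes):
n_probes  <- 6.5e6
n_samples <- 500
n_probes * n_samples       # ~3.2e9 cells in a single intensity matrix
.Machine$integer.max       # 2147483647 (~2.1e9), so ff's length coercion overflows to NA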

#########################################################################################
#I was going to try changing ocSamples() or ocProbesets(), but I can't even read
#the data in properly with ff without a memory error (see above)
ocSamples(50)

#Methods (rma) uses batches to process data. When possible (eg., background correction),
#it uses at most ocSamples() samples simultaneously at processing. For procedures that
#process probes (like summarization), a maximum of ocProbesets() are used simultaneously.
#Therefore, the user should tune these parameters for a better performance.
#https://www.rdocumentation.org/packages/oligo/versions/1.36.1/topics/ocSamples
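
A sketch of that tuning, for reference (the batch sizes and the use of tempdir() below are my own guesses, not values I have validated at this scale):

library(ff)          # must be loaded before oligo so intensities live on disk
library(oligo)
ldPath(tempdir())    # where ff keeps its on-disk files; point this at fast scratch space
ocSamples(100)       # process at most 100 samples at a time
ocProbesets(50000)   # summarize at most 50,000 probesets at a time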

How much memory is available?

I believe our compute nodes can handle 350 GB of RAM:

#SBATCH --nodes=1
#SBATCH --mem=350G

I've also set up jobs like this before, but I don't know how to use parallel computing for this task:

#SBATCH --cpus-per-task=30
#SBATCH --gres=gpu:4

Parallelization has nothing to do with this; it's all single-threaded. My guess is that you have at least one bad file. I would use a for loop to read every file with the read.celfiles function and see which one throws an error (a sketch follows below). Also be sure to run ls -lh on all files on the command line to see whether any are empty / size zero.
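
Something like this (cel.dir is an assumed path for the CEL directory):

library(oligo)
celFiles <- list.files(cel.dir, pattern = "\\.CEL(\\.gz)?$", full.names = TRUE)
for (f in celFiles) {
  message("Reading ", f)   # if R segfaults, the last file printed is the suspect
  res <- tryCatch(read.celfiles(f), error = function(e) e)
  if (inherits(res, "error")) message("  -> error: ", conditionMessage(res))
}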

Thanks, I've thought of that too and already checked it. I can run the normalization properly on all files in batches of 350, sorry if I didn't make that clear. The default read.celfiles function works fine on everything; it fails at the oligo::rma() line only when I use all 1000 samples.

I was talking about parallel because of Trial 3 in my note (see section 7.2 of the manual linked below), where ff isn't working for me to read in the files. The oligo support manual discusses parallelization as an option for large datasets: https://www.bioconductor.org/packages/release/bioc/vignettes/oligo/inst/doc/oug.pdf. It's similar to this error, which comes down to .Machine$integer.max being exceeded when loading this many files: https://support.bioconductor.org/p/49877/. Any other recommendations?

This is pretty off-brand for me and I don't really recommend my own answer, but since it seems like you've already run multiple batches of 350 and you only have 1000 samples total, why not just run 10 or 20 sets of 350, making sure each sample is represented at least K times, then investigate whether the results differ significantly for any sample between the batches in which that sample is a constituent member (of the 350)? A rough sketch of the comparison is below.
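
Something along these lines (e1 and e2 are hypothetical exprs() matrices from the RMA output of two batch runs that share some samples):

# correlate each shared sample's expression values across two overlapping batches
shared <- intersect(colnames(e1), colnames(e2))
for (s in shared) {
  message(s, ": r = ", round(cor(e1[, s], e2[, s]), 3))
}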

Hmm, yeah, that could be an option... it seems like there should be an easier fix, though. I was hoping there might be an alternative package or method to try, since the rma() part is literally one line of code. It's annoying that microarrays have been around for so long and I still can't solve this quickly.

It's odd. I have processed microarray data of that size before on a standard laptop with rma, so memory should not be an issue. I don't have suggestions, unfortunately, but memory should not be the problem. I would monitor memory with top on the command line while you run rma to see whether that really is the bottleneck. Back in the day, when people used arrays routinely and RNA-seq was not yet a thing, hardly any machine had your 350 GB of RAM.
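
You can also check from inside R (assuming the data were read without ff, and rawbatch is the object returned by read.celfiles() above):

gc(reset = TRUE)                                   # baseline after a full garbage collection
print(object.size(exprs(rawbatch)), units = "GB")  # size of the raw intensity matrix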

I was thinking this too. Maybe the way one component works with another is inefficient; no idea, I can't speculate.

Lots of other packages implement rma, if you aren't married to a particular one.

maEndToEnd is decent.

Thanks, I checked, and unfortunately the maEndToEnd workflow still uses oligo::rma().

Here is a more reproducible example, to see if others hit the same issue when normalizing 500-1000+ samples from raw CEL files (using GSE88884 or GSE88885):

library(oligoClasses)
library(oligo)
library(pd.hta.2.0)
library(GEOquery)   # for getGEOSuppFiles()
#downloaded pd.hta.2.0 from https://bioconductor.org/packages/release/data/annotation/html/pd.hta.2.0.html
# and then: install.packages(path_to_file, repos=NULL, type="source")

filePaths <- getGEOSuppFiles("GSE88884")   # n = 1820
# due to the large size you may need to download GSE88884_RAW.tar (42.1 Gb) manually
# GSE88885 (n = 908) is the smaller alternative: GSE88885_RAW.tar (21.4 Gb)
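
# if you downloaded the tarball manually, unpack it into the CEL directory first
# (cel.dir is an assumed path, as above):
untar("GSE88884_RAW.tar", exdir = cel.dir)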

setwd(cel.dir)
celFile  <- list.files(".")
celFile  <- celFile[1:500]    # of course I want to use all files... but subsetting to around where it breaks for me
rawbatch <- read.celfiles(celFile)
RMA      <- oligo::rma(rawbatch, target = "core")

If anyone could try these, or has a step-by-step workflow that works on these datasets, please let me know! Thank you.
