Hi all, I am trying to normalize microarray data starting from the CEL files of 1000+ patients on the Affymetrix HTA 2.0 array. On my cluster it currently only works in batches of about 350 patients, but the normalization doesn't look good because it is done per batch. I consistently get a segmentation fault (memory issue) when I try to run the oligo::rma() step on all of the files at once. Please help, I've tried so many different things haha. I'm listing a few examples below. Open to any suggestions on how to fix this, or to other packages that might do the job! I'm using R/4.2.2. Also: these arrays can't be read in with the affy package. Thank you!
#########################################################################################
###Trial 1: default
library(oligoClasses)
library(oligo)
library(pd.hta.2.0)
setwd(cel.dir)                        # cel.dir = directory containing only the CEL files
celFile <- list.files(".")            # all 1000+ CEL file names
celFile <- celFile[1:500]             # first 500 files for this trial
rawbatch <- read.celfiles(celFile)
RMA <- oligo::rma(rawbatch, target = "core")
#Error: Segmentation fault
#########################################################################################
###Trial 2: running each step separately
back_corrected <- backgroundCorrect(rawbatch)                                    # rawbatch from Trial 1
oligo_normalised <- normalize(back_corrected, method = "quantile", which = "pm")
oligo_summarised <- rma(oligo_normalised, background = FALSE, normalize = FALSE)
#Error: Segmentation fault
#########################################################################################
###Trial 3: trying the ff package
#Section 7.1 support for large datasets: https://www.bioconductor.org/packages/release/bioc/vignettes/oligo/inst/doc/oug.pdf
library(ff)
library(foreach)
library(doMC)
registerDoMC(2)
library(oligo)
rawbatch <- read.celfiles(celFile)   # fails here once ff is loaded
# Error in if (length < 0 || length > .Machine$integer.max) stop("length must be between 0 and .Machine$integer.max") :
# missing value where TRUE/FALSE needed
# In addition: Warning message:
# In ff(initdata = initdata, vmode = vmode, dim = dim, pattern = file.path(ldPath(), :
# NAs introduced by coercion to integer range
#########################################################################################
#I was going to try changing the ocSamples() / ocProbesets() values, but I can't even read the data in
#properly with ff without hitting a memory error (see above)
ocSamples(50)
#Methods (rma) uses batches to process data. When possible (eg., background correction),
#it uses at most ocSamples() samples simultaneously at processing. For procedures that
#process probes (like summarization), a maximum of ocProbesets() are used simultaneously.
#Therefore, the user should tune these parameters for a better performance.
#https://www.rdocumentation.org/packages/oligo/versions/1.36.1/topics/ocSamples
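This is roughly the tuning I was going to try, following the vignette; the ldPath() scratch directory and the chunk sizes below are placeholders I haven't been able to test, since read.celfiles() already fails for me once ff is loaded:
# Untested sketch of the intended ocSamples()/ocProbesets() tuning;
# the scratch path and chunk sizes are placeholders, not recommendations.
library(ff)
library(oligoClasses)
library(oligo)

ldPath("/scratch/hta_ff_tmp")   # hypothetical directory for the on-disk ff files
ocSamples(50)                   # process at most 50 samples at a time
ocProbesets(50000)              # process at most 50,000 probesets at a time

rawbatch <- read.celfiles(celFile)             # this is where ff currently errors out for me
RMA <- oligo::rma(rawbatch, target = "core")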
How much memory is available?
I believe our compute nodes can handle 350GB of RAM:
#SBATCH --nodes=1
#SBATCH --mem=350G
I've also set up jobs like this before, but I don't know how to use parallel computing for this task:
#SBATCH --cpus-per-task=30
#SBATCH --gres=gpu:4
Parallel has nothing to do with this. It's all single-threaded. My guess is that you have at least one bad file. Be sure to run
ls -lh
on all the files on the command line to see whether any are empty / size zero. I would also use a for loop to read every file with the read.celfiles function and see which one throws an error.
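Something like this untested sketch (assuming your CEL file names are in celFile, as in your Trial 1):
# read each CEL file on its own and report which ones fail
library(oligo)
bad <- character(0)
for (f in celFile) {
  ok <- tryCatch({ read.celfiles(f); TRUE },
                 error = function(e) { message(f, ": ", conditionMessage(e)); FALSE })
  if (!ok) bad <- c(bad, f)
}
bad   # any files that could not be read on their own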
Thanks haha, I've thought of that too and already checked it. I can run the normalization properly on all files in batches of 350, sorry if I didn't make that clear. The default read.celfiles call works fine; it only fails at the oligo::rma() line when I use all 1000 samples.
I brought up parallel because of Trial 3 in my note above: section 7.2 of the oligo support manual discusses parallel processing (via ff) as an option for large datasets, but ff isn't even working for me to read in the files. https://www.bioconductor.org/packages/release/bioc/vignettes/oligo/inst/doc/oug.pdf. It looks similar to this error, https://support.bioconductor.org/p/49877/, which has to do with .Machine$integer.max being exceeded when trying to load this many files. Any other recommendations?
This is pretty off-brand for me and I don't really recommend my own answer, but since it seems like you've already run multiple batches of 350 and you only have 1000 total, why not just run 10 or 20 sets of 350, making sure each sample is represented at least K times, and then investigate whether the results differ significantly for any sample between the batches in which that sample is a constituent member (of the 3 fiddy)? A rough sketch is below.
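Untested sketch of that idea: assign every CEL file to K random batches of ~350, RMA each batch separately, then compare a sample's expression values across the batches that contain it. K, the seed, and the assumption that the expression matrix columns are named by CEL file (oligo's default) are all placeholders.
library(oligo)
setwd(cel.dir)                       # directory with the CEL files, as in your Trial 1
celFile    <- list.files(".")
K          <- 3                      # how many batches each sample should appear in
batch.size <- 350

# K independent random partitions of the files into batches of ~350
set.seed(1)
assignments <- lapply(seq_len(K), function(k) {
  split(sample(celFile), ceiling(seq_along(celFile) / batch.size))
})

# RMA every batch separately and keep the expression matrices
exprs.by.batch <- list()
for (k in seq_len(K)) {
  for (b in seq_along(assignments[[k]])) {
    raw  <- read.celfiles(assignments[[k]][[b]])
    eset <- oligo::rma(raw, target = "core")
    exprs.by.batch[[paste(k, b, sep = "_")]] <- exprs(eset)
  }
}

# for one sample, pull its column from every batch that contains it and check
# how well the batches agree (correlations near 1 = batching is benign)
one.sample <- celFile[1]
vals <- lapply(exprs.by.batch, function(m) if (one.sample %in% colnames(m)) m[, one.sample])
vals <- Filter(Negate(is.null), vals)
cor(do.call(cbind, vals))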
Hmm yeah, that could be an option... it seems like there should be an easier fix though; I was hoping there might be an alternative package or method to try, since the rma() part is literally one line of code haha. It's annoying me that microarrays have been around for so long and I still can't solve this quickly.
It's odd. I've processed microarray data of that size before on a standard laptop with rma, so memory should not be the problem. I don't have suggestions unfortunately, but I would monitor memory usage with top on the command line while you run rma, to see whether that is actually what's failing. Back in the day, when people used arrays routinely and RNA-seq wasn't a thing, hardly any machine had your 350GB of RAM.
Was thinking this too. Maybe the way something is interacting with something else is inefficient; no idea, can't really speculate.
Lots of other packages implement RMA, if you aren't married to a particular one.
maEndToEnd is decent.
Thanks, I checked and unfortunately maEndToEnd's workflow still uses oligo::rma().
Here is a more reproducible example, in case others want to check whether they hit the same issue when normalizing >500-1000 samples from raw CEL files (using GSE88884 or GSE88885).
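Roughly, assuming GEOquery to pull the raw files (the tarball name follows GEO's usual GSEnnnnn_RAW.tar convention and the paths below are just placeholders):
# download the raw CEL files for GSE88884 and rerun the same steps as above
library(GEOquery)
library(oligo)
library(pd.hta.2.0)

getGEOSuppFiles("GSE88884", baseDir = ".")                 # downloads GSE88884/GSE88884_RAW.tar (large!)
untar("GSE88884/GSE88884_RAW.tar", exdir = "GSE88884_CEL")

# read.celfiles should handle the gzipped CELs; if not, gunzip them first
celFile  <- list.files("GSE88884_CEL", pattern = "\\.CEL(\\.gz)?$", full.names = TRUE)
rawbatch <- read.celfiles(celFile)
RMA      <- oligo::rma(rawbatch, target = "core")          # the step that segfaults for me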
If anyone could try this, or has a step-by-step workflow that works on these datasets, please let me know! Thank you.