I'm unsure how to tackle preprocessing for machine learning. Typically, a scaler or normalizer is fitted on the training set alone, and that fitted transform is then applied to both the training and the test data. The point is to keep information from the test set from leaking into training, which would inflate the estimate of model performance.
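To be concrete about the pattern I mean, here is a minimal sketch using scikit-learn (the `StandardScaler` and the toy random matrix are just illustrative stand-ins for whatever transform and expression data you actually have):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for an expression matrix: 100 samples x 4 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit on the training split ONLY: the mean/std parameters are
# estimated without ever seeing the test samples.
scaler = StandardScaler().fit(X_train)

# The same fitted transform is then applied to both splits.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

After this, the training split is exactly zero-mean/unit-variance, while the test split is only approximately so, because its statistics never influenced the fit.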
I'm lost as to how this process can apply to common microarray preprocessing algorithms that rely on information from all arrays in an experiment, such as Robust Multi-array Average (RMA) normalization and ComBat batch-effect correction. I've searched the literature for how this is handled, but papers rarely go further than "Data were preprocessed using X and Y, and batch effects were corrected using Z," followed by the algorithms used for classification.
Are there resources or recommendations on best practices for machine learning with gene expression data that I'm missing? Thanks for any help.