Question

Is Centering and Scaling Essential for Expression Data as Input for Machine Learning Algorithms?

0

15 days ago

Sib ▴ 60

I have performed differential expression analysis using the Limma package on microarray data. Next, I plan to perform feature selection using various methods such as Gini Index, Information Gain, Information Gain Ratio, Rule, Chi Squared Statistic, Tree Importance, Uncertainty, Deviation, Correlation, Relief, SVM, and PCA weighting algorithms. After selecting features, I intend to build models using one of the following algorithms: KNN, Neural Network, SVM, or Random Forest, and evaluate their performance.

My question concerns preparing the input for these ML steps. After downloading the microarray series matrix data and extracting the expression matrix using the exprs() function, is it sufficient to just normalize the data (using normalizeQuantiles()) and apply a log transformation, or is it also necessary to perform standardization (to achieve zero mean and unit variance)?
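For concreteness, here is a minimal sketch of what the log transformation and the standardization step each do. It is only an illustration, not a recommendation. The toy matrix, the train/test split, the pseudocount, and the helper names (`log2_transform`, `fit_scaler`, `standardize`) are all hypothetical, and quantile normalization is left out. The sketch assumes that the mean and standard deviation are estimated per gene from the training samples only and then reused for the held-out samples, so that no information from the test set leaks into preprocessing.

```python
import math
import statistics

# Hypothetical toy matrix: rows are genes, columns are samples (raw intensities).
raw = {
    "geneA": [120.0, 150.0, 980.0, 1100.0],
    "geneB": [64.0, 64.0, 128.0, 256.0],
}
train_idx = [0, 1, 2]  # assumed training samples
test_idx = [3]         # assumed held-out sample


def log2_transform(values, pseudocount=1.0):
    """Return log2(x + pseudocount); the pseudocount (an assumption) guards against log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def fit_scaler(train_values):
    """Estimate the mean and sample standard deviation from training samples only."""
    return statistics.fmean(train_values), statistics.stdev(train_values)


def standardize(values, mean, sd):
    """Center and scale with previously fitted parameters; constant genes map to 0."""
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


scaled = {}
for gene, values in raw.items():
    logged = log2_transform(values)
    train = [logged[i] for i in train_idx]
    mean, sd = fit_scaler(train)
    scaled[gene] = {
        "train": standardize(train, mean, sd),
        "test": standardize([logged[i] for i in test_idx], mean, sd),
    }

for gene, parts in scaled.items():
    print(gene,
          [round(z, 3) for z in parts["train"]],
          [round(z, 3) for z in parts["test"]])
```

After this step each gene has zero mean and unit sample variance across the training samples. Whether that matters depends on the learner: distance- and margin-based methods such as KNN and SVM are sensitive to feature scale, whereas tree-based methods such as Random Forest are largely unaffected by it.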

I came across a discussion on BioStars titled "What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?", where Kevin Blighe mentions normalization as a preprocessing step but does not discuss centering and scaling. I wonder whether he omitted them only for brevity, or whether scaling and centering are not considered essential steps when preparing microarray and RNA-seq data as input for machine learning algorithms.

I would greatly appreciate any clarification on this matter and a reference to support the answer.

featureselection microarray preprocessing • 370 views

ADD COMMENT • link 12 days ago by Sib ▴ 60

1

I'll put this in a comment since I am mostly guessing here. I'd say that if the data comes from a single experiment and was all normalized together, then it is already scaled appropriately and no centering is required. If the data were collected across substantially different treatments, setups, etc., then you would need to postprocess it in multiple ways.

ADD REPLY • link 14 days ago by Istvan Albert 102k

0

Yes, the data is from a single experiment.

ADD REPLY • link 14 days ago by Sib ▴ 60

0

Sorry, is a log2 transformation also needed? Or can we apply ML methods to the expression data without it?

ADD REPLY • link 12 days ago by Sib ▴ 60