I have performed differential expression analysis using the Limma package on microarray data. Next, I plan to perform feature selection using various methods such as Gini Index, Information Gain, Information Gain Ratio, Rule, Chi Squared Statistic, Tree Importance, Uncertainty, Deviation, Correlation, Relief, SVM, and PCA weighting algorithms. After selecting features, I intend to build models using one of the following algorithms: KNN, Neural Network, SVM, or Random Forest, and evaluate their performance.
My question concerns preparing the input for these ML steps. After downloading the microarray series matrix data and extracting the expression matrix using the exprs() function, is it sufficient to just normalize the data (using normalizeQuantiles()) and apply a log transformation, or is it also necessary to perform standardization (to achieve zero mean and unit variance)?
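To make the two options concrete, here is a minimal Python sketch (standard library only) of what "log transformation" and "standardization" would each do to a toy expression matrix. It is an illustration under assumptions, not the Limma/R pipeline itself: the function names `log2_transform`, `fit_scaler` and `apply_scaler`, the `offset` of 1, and the toy values are all hypothetical. It assumes rows are samples and columns are genes, and that the per-gene mean and standard deviation are estimated on the training samples only and then reused for the test samples.

```python
import math
import statistics


def log2_transform(matrix, offset=1.0):
    """Return log2(x + offset) for every value (rows = samples, columns = genes)."""
    return [[math.log2(v + offset) for v in row] for row in matrix]


def fit_scaler(train):
    """Estimate per-gene mean and sample standard deviation from training samples."""
    columns = list(zip(*train))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return means, sds


def apply_scaler(matrix, means, sds):
    """Centre and scale each gene; genes with zero variance are mapped to 0."""
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in matrix
    ]


# Hypothetical raw intensities: 3 training samples and 1 test sample, 2 genes.
train_raw = [[15.0, 255.0], [31.0, 511.0], [63.0, 1023.0]]
test_raw = [[31.0, 255.0]]

train_log = log2_transform(train_raw)
test_log = log2_transform(test_raw)

# Scaling parameters come from the training samples only.
means, sds = fit_scaler(train_log)
train_scaled = apply_scaler(train_log, means, sds)
test_scaled = apply_scaler(test_log, means, sds)

print(train_scaled)
print(test_scaled)
```

After the log step the two genes still sit at different levels (around 5 and around 9 here); after standardization each gene has zero mean and unit variance across the training samples. Whether that second step matters presumably depends on the learner: distance-based and gradient-based methods such as KNN, SVM and neural networks are sensitive to feature scale, while tree-based methods such as Random Forest are largely not.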
I came across a discussion on BioStars titled "What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?", where Kevin Blighe mentions normalization as a preprocessing step but does not discuss centering and scaling. I wonder whether this is simply because the answer was a summary, or because scaling and centering are not considered essential steps for preparing microarray and RNA-seq data as input for machine learning algorithms.
I would greatly appreciate any clarification on this matter and a reference to support the answer.
I'll put this in a comment since I am mostly guessing here. But I'd say that if the data come from a single experiment and were all normalized together, then the data are scaled appropriately and no centering is required. If the data were collected across substantially different treatments, setups, etc., then you would need to postprocess in multiple ways.
Yes, the data is from a single experiment.
Sorry, is log2 transformation also needed? Or can we apply ML methods to the expression data without it?
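As a small illustration of what the log2 step changes, the Python sketch below compares a doubling and a halving of a hypothetical intensity of 100 on the raw scale and on the log2 scale. The numbers are made up for the example and are not taken from any dataset.

```python
import math

# Hypothetical raw intensities for one gene.
baseline = 100.0
doubled = 200.0   # two-fold up
halved = 50.0     # two-fold down

# On the raw scale the two changes have different magnitudes.
raw_up = doubled - baseline
raw_down = halved - baseline

# On the log2 scale a two-fold change is about one unit in either direction.
log_up = math.log2(doubled) - math.log2(baseline)
log_down = math.log2(halved) - math.log2(baseline)

print(raw_up, raw_down)
print(log_up, log_down)
```

On the raw scale the up and down changes are +100 and -50, while on the log2 scale they are symmetric at roughly +1 and -1. It is also worth checking the range of the values first, since a matrix that has already been log-transformed upstream should not be transformed a second time.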