Question

Is a log2 transformation an essential step in preparing expression data for machine learning?

4 days ago

Sib ▴ 60

I have performed differential expression analysis using the Limma package on one microarray experiment. Next, I plan to perform feature selection using various methods such as Gini Index, Information Gain, Information Gain Ratio, Rule, Chi Squared Statistic, Tree Importance, Uncertainty, Deviation, Correlation, Relief, SVM, and PCA weighting algorithms. After selecting features, I intend to build models using one of the following algorithms: KNN, Neural Network, SVM, or Random Forest, and evaluate their performance.

My question concerns preparing the input for these ML steps. Is a log2 transformation an essential step in preparing the expression data for machine learning, or can the normalized expression data be fed directly into ML methods without log2 transformation? I would greatly appreciate any clarification on this matter.

Machine-learning preprocessing microarray • 358 views

updated 1 day ago by Ram 45k • written 4 days ago by Sib ▴ 60

You're shooting buzzwords, as in your previous question. Please read the underlying literature and follow guided tutorials. There is no general answer to this. log2 is often preferable because it dampens variance that arises simply from count magnitude, but what that means for each particular method depends on how that method works.

4 days ago by ATpoint 87k

I would greatly appreciate it if you could refer me to a tutorial on conducting feature selection for microarray data using some (or all) of the mentioned methods (e.g., Gini Index, Information Gain, SVM-based selection, etc.). Unfortunately, I wasn't able to find one. Thank you!

4 days ago by Sib ▴ 60

score 0 · Answer 1 · 2025-04-13

Is a log2 transformation an essential step in preparing the expression data for machine learning, or can the normalized raw expression data also be input directly into ML methods without log2 transformation?

The answer to the first question is no. To the second, it depends.

To this day I have no idea why log2 transformation is the default in expression analysis when more flexible power transformations exist. Box-Cox estimates its power parameter λ from the data, so it adapts to skewed data better than a fixed log2 or square root transformation; both are special cases of Box-Cox up to a linear rescaling (λ = 0 gives the logarithm, λ = 0.5 the square root). Note that Box-Cox requires strictly positive values. You can read more about it here. But yes, some kind of power transformation may be needed.
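To make the relationship between log2, square root, and Box-Cox concrete, here is a minimal sketch in plain Python. It assumes strictly positive intensities; the simulated data, the grid of candidate λ values, and the helper names (`box_cox`, `profile_loglik`, `fit_lambda`) are illustrative choices, not part of any package. In practice a library routine such as `scipy.stats.boxcox` would normally do the fitting.

```python
import math
import random

def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox is defined only for strictly positive values")
    if lam == 0:
        return math.log(x)  # limiting case as lam -> 0
    return (x ** lam - 1.0) / lam

def profile_loglik(values, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    n = len(values)
    y = [box_cox(v, lam) for v in values]
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n  # maximum-likelihood variance
    return -0.5 * n * math.log(var_y) + (lam - 1.0) * sum(math.log(v) for v in values)

def fit_lambda(values, grid=None):
    """Pick the lam on a grid that maximizes the profile log-likelihood."""
    if grid is None:
        grid = [i / 100 for i in range(-200, 201)]  # assumed search range: -2 to 2
    return max(grid, key=lambda lam: profile_loglik(values, lam))

# Simulated right-skewed "intensities" (hypothetical data, not a real array)
rng = random.Random(0)
intensities = [rng.lognormvariate(6.0, 1.0) for _ in range(200)]

best_lam = fit_lambda(intensities)
print("fitted lambda:", best_lam)
print("log-likelihood at fitted lambda:", profile_loglik(intensities, best_lam))
print("log-likelihood at lambda = 0 (log):", profile_loglik(intensities, 0.0))
print("log-likelihood at lambda = 0.5 (sqrt):", profile_loglik(intensities, 0.5))

# log2 is the lam = 0 case divided by the constant ln(2)
x = intensities[0]
print(math.isclose(box_cox(x, 0) / math.log(2), math.log2(x)))
```

Because the grid contains both 0 and 0.5, the fitted λ can never score worse than the log or square root transformation on this criterion. Whether the gain is large enough to matter depends on the data.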

Tree-based machine learning methods, specifically random forests and gradient-boosted trees, do not care about the scale of the features: their splits depend only on the ordering of values, which any monotonic transformation such as log2 preserves. That means normalized, untransformed data would be acceptable. Neural networks and SVMs work best with small absolute values, so scaling the data to a small value range would be necessary. Because expression data are skewed, a power transformation would be more desirable than simple scaling. Finally, linear models do not strictly require normally distributed inputs (the normality assumption applies to the residuals), but strongly skewed features tend to hurt them, so a power transformation is usually helpful there as well. Log2 or square root transformations might work well enough in most cases, but at best they bring the data as close to normality as a Box-Cox transformation does. Generally speaking, Box-Cox with a fitted power parameter will do at least as well as either of them.
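The point about trees can be checked with a toy example. The sketch below is a hand-written single-split "stump" using Gini impurity, not the random forest implementation of any library, and the values and labels are made up for illustration. It shows that the best split partitions the samples identically on raw and log2 values, while a distance-based comparison (as used by KNN) can change.

```python
import math

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return the sample indices sent left by the Gini-optimal threshold."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = math.inf, None
    for k in range(1, len(order)):
        if values[order[k - 1]] == values[order[k]]:
            continue  # no threshold fits between tied values
        left, right = order[:k], order[k:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

def nearest(values, query_index):
    """Index of the closest other sample in one dimension."""
    others = [i for i in range(len(values)) if i != query_index]
    return min(others, key=lambda i: abs(values[i] - values[query_index]))

# Hypothetical expression values for one gene and made-up class labels
raw = [35.0, 60.0, 120.0, 900.0, 4000.0, 22000.0]
labels = [0, 0, 1, 0, 1, 1]
logged = [math.log2(v) for v in raw]

# Same partition of samples, because log2 preserves the ordering
print(best_split(raw, labels) == best_split(logged, labels))

# Distances are not preserved: the nearest neighbour of 400 changes
points = [100.0, 400.0, 900.0]
print(nearest(points, 1), nearest([math.log2(v) for v in points], 1))
```

Only the threshold value differs between the two tree fits, not the grouping of samples. For KNN, SVMs, and neural networks the choice of transformation changes the geometry of the data, so it is worth comparing alternatives by cross-validation.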