Proper preprocessing for ML after limma, quantile normalization and log2 transformation: Is standardization still necessary?
19 days ago • Sib ▴ 70

Hi everyone,

I’ve done differential expression analysis on microarray data using limma. After extracting the expression matrix with exprs(), I applied quantile normalization and log2 transformation.

Now I plan to perform feature selection (e.g., Gini Index, Information Gain, Information Gain Ratio, Rule, Chi Squared Statistic, Tree Importance, Uncertainty, Deviation, Correlation, Relief, SVM, and PCA weights) and build ML models like KNN, SVM, Neural Network, and Random Forest.

One ML expert suggested that if the data is from a single experiment and already quantile-normalized, no further centering or scaling is needed.

However, ML algorithms like KNN, SVM, and NN are sensitive to feature scale, and PCA-based feature extraction is sensitive to feature variance. Given that, shouldn't I still apply z-score standardization (zero mean, unit variance) before feature selection and model building?
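For concreteness, this is the z-score standardization I have in mind (a minimal scikit-learn sketch; `X` is a placeholder, not my real expression matrix):

```python
# A minimal sketch of z-score standardization with scikit-learn.
# X is a placeholder for a samples-by-genes expression matrix.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=7, scale=2, size=(30, 500))  # placeholder data

scaler = StandardScaler()          # zero mean, unit variance per feature
X_std = scaler.fit_transform(X)    # in a real pipeline, fit on training data only

print(X_std.mean(axis=0)[:3])      # ~0 for each feature
print(X_std.std(axis=0)[:3])       # ~1 for each feature
```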

Is the expert’s advice incorrect for ML workflows, even if valid for DE analysis?

Thanks for your insights!

featureselection microarray preprocessing

That question was about whether log transformation is needed. This question is about whether standardization is needed.

19 days ago • Mensur Dlakic ★ 29k

However, ML algorithms like KNN, SVM, and NN are sensitive to feature scale, and PCA-based feature extraction is sensitive to feature variance. Given that, shouldn't I still apply z-score standardization (zero mean, unit variance) before feature selection and model building?

Same answer as before, which GenoMax already pointed out to you. NNs and SVMs work only with small ranges of data, so scaling / normalization must be done. If you are not sure what that means, say [-10, 10] would be a fairly large range. Tree-based methods don't care about the scale of data, which means raw counts would be acceptable.
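As a rough illustration of that split (a scikit-learn sketch on random placeholder data, not a recipe for your dataset): scale-sensitive models get a scaler in the pipeline, tree ensembles do not.

```python
# Scale-sensitive models (SVM, KNN, NN) get a scaler in the pipeline;
# tree ensembles are fit on the unscaled values directly.
# X and y are random placeholders, not real expression data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))
y = rng.integers(0, 2, size=60)

svm = make_pipeline(StandardScaler(), SVC())   # scale first, then fit
svm.fit(X, y)

rf = RandomForestClassifier(random_state=0)    # scale-invariant, no scaler needed
rf.fit(X, y)
```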

PCA works properly only with normally distributed data, which you seem to know. How you best bring the data toward normality depends on the skew. If there is a large skew, which is usually the case, a power transformation is needed. I recommend a Box-Cox transformation, but a log2 transformation would likely work as well. If there is no large skew, what you call z-score standardization would be fine.
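If it helps, a minimal Box-Cox sketch with scipy (placeholder data; note Box-Cox requires strictly positive input, which is one reason it belongs on raw intensities):

```python
# Box-Cox with scipy; `raw` is a skewed placeholder, not real data.
import numpy as np
from scipy import stats

raw = np.random.default_rng(0).lognormal(mean=5, sigma=1, size=1000)

transformed, lam = stats.boxcox(raw)   # lambda fit by maximum likelihood
print(f"estimated lambda: {lam:.2f}")  # lambda near 0 behaves like a log transform
```

scikit-learn's PowerTransformer(method='box-cox') does the same thing per feature and standardizes the output by default.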

If something I wrote originally was not clear, you could have asked a follow-up question rather than opening a new thread.


Thank you for the detailed explanation. You mentioned that [-10, 10] is a large range. Initially, my data spanned [-6, 17]. After quantile normalization, the range shifted to [-2, 14] (not a great reduction in range). I then applied z-score standardization, which expanded the range to [-10, 7] (again not a great reduction)! This suggests z-score may not be ideal for standardization here. My follow-up questions are:

What scaling method would better reduce the range?

  1. Min-max normalization could help, but it is sensitive to outliers. Should I remove outliers first (e.g., using IQR or percentile thresholds) before applying min-max?
  2. Are there robust scaling alternatives that handle this better? I sketch one idea below.
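For example, something like scikit-learn's RobustScaler, which centers on the median and scales by the IQR (a minimal sketch with placeholder data, not my real matrix):

```python
# RobustScaler: (x - median) / IQR per feature, so outliers pull on the
# scaling far less than with min-max. X is a placeholder matrix.
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.random.default_rng(0).normal(loc=5, scale=2, size=(30, 500))

X_robust = RobustScaler().fit_transform(X)
```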


I do data normalization almost daily, but not with this type of data. More useful advice may come from someone who works with this kind of data often.

I am not sure what you are doing here, but generally speaking there is no sense in applying one transformation on top of the other. You say the initial data is [-6, 17], which I presume is after quantile normalization and log2 transformation. There should not be a third transformation (a quantile normalization) applied on top of it. That's why you are not getting much benefit from subsequent transformations, because the data is already transformed.

I already told you a couple of times about the best transformation (Box-Cox), but apparently that didn't make enough of an impression for you to consider it. Maybe it will help to read more details about it here. Box-Cox should be applied to raw counts, with no pre- or post-processing steps, and hopefully it will do the trick.

That initial [-6, 17] range is probably okay for NNs assuming it was correctly calculated, meaning by applying a single transformation. It might be okay for SVMs, though I would probably min-max scale that to [-1, 1]. The only reason I am suggesting min-max here as a secondary processing step is because SVM's kernel function gets bogged down when working with large numbers.
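Something like this (a scikit-learn sketch; `X` is a placeholder assumed to already carry a single transformation):

```python
# Min-max scaling to [-1, 1] as a secondary step before an SVM.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.random.default_rng(0).uniform(-6, 17, size=(30, 500))  # placeholder

X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
print(X_scaled.min(), X_scaled.max())   # -1.0 and 1.0 on the fit data
```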

Min-max is not good for raw data because of large skew, so it shouldn't be an option for expression data. Removing outliers won't change that. The outliers can always be removed after applying power transformations - if warranted.


Unfortunately, I cannot use raw data because I am teaching students who are not yet able to analyze it. Instead, we use a series-matrix dataset that has undergone some preprocessing by the data submitter. However, the exact nature of that preprocessing is unclear.

I am attaching the box plot of the series matrix with the [-6, 17] range here. Based on the distribution, I suspect the data has been log-transformed, but it may still require normalization.
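For reference, the plot was generated along these lines (a matplotlib sketch; `expr` is a stand-in for the series-matrix DataFrame with one column per sample):

```python
# Per-sample box plot of a genes-by-samples expression frame.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

expr = pd.DataFrame(np.random.default_rng(0).normal(5, 2, size=(500, 12)),
                    columns=[f"GSM{i}" for i in range(1, 13)])  # placeholder

expr.boxplot(rot=90)
plt.ylabel("expression (presumably log2)")
plt.tight_layout()
plt.show()
```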

Do you think it would be better to: 

  1. Use the data directly in the machine learning methods I mentioned, or
  2. Apply additional preprocessing such as Box-Cox transformation, quantile normalization, or outlier removal followed by min-max scaling to [-1, 1]?

Alternatively, do you have any other suggestions for further preprocessing of this preprocessed data?


I have no good idea how to work with data of unknown origin and unknown pre-processing steps.

Somehow, you seem to be reading my answers selectively and I have to repeat everything at least twice.

Your first question was already answered:

That initial [-6, 17] range is probably okay for NNs assuming it was correctly calculated, meaning by applying a single transformation. It might be okay for SVMs, though I would probably min-max scale that to [-1, 1]. The only reason I am suggesting min-max here as a secondary processing step is because SVM's kernel function gets bogged down when working with large numbers.

Your second question was already answered:

I am not sure what you are doing here, but generally speaking there is no sense in applying one transformation on top of the other. You say the initial data is [-6, 17], which I presume is after quantile normalization and log2 transformation. There should not be a third transformation (a quantile normalization) applied on top of it. That's why you are not getting much benefit from subsequent transformations, because the data is already transformed.

There is really nothing else I can add here given the available information.


No, that’s not correct. I carefully read your response multiple times (at least 10 times) before formulating my follow-up questions based on your comments.

You stated:

"That initial [-6, 17] range is probably okay for NNs"

Based on this, I decided not to apply additional preprocessing and instead use the original data for the neural network (NN). However, when I examined the box plot of the initial data, I noticed that the boxes were not aligned across samples.

In microarray analysis pipelines, particularly for identifying differentially expressed genes, quantile normalization is typically applied when we have to use preprocessed series-matrix data with unknown preprocessing steps and the box plots are unaligned (I sketch what I mean below). However, I'm uncertain whether this is standard practice in machine learning for similar cases.
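For clarity, this is the kind of quantile normalization I mean (a minimal, mean-based pandas sketch on a placeholder genes-by-samples frame):

```python
# Quantile normalization: force every sample (column) onto the same
# empirical distribution. `expr` is a placeholder genes-by-samples frame.
import numpy as np
import pandas as pd

expr = pd.DataFrame(np.random.default_rng(0).normal(5, 2, size=(500, 6)))

# mean expression at each rank position, averaged across samples
rank_mean = expr.stack().groupby(
    expr.rank(method="first").stack().astype(int)).mean()

# replace each value by the mean of its rank position
expr_qn = expr.rank(method="min").stack().astype(int).map(rank_mean).unstack()
```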

But you mentioned:

"There should not be a third transformation (e.g., quantile normalization) applied on top of it."

Given this, my question is: Should I proceed with the initial data for the NN despite the unaligned box plots? Is it acceptable to use the data directly and disregard the misalignment?

Additionally, I'd greatly appreciate your insights on how to handle this issue for KNN, PCA, and Relief.


If you really believe that holding a user at gunpoint with bold font and quotes like "you said that... but then you said that..." is a productive approach, then I fear you will lose support here very soon.

Look, people here give generic advice. It is entirely on you to then dig into the theory behind all this, incorporate the advice (or not, that's just as fine), and turn all of that into your final analysis code. Nobody here is responsible for your work, and you cannot expect hands-on guidance. Consider taking a step back and reading about the fundamentals of the methods you aim to apply.


Additionally, I'd greatly appreciate your insights on how to handle this issue for KNN, PCA, and Relief.

I again refer you to what I already wrote about PCA. There is nothing else I can add to that subject when dealing with previously modified data, or to KNN and Relief methods.

PCA works properly only with normally distributed data, which you seem to know. How you best bring the data toward normality depends on the skew. If there is a large skew, which is usually the case, a power transformation is needed. I recommend a Box-Cox transformation, but a log2 transformation would likely work as well. If there is no large skew, what you call z-score standardization would be fine.

You can safely assume that I have nothing else to contribute on this subject beyond referring you to my earlier comments.
