Hello all,
We are trying to feed microarray and RNA-Seq gene expression datasets of the same cancer type into a machine learning pipeline, and we are looking for a method that could align (or come close to aligning) the medians and interquartile ranges of the different datasets. Each dataset is already well normalized individually, so we want to extend this to a cross-dataset normalization. Ideally, we would fit some sort of scaler on a training dataset and use it to transform a new testing dataset into a compatible one (similar range and median) for classification testing. The final model should be able to classify both transformed microarray data and transformed RNA-Seq data.
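For concreteness, the pattern we have in mind is the usual scikit-learn fit/transform split. A minimal sketch with synthetic stand-ins for our matrices (RobustScaler is just a placeholder here, chosen because it centers on the median and scales by the IQR, the two statistics we want to align):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X_train_rnaseq = rng.lognormal(2.0, 1.0, size=(100, 500))   # samples x genes, synthetic stand-in
X_test_microarray = rng.normal(8.0, 2.0, size=(40, 500))    # different platform, different scale

# fit the scaler on the training dataset only...
scaler = RobustScaler().fit(X_train_rnaseq)

# ...then reuse the fitted object to map unseen data into the training range
X_test_mapped = scaler.transform(X_test_microarray)
```

Note that RobustScaler works feature-wise and uses the training statistics, so this only shows the mechanics; it does not by itself guarantee that the transformed test data lands on the training median/IQR.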
We have already tried several approaches: quantile transformation (sklearn.preprocessing.QuantileTransformer), the nonparanormal transformation (with both the shrunken and the truncated ECDF, available in the huge R package), mapping to a normal distribution with the Yeo-Johnson power transform, which accepts zero values (sklearn.preprocessing.PowerTransformer), standardization/mean removal (sklearn.preprocessing.StandardScaler), scikit-learn's l2-norm normalization (sklearn.preprocessing.normalize), and even simple min-max scaling, which, oddly, showed one of the best cross-dataset test performances. However, none of these methods aligned the medians when each dataset was transformed independently, and only some of them produced similar interquartile ranges (judged subjectively from boxplots).
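To make the boxplot comparison less subjective, a small diagnostic like the following (our own hypothetical helper, with synthetic data standing in for the two platforms) can quantify how far apart the per-sample medians and IQRs remain after independent transformation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def per_sample_summary(X):
    """Median and interquartile range of each sample (row)."""
    med = np.median(X, axis=1)
    q1, q3 = np.percentile(X, [25, 75], axis=1)
    return med, q3 - q1

rng = np.random.default_rng(1)
A = rng.lognormal(2.0, 1.0, size=(50, 500))   # stand-in for an RNA-Seq dataset
B = rng.normal(8.0, 2.0, size=(50, 500))      # stand-in for a microarray dataset

# transform each dataset independently, as in our experiments
A_t = StandardScaler().fit_transform(A)
B_t = StandardScaler().fit_transform(B)

for name, X_t in [("A", A_t), ("B", B_t)]:
    med, iqr = per_sample_summary(X_t)
    print(f"{name}: median of sample medians = {np.median(med):.3f}, "
          f"median of sample IQRs = {np.median(iqr):.3f}")
```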
Some of our current questions are:
- can we even compare these two technologies like that?
- which method should we choose for transforming/normalizing the data?
- and should we expect aligned boxplots for samples of the same class but coming from different datasets? (I mean, is there a method capable of doing so?)
@Mensur Dlakic, if you know of any, would you please suggest a couple of reading/training materials on applying classifiers (like RF) to genomic data? Google can surely help, but I would like ideas from an experienced fellow. Thanks.
I have never built a classifier specifically for RNA-Seq or microarray data, but there should be no major differences from any other data type. As long as you set up proper cross-validation, random forests usually do not overfit and tend to work out of the box. If you want to squeeze out the last bit of performance, extreme boosting methods such as xgboost and LightGBM can do even better, but they require greater care and some hyperparameter optimization.
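As a rough illustration, a baseline along these lines (with a synthetic matrix standing in for real expression data) is usually enough to get started:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1000))     # samples x genes, synthetic stand-in
y = rng.integers(0, 2, size=200)     # binary class labels

# stratified CV keeps the class balance in every fold
clf = RandomForestClassifier(n_estimators=500, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```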
Hello Professor Dlakic, first of all, thanks for your answer.

I'm afraid I might have miscommunicated our needs. We already have a decent classification model (using a Gradient Boosting Classifier and also an SVM) for each dataset in isolation, validated by cross-validation. The individual data transformations are, I think, well executed, and the data preprocessing looks pretty good. By the way, the microarray datasets came specifically from a study that published an extensively curated microarray database for ML benchmarking.
However, what we really want is to train a model on, say, an RNA-Seq dataset and then use that trained model to classify samples coming from different datasets, which may originate from RNA-Seq as well as microarray experiments. So the transformation method should be able to "map" the new data to the same IQR/median as the training data (at least, that is what we thought, and we have now started to question it).
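For what it is worth, one transformation that does force new samples onto exactly the training median and IQR is quantile normalization against a training-derived reference distribution. Here is a minimal sketch (rows are samples, columns are genes; this is generic quantile normalization, not a method we have validated for the cross-platform case):

```python
import numpy as np

def fit_reference(train):
    """Average of the sorted expression values across training samples."""
    return np.sort(train, axis=1).mean(axis=0)

def quantile_map(X, reference):
    """Replace each sample's values by the reference quantiles at the same ranks."""
    ranks = X.argsort(axis=1).argsort(axis=1)   # per-sample ranks, 0..n_genes-1
    return reference[ranks]

rng = np.random.default_rng(7)
train = rng.lognormal(2.0, 1.0, size=(100, 500))   # synthetic training set
test = rng.normal(8.0, 2.0, size=(40, 500))        # synthetic unseen set

ref = fit_reference(train)
test_mapped = quantile_map(test, ref)
# every mapped sample is now a permutation of the reference values,
# so its median and IQR match the training reference by construction
```

The obvious caveat is that this forces every sample onto an identical distribution, so whether it erases biological differences we care about brings us back to our first question: can the two technologies even be compared this way?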
As we know, not only do these technologies work differently, but there is also variation within a technology (e.g., different microarray platforms). There are already scientific efforts to understand how to eliminate these lab-specific differences without disturbing the biological information in the data. We reviewed a few studies investigating cross-platform normalization techniques, but we were still unable to achieve the similar IQR/median between datasets that I mentioned above. Maybe, as the preliminary results suggest, we should forget about this and simply min-max the preprocessed data?
I have no first-hand experience with this kind of data. That said, it seems reasonable to repeat with future data the same standardization procedure that was already used to create the benchmarking dataset.
I would not rely on the fact that MinMax scaling happens to work on some unseen data you have tried. As you know, the sigmoid is relatively insensitive to small changes in some parts of the curve, but there is a region where it rises steeply. It may be that the datasets you have tried so far were on a similar enough scale to your training data that a simple MinMax was sufficient; I would hesitate to extrapolate that to all future datasets. At the very least, I would try subtracting the mean and scaling the variance, though you may need to re-train the original classifier using the same approach.
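In code, one reading of that suggestion is to apply the same recipe (not the same fitted scaler) to each dataset and re-train on the standardized training data. A sketch under that assumption, with synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_train = rng.lognormal(2.0, 1.0, size=(150, 500))   # e.g. the RNA-Seq training set
y_train = rng.integers(0, 2, size=150)
X_new = rng.normal(8.0, 2.0, size=(60, 500))         # e.g. an unseen microarray set

# standardize each dataset with its *own* statistics, so both end up
# zero-mean / unit-variance per gene regardless of platform scale
X_train_std = StandardScaler().fit_transform(X_train)
X_new_std = StandardScaler().fit_transform(X_new)

# re-train the classifier on the standardized training data,
# then apply it to the identically treated new dataset
clf = GradientBoostingClassifier(random_state=0).fit(X_train_std, y_train)
preds = clf.predict(X_new_std)
```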