Traditional machine learning workflows on gene expression data apply filtering and feature selection, then fit a classifier (RF, SVM, or an ensemble) on the selected features. Each dataset is handled from scratch.
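For concreteness, that standard single-dataset pipeline looks roughly like this (a minimal sklearn sketch; the filter threshold, the value of k, and the choice of RF are arbitrary placeholders, not a recommendation):

```python
# Rough sketch of the usual single-dataset workflow (placeholder thresholds/models).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X: samples x genes expression matrix, y: class labels (toy stand-ins here)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))
y = rng.integers(0, 2, size=100)

pipeline = Pipeline([
    ("filter", VarianceThreshold(threshold=0.1)),   # crude expression filter
    ("select", SelectKBest(f_classif, k=100)),      # univariate feature selection
    ("clf", RandomForestClassifier(n_estimators=500, random_state=0)),
])

# Everything is learned from this one dataset and discarded afterwards.
print(cross_val_score(pipeline, X, y, cv=5).mean())
```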
I am not trying to combine different datasets (to avoid batch effects), but rather to develop several models (i.e., collect subsets of genes that predict well together under particular models) and keep training on several datasets until I end up with a set of models (each containing a different subset of genes, different parameters, and possibly a different model type) that can be used in any classification problem on gene expression data. Is this approach new? Do you have any suggested ideas to try? This is not typical transfer learning, but any help on how to carry information from one gene expression dataset to another would be really appreciated.
NB:
My question is not about integrating different genomic data types from the same person. It is about developing a model on one dataset and, to overcome small sample sizes, retraining it on another classification problem from another dataset (by retraining I mean taking the model name, e.g. RF, and the set of genes, e.g. the names of the ten genes, that worked well together, and reusing them), and continuing until we have a set of mature heuristics. All the datasets I am referring to are gene expression data (e.g. from TCGA). For example, my algorithm works like this:
1- I find the top 100 most important genes using RF on a breast cancer dataset, and I run several RF classifiers, each using only ten of those genes (see the code sketch after this list).
2- I repeat the same step on another dataset, e.g. a colon cancer dataset, and on several other datasets.
3- I take the best five classifiers from each dataset (here I mean the model name and the genes used, as heuristics, not the fitted model itself), run all of them on a new dataset, and keep iterating and improving.
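Roughly, steps 1-2 plus the "keep the best five" part of step 3 could look like this for a single dataset (only a sketch; the subset sizes, model settings, and function name are placeholders):

```python
# Sketch of steps 1-2: rank genes by RF importance, evaluate 10-gene subsets,
# and keep the best (model name, gene subset) pairs as "heuristics".
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def mine_heuristics(X, y, gene_names, top_k=100, group_size=10, keep=5):
    """Return the `keep` best (model name, gene subset) pairs for one dataset."""
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    ranked = np.argsort(rf.feature_importances_)[::-1][:top_k]   # top 100 genes

    candidates = []
    for start in range(0, top_k, group_size):                    # 10-gene subsets
        idx = ranked[start:start + group_size]
        score = cross_val_score(
            RandomForestClassifier(n_estimators=200, random_state=0),
            X[:, idx], y, cv=5,
        ).mean()
        candidates.append((score, ("RF", [gene_names[i] for i in idx])))

    candidates.sort(key=lambda c: c[0], reverse=True)
    return [heuristic for _, heuristic in candidates[:keep]]     # best five per dataset

# Repeat over several datasets (breast, colon, ...) and pool the heuristics:
# library = []
# for X, y, gene_names in datasets:
#     library.extend(mine_heuristics(X, y, gene_names))
```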
The expected outcome would be a specific recipe for using gene expression data for classification. Instead of filtering, we would say, for example: take genes 40, 671, 899 and apply RF to them; take genes 55, 1000, 242 and apply logistic regression to them; take genes 44, ..., 555 and apply some other algorithm to them. Then, on a new classification problem, run cross-validation over these models to find the accurate ones.
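The "run cross-validation over these models" part on a new dataset could look roughly like this (again only a sketch; `library`, `evaluate_library`, and `gene_index` are hypothetical names, and only RF and logistic regression are wired in):

```python
# Sketch: score every collected heuristic on a new dataset with cross-validation
# and keep the ones that transfer well. `library` is the pooled list of
# (model_name, gene_list) pairs collected from the previous datasets.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

MODELS = {
    "RF": lambda: RandomForestClassifier(n_estimators=200, random_state=0),
    "LR": lambda: LogisticRegression(max_iter=1000),
}

def evaluate_library(library, X_new, y_new, gene_index):
    """gene_index maps gene names to column positions in the new expression matrix."""
    results = []
    for model_name, genes in library:
        cols = [gene_index[g] for g in genes if g in gene_index]
        if len(cols) < len(genes):
            continue                      # skip heuristics whose genes are missing here
        score = cross_val_score(MODELS[model_name](), X_new[:, cols], y_new, cv=5).mean()
        results.append((score, model_name, genes))
    return sorted(results, reverse=True)  # best-transferring heuristics first

# top_heuristics = evaluate_library(library, X_new, y_new, gene_index)[:5]
```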
Since these models (heuristics) are based on information from several previous training runs, they should outperform models trained only on the dataset at hand, especially when that dataset is very small.
Is this approach valid? Is it already known in bioinformatics under another name?
My wording may not be the best, so please ask for clarification if any point is unclear.
Well, it is valid and totally feasible, but have you looked into the 2018 TCGA / Pan-Cancer publications? There has been similar work based on deep learning, and on other ML approaches, for finding signature pathways or associating tumor subtype classifications. These works mostly use ML to find gene signatures that can place tumors into specific molecular buckets. If I am not mistaken, that would also lead to what you are trying to achieve, i.e. classifying the cancer types. How is your approach different from the papers below? As far as I understand, you are doing both classification and outcome/subtype prediction in different tumors via gene expression datasets.
Another thing: when you say you want to classify, do you mean molecular classification or tumor vs. non-tumor classification? From your explanation, it seems the approach will mainly produce signatures that point to specific pathways, i.e. it will learn to detect cancer signature pathways or exclusive pathways. I do not think you can do much more than that with gene expression data alone. Still, take a look at the papers below, since they follow similar lines of work to what you describe.
Paper 1
Paper 2
Paper 3