What is the best way to apply transfer learning to gene expression data?
5.7 years ago
elmahy2005 ▴ 150

Traditional machine learning workflows on gene expression data apply several filtering and feature-selection approaches, then run a classifier (RF, SVM, or an ensemble) on the selected features. They treat each dataset from scratch.

I am trying not to combine different datasets (to avoid batch effects), but rather to develop several models (collecting subsets of genes that predict well together), and to keep training on several datasets until I end up with a set of models, each containing a different subset of genes, different parameters, and probably a different model type, to be used on any classification problem on gene expression data. Is this approach new? Do you have any ideas I should try? This is not typical transfer learning, but any help on how to inherit information from one gene expression dataset to another would be really appreciated.

NB:

My question is not about integrating different genomic data types from the same person. It is about developing a model on one dataset and, to overcome small sample sizes, retraining the model on another classification problem from another dataset (by "retrain" I mean taking the model name, e.g. RF, and the set of genes, i.e. the names of the ten genes that worked together, and reusing them), and repeating this until we have a set of mature heuristics. All the datasets I am referring to are gene expression data (e.g. from TCGA). For example, my algorithm is as follows:

1- I find the top 100 important genes using RF on a breast cancer dataset, and I run several RF classifiers, each with only ten of those genes.

2- I repeat the same step on another dataset, e.g. a colon cancer dataset, and on several other datasets.

3- I take the best five classifiers from each dataset (here I mean the model name and the genes used, as heuristics, not the fitted model itself), run all these classifiers on a new dataset, and keep iterating and improving.
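Steps 1-3 above can be sketched in a few lines of Python. This is only an illustration on synthetic data: a univariate mean-difference score stands in for RF feature importance, a nearest-centroid model stands in for the RF classifiers, and all sizes and numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_genes(X, y):
    """Rank genes by absolute difference of class means
    (a crude stand-in for RF feature importance)."""
    score = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(score)[::-1]

def centroid_accuracy(X, y, genes):
    """Score a nearest-centroid classifier on a fixed gene subset
    (resubstitution accuracy; use held-out data in practice)."""
    Xs = X[:, genes]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = (np.linalg.norm(Xs - c1, axis=1) <
            np.linalg.norm(Xs - c0, axis=1)).astype(int)
    return (pred == y).mean()

def best_subsets(X, y, n_top=100, subset_size=10, keep=5):
    """Steps 1-3: take the top genes, split them into ten-gene
    subsets, score each subset, keep the best few as heuristics."""
    top = rank_genes(X, y)[:n_top]
    subsets = [top[i:i + subset_size] for i in range(0, n_top, subset_size)]
    scored = sorted(((centroid_accuracy(X, y, s), tuple(s)) for s in subsets),
                    reverse=True)
    return scored[:keep]

# Synthetic "breast cancer" dataset: 60 samples x 500 genes,
# with the first 20 genes carrying real class signal.
X = rng.normal(size=(60, 500))
y = np.tile([0, 1], 30)       # balanced labels
X[y == 1, :20] += 1.5

heuristics = best_subsets(X, y)
```

Each returned heuristic is an accuracy plus a tuple of gene indices, i.e. exactly the portable (model name, gene subset) pairs described above, with no fitted model attached.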

The expected outcome should be a specific recipe for using gene expression data in classification. Instead of filtering, we would say, e.g.: take genes 40, 671, and 899 and apply RF to them; take genes 55, 1000, and 242 and apply logistic regression to them; take genes 44, ..., 555 and apply some other algorithm to them. Then, on a new classification problem, run cross-validation over these models to find the accurate ones.
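The selection step on a new dataset (cross-validating the inherited heuristics to find the accurate ones) might look like the following sketch. Again this uses synthetic data, a nearest-centroid model stands in for RF/logistic regression, and the gene indices in the library are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

def centroid_fit_predict(Xtr, ytr, Xte, genes):
    """Nearest-centroid classifier restricted to a fixed gene subset
    (a stand-in for RF / logistic regression)."""
    A, B = Xtr[:, genes], Xte[:, genes]
    c0, c1 = A[ytr == 0].mean(axis=0), A[ytr == 1].mean(axis=0)
    return (np.linalg.norm(B - c1, axis=1) <
            np.linalg.norm(B - c0, axis=1)).astype(int)

def cv_score(X, y, genes, k=5):
    """k-fold cross-validated accuracy of one (gene subset, model) heuristic."""
    idx = np.arange(len(y))
    accs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        pred = centroid_fit_predict(X[train], y[train], X[fold], genes)
        accs.append((pred == y[fold]).mean())
    return float(np.mean(accs))

# Heuristic library carried over from earlier datasets: each entry is
# only a set of gene indices (invented here), not a fitted model.
library = [[40, 67, 89], [5, 100, 24], [4, 44, 55]]

# New, small dataset: 40 samples x 200 genes, where genes 40/67/89
# genuinely separate the two classes.
X = rng.normal(size=(40, 200))
y = np.tile([0, 1], 20)       # balanced labels
X[np.ix_(y == 1, [40, 67, 89])] += 2.0

scores = {tuple(g): cv_score(X, y, g) for g in library}
best = max(scores, key=scores.get)
```

Only the heuristic whose genes actually carry signal in the new dataset should survive this cross-validation, which is the point of the proposed workflow.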

Since these models (heuristics) are based on information from several previous training runs, they should outperform models based only on the dataset at hand, especially if that dataset is very small.

Is this approach valid? Is it popular in bioinformatics under another name?

My wording isn't good; please ask for clarification if any point is unclear.

RNA-Seq machine learning
5.7 years ago
ivivek_ngs ★ 5.2k

This is not a new approach; you should read the gene expression prediction papers based on ML/AI from the past 2-3 years. There is already some amazing work out there.

However, you are asking a very broad question. Your query tries to address both classification and prediction, each of which is a monster on its own, and there is no single direct way to address them. When you want to use multi-omics layers of data from the same patients but with diverse features, you first need to extract the features via classifiers; these can span from traditional ML (like regression) to dimension reduction, etc. Once you have done that and know the associated features, you move on to prediction: fitting a model with something that again can be regression-based, SVR, RF, or deep learning such as an ANN or CNN.

There is no magic method that will directly name one algorithm for your data type and features. You will need to understand your data shape, features, and distributions and perform rigorous assessments. If you find that your data, after classification and unification of the feature layer, is non-linear in nature, you will have to try out the different prediction algorithms that can account for that non-linearity and give you the best predictors. Note that all of the above assumes multi-omics data, which either uses the same donor information or varies, depending on how the single-layer non-linear function is built and how further prediction is based on it.
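As an illustration of that two-stage workflow (feature extraction followed by model fitting for prediction), here is a minimal sketch on synthetic data. PCA via SVD and ordinary least squares are simple stand-ins for the heavier methods (SVR, RF, ANN/CNN) mentioned above:

```python
import numpy as np

rng = np.random.default_rng(2)

def pca_features(X, n_components):
    """Feature extraction: project centred data onto the top
    principal components, computed via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def fit_linear(Z, y):
    """Prediction stage: ordinary least squares on extracted features."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])   # add intercept column
    w, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    return w

def predict_linear(Z, w):
    return np.column_stack([np.ones(len(Z)), Z]) @ w

# Synthetic expression matrix: one latent factor drives the first 50
# of 300 genes, and the outcome tracks that factor.
factor = rng.normal(size=80)
X = rng.normal(size=(80, 300))
X[:, :50] += 3.0 * factor[:, None]
y = factor + 0.1 * rng.normal(size=80)

Z = pca_features(X, 10)            # extract features
w = fit_linear(Z[:60], y[:60])     # fit on 60 samples
pred = predict_linear(Z[60:], w)   # predict the held-out 20
```

Because the dominant principal component captures the latent factor, the downstream linear fit recovers the outcome well; with a genuinely non-linear feature layer, the second stage would need one of the non-linear predictors named above instead.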

Some of your queries remain unclear; addressing them would help frame the question better. Also look into the suggestions below:

  1. If you are not trying to combine different datasets, are you combining multi-omics data or just gene expression from different tissues?
  2. What do you mean by several models?
  3. How big is your dataset? Number of genes, number of patient samples, which disease models, etc. (everything adds a layer of complexity).
  4. A subset of genes for classification and prediction can be handled with simple regression, or for that matter with a nearest-centroid or KNN classifier.
  5. If it is just one gene expression dataset, how big is it that you would need an ML classification followed by an ML prediction? If your dataset is not multi-feature centric, you will not even need such a model; prediction can easily be done with algorithms like SNF. Have you looked into them?
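As a concrete illustration of point 4, a hand-rolled k-NN classifier on a fixed gene subset takes only a few lines (synthetic data; the gene indices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def knn_predict(Xtr, ytr, Xte, k=3):
    """Plain k-nearest-neighbours: majority vote over the k closest
    training samples under Euclidean distance."""
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (ytr[nearest].mean(axis=1) > 0.5).astype(int)

# A fixed 3-gene subset (indices invented for this illustration).
genes = [12, 57, 98]

Xtr = rng.normal(size=(50, 200))
ytr = np.tile([0, 1], 25)
Xtr[np.ix_(ytr == 1, genes)] += 2.0    # class 1 shifted on these genes

Xte = rng.normal(size=(20, 200))
yte = np.tile([0, 1], 10)
Xte[np.ix_(yte == 1, genes)] += 2.0

pred = knn_predict(Xtr[:, genes], ytr, Xte[:, genes])
acc = (pred == yte).mean()
```

When the gene subset truly separates the classes, even this trivial model classifies well, which is the answerer's point: reach for heavier ML only when simple baselines like this fail.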

These are some suggestions I can give. But just in case: only if simple algorithms cannot solve your biological gene-set queries for molecular subtype prediction should you think along ML lines, and then only if your data shape and distribution say so and you are addressing a multi-feature problem. Re-inventing the wheel is not necessary if a simple classification of gene sets for molecular subtype classification can do the trick; in that case any work done in ML stands null and void, since it will not be robust. Good luck!


I think my question was not completely understood; I will update it when I find better words. My question is not about integrating different genomic data types from the same person. It is about developing a model on one dataset and, to overcome small sample sizes, retraining the model on another classification problem from another dataset, repeating this until we have a mature model. All the datasets I am referring to are gene expression data (e.g. from TCGA). For example, my algorithm is as follows: 1- I find the top 100 important genes using RF on a breast cancer dataset, and I run several RF classifiers, each with only ten genes. 2- I repeat the same step on another dataset, e.g. a colon cancer dataset, and on several other datasets.

By classifier in this context, I mean a model name (e.g. RF) plus the set of genes (the names of the ten genes) that worked together to give accurate results.

3- I take the best five classifiers (i.e. the model name and the genes used) from each dataset, run all of them on a new dataset, and keep iterating and improving.

The expected outcome should be a specific recipe for using gene expression data in classification. Instead of filtering, we would say, e.g.: take genes 40, 671, and 899 and apply RF to them; take genes 55, 1000, and 242 and apply logistic regression to them; take genes 44, ..., 555 and apply some other algorithm to them.

Since these models (heuristics) are based on information from several previous training runs, they can be tuned for a new classification problem with something like cross-validation, and they should outperform models based only on the data at hand, especially if that data is very small.

Is this approach valid? Is it popular in bioinformatics under another name?

My wording isn't good; please ask for clarification if a point is unclear.


Well, it is valid and totally feasible, but as I said, have you looked into the publications of the 2018 TCGA or Pan-Cancer projects? There has been similar work based on deep learning and other ML for finding signature pathways or associating tumor subtype classifications; these mostly use ML to find gene signatures that can place tumors into specific molecular buckets. If I am not wrong, this also leads to what you are trying to achieve with your work: classifying the cancer types? How is it different from the papers below? As I understand it, you are doing both classification and outcome prediction of subtypes in different tumors via gene expression datasets. Another thing: when you say you want to classify, are you thinking of molecular classification or classification of tumor vs. non-tumor? Your explanation sounds like it will just bring out signatures pointing to specific pathways, i.e. learning to detect cancer-signature or exclusive pathways; I do not think you can do more than that with gene expression data alone. Still, take a look at the papers below, since they perform work along similar lines to what you describe.

Paper 1

Paper 2

Paper 3

