Traditional machine learning workflows on gene expression data apply filtering and feature selection, then fit a classifier (RF, SVM, or an ensemble) on the selected features. Each dataset is handled from scratch.
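For concreteness, that standard single-dataset pipeline looks roughly like this (a minimal sklearn sketch; the filter threshold, the value of k, and the choice of RF are arbitrary placeholders, not a recommendation):

```python
# Rough sketch of the usual single-dataset workflow (placeholder thresholds/models).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X: samples x genes expression matrix, y: class labels (toy stand-ins here)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))
y = rng.integers(0, 2, size=100)

pipeline = Pipeline([
    ("filter", VarianceThreshold(threshold=0.1)),   # crude expression filter
    ("select", SelectKBest(f_classif, k=100)),      # univariate feature selection
    ("clf", RandomForestClassifier(n_estimators=500, random_state=0)),
])

# Everything is learned from this one dataset and discarded afterwards.
print(cross_val_score(pipeline, X, y, cv=5).mean())
```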
I am not trying to combine different datasets (to avoid batch effects), but rather to develop several models (i.e., collect subsets of genes that predict well together under particular models) and keep training on several datasets until I end up with a set of models (each containing a different subset of genes, different parameters, and possibly a different model type) that can be used in any classification problem on gene expression data. Is this approach new? Do you have any suggested ideas to try? This is not typical transfer learning, but any help on how to carry information from one gene expression dataset to another would be really appreciated.
NB:
My question is not about integrating different genomic data types from the same person. It is about developing a model on one dataset and, to overcome small sample sizes, retraining it on another classification problem from another dataset (by retraining I mean taking the model name, e.g. RF, and the set of genes, e.g. the names of the ten genes, that worked well together, and reusing them), and continuing until we have a set of mature heuristics. All the datasets I am referring to are gene expression data (e.g. from TCGA). For example, my algorithm works like this:
1- I find the top 100 most important genes using RF on a breast cancer dataset, and I run several RF classifiers, each using only ten of those genes (see the code sketch after this list).
2- I repeat the same step on another dataset, e.g. a colon cancer dataset, and on several other datasets.
3- I take the best five classifiers from each dataset (here I mean the model name and the genes used, as heuristics, not the fitted model itself), run all of them on a new dataset, and keep iterating and improving.
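Roughly, steps 1-2 plus the "keep the best five" part of step 3 could look like this for a single dataset (only a sketch; the subset sizes, model settings, and function name are placeholders):

```python
# Sketch of steps 1-2: rank genes by RF importance, evaluate 10-gene subsets,
# and keep the best (model name, gene subset) pairs as "heuristics".
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def mine_heuristics(X, y, gene_names, top_k=100, group_size=10, keep=5):
    """Return the `keep` best (model name, gene subset) pairs for one dataset."""
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    ranked = np.argsort(rf.feature_importances_)[::-1][:top_k]   # top 100 genes

    candidates = []
    for start in range(0, top_k, group_size):                    # 10-gene subsets
        idx = ranked[start:start + group_size]
        score = cross_val_score(
            RandomForestClassifier(n_estimators=200, random_state=0),
            X[:, idx], y, cv=5,
        ).mean()
        candidates.append((score, ("RF", [gene_names[i] for i in idx])))

    candidates.sort(key=lambda c: c[0], reverse=True)
    return [heuristic for _, heuristic in candidates[:keep]]     # best five per dataset

# Repeat over several datasets (breast, colon, ...) and pool the heuristics:
# library = []
# for X, y, gene_names in datasets:
#     library.extend(mine_heuristics(X, y, gene_names))
```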
The expected outcome would be a specific recipe for using gene expression data for classification. Instead of filtering, we would say, for example: take genes 40, 671, 899 and apply RF to them; take genes 55, 1000, 242 and apply logistic regression to them; take genes 44, ..., 555 and apply some other algorithm to them. Then, on a new classification problem, run cross-validation over these models to find the accurate ones.
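The "run cross-validation over these models" part on a new dataset could look roughly like this (again only a sketch; `library`, `evaluate_library`, and `gene_index` are hypothetical names, and only RF and logistic regression are wired in):

```python
# Sketch: score every collected heuristic on a new dataset with cross-validation
# and keep the ones that transfer well. `library` is the pooled list of
# (model_name, gene_list) pairs collected from the previous datasets.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

MODELS = {
    "RF": lambda: RandomForestClassifier(n_estimators=200, random_state=0),
    "LR": lambda: LogisticRegression(max_iter=1000),
}

def evaluate_library(library, X_new, y_new, gene_index):
    """gene_index maps gene names to column positions in the new expression matrix."""
    results = []
    for model_name, genes in library:
        cols = [gene_index[g] for g in genes if g in gene_index]
        if len(cols) < len(genes):
            continue                      # skip heuristics whose genes are missing here
        score = cross_val_score(MODELS[model_name](), X_new[:, cols], y_new, cv=5).mean()
        results.append((score, model_name, genes))
    return sorted(results, reverse=True)  # best-transferring heuristics first

# top_heuristics = evaluate_library(library, X_new, y_new, gene_index)[:5]
```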
Since these models (heuristics) are based on information from several previous training runs, they should outperform models trained only on the dataset at hand, especially when that dataset is very small.
Is this approach valid? Is it already known in bioinformatics under another name?
My wording may not be the best, so please ask for clarification if any point is unclear.
Well, it is valid and totally feasible, but have you looked into the 2018 TCGA / Pan-Cancer publications? There has been similar work based on deep learning, and on other ML approaches, for finding signature pathways or associating tumor subtype classifications. These works mostly use ML to find gene signatures that can place tumors into specific molecular buckets. If I am not mistaken, that would also lead to what you are trying to achieve, i.e. classifying the cancer types. How is your approach different from the papers below? As far as I understand, you are doing both classification and outcome/subtype prediction in different tumors via gene expression datasets.
Another thing: when you say you want to classify, do you mean molecular classification or tumor vs. non-tumor classification? From your explanation, it seems the approach will mainly produce signatures that point to specific pathways, i.e. it will learn to detect cancer signature pathways or exclusive pathways. I do not think you can do much more than that with gene expression data alone. Still, take a look at the papers below, since they follow similar lines of work to what you describe.
Paper 1
Paper 2
Paper 3