Hi all,
I am planning to develop a predictive model to identify a set of genes for a specific disease. I came across this discussion and the answer by Kevin Blighe (What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?) and I think the approach suits my project.
The data I currently have are Nanostring data from samples with combined diseases (raw assay counts, and counts normalized on a Healthy vs Non-Healthy grouping), where a single sample can have more than one disease. My initial thought is that I cannot train a model on these data directly. Instead, I would need to find the differentially expressed genes (DEGs) for the specific disease I am interested in from another source, and then use my Nanostring data as the test set.
My plan is:
- Find DEGs for each specific disease (from TCGA or another public database) - Set A
- Refine Set A against my Nanostring data using PCA clustering
- Further refine or validate the outcome of Step 2 using a new set of samples (recruit new patients and run Nanostring).
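
To make the first two steps concrete, here is a rough sketch on synthetic data. Everything in it is an assumption for illustration: the matrices are simulated rather than real TCGA or Nanostring counts, the helper names `welch_t` and `pca_scores` are hypothetical, and a plain per-gene Welch t-statistic stands in for a proper differential expression method (a real analysis would normally use a dedicated DE tool with multiple-testing correction).

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 50

# Simulated stand-in for a public cohort: 40 samples x 50 genes.
labels_pub = np.array([0] * 20 + [1] * 20)      # 0 = healthy, 1 = disease
public = rng.normal(0.0, 1.0, size=(40, n_genes))
public[labels_pub == 1, :5] += 3.0               # genes 0-4 are truly shifted


def welch_t(x, y):
    """Per-gene Welch t-statistic between two sample-by-gene matrices."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    vx, vy = x.var(axis=0, ddof=1), y.var(axis=0, ddof=1)
    return (mx - my) / np.sqrt(vx / len(x) + vy / len(y))


# Step 1: Set A = the genes with the largest absolute t-statistic.
t_stat = welch_t(public[labels_pub == 1], public[labels_pub == 0])
set_a = np.argsort(-np.abs(t_stat))[:5]

# Simulated stand-in for the Nanostring samples: 20 samples x 50 genes.
labels_ns = np.array([0] * 10 + [1] * 10)
nanostring = rng.normal(0.0, 1.0, size=(20, n_genes))
nanostring[labels_ns == 1, :5] += 3.0


def pca_scores(x, n_components=2):
    """Project samples onto the top principal components via SVD."""
    xc = x - x.mean(axis=0)                      # center each gene
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:n_components].T


# Step 2: PCA on the Nanostring samples restricted to Set A genes.
scores = pca_scores(nanostring[:, set_a])
pc1 = scores[:, 0]
gap = abs(pc1[labels_ns == 1].mean() - pc1[labels_ns == 0].mean())
print("Set A gene indices:", sorted(set_a.tolist()))
print("PC1 gap between groups:", round(float(gap), 2))
```

If the groups separate along the first principal components when only Set A genes are used, that suggests the signature carries over to the Nanostring samples. With real data, platform differences between the public cohort and Nanostring would need to be handled before this comparison means much.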
However, my Nanostring data were collected from cancer samples, which I understand are likely to show high variability. Could anyone offer some advice on this?