Hi everyone,
I have a microarray dataset (700 patients). I identified genes that correlate with the oncogene of interest, and I made two gene sets:
1) the genes that correlate most strongly with the oncogene
2) the genes that anti-correlate with the oncogene
I would like to separate the patients into two groups (for instance, the patients in whom the gene signature is present and the ones who do not express the signature).
What I would like to generate in the end is a dummy variable (1 if the signature is present in the patient, 0 if not). How can I establish that the signature is present in a patient? Is there also a test/metric to evaluate whether this is significant? If you can also suggest how to implement this in R, it would be great.
Thank you in advance for your help,
best
Salvo
I am not sure I understand your approach. Did you find it somewhere in the literature or come up with it yourself? It looks to me like you are mixing up two things: 1) machine learning, 2) limma roast.
With machine learning, you select a set of genes (also called feature selection), and then with a prediction model you can classify each sample into a group. For this kind of analysis you'll need predefined groups, for example patients with good or bad prognosis. I haven't seen any good feature selection methods based on correlation or anti-correlation, though.
With limma roast you can test a gene signature when comparing groups statistically. So not per sample, but per group contrast.
Apart from calculating the correlation with MYCN, I did other analyses.
I have a microarray dataset of patients, some MYCN-amplified and some not, and I performed logistic regression with an L1 penalty (lasso) to do feature selection. So you suggest using something like KNN to divide the patients into two groups according to the features selected in this way?
I also have the clinical data. The idea is to then fit a Cox proportional-hazards model and plot Kaplan-Meier curves with these groups.
I think that sounds more feasible: use the lasso-selected features for e.g. KNN or another ML method. Take a training and test set into account. Survival is another option; it depends on your research question (can you separate patients with and without MYCN amplification by gene expression profile, or can the profiles predict survival?).
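To make the training/test idea concrete, here is a minimal Python sketch of a KNN classifier evaluated on a held-out test set. Everything in it is an assumption for illustration: the data are simulated (two hypothetical lasso-selected genes, label 1 standing in for MYCN-amplified), and the function names, k = 3 and the 70/30 split are arbitrary choices. With real data the feature selection itself should be done on the training samples only, otherwise the test accuracy is optimistic.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]


def simulate(n_per_class=20, seed=0):
    """Simulated expression of two hypothetical selected genes (not real data)."""
    rng = random.Random(seed)
    samples, labels = [], []
    # label 1 = "amplified", label 0 = "not amplified"; centres are made up
    for label, centre in ((1, (2.0, -2.0)), (0, (-2.0, 2.0))):
        for _ in range(n_per_class):
            samples.append([rng.gauss(c, 1.0) for c in centre])
            labels.append(label)
    return samples, labels


def holdout_accuracy(samples, labels, test_fraction=0.3, k=3, seed=1):
    """Shuffle, hold out a test fraction, and score KNN on the held-out part."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    train_x = [samples[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    hits = sum(
        knn_predict(train_x, train_y, samples[i], k) == labels[i]
        for i in test_idx
    )
    return hits / n_test


samples, labels = simulate()
accuracy = holdout_accuracy(samples, labels)
print(f"held-out accuracy: {accuracy:.2f}")
```

The predicted class for each patient can then serve as the grouping variable. With only one split the accuracy estimate is noisy, so cross-validation would be the more usual choice on 700 patients.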
Through ML algorithms I am extracting the features that are important for determining whether a patient is MYCN-amplified or not. From these I would like to select some specific signatures (GO pathways, some genes upregulated in cell lines by some drugs) and separate the patients into two groups according to whether a signature is present or not. Then I would like to see which of these signatures has the greatest impact on prognosis.
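One simple way to turn a signature into the 0/1 variable asked about is a per-patient signature score followed by a cutoff. The Python sketch below is only an illustration under stated assumptions: the score is the mean z-score of the correlated genes minus the mean z-score of the anti-correlated genes, the cutoff is the median score, and the gene names and expression values are made up. None of this is a method proposed in the thread, and the median split is arbitrary; the resulting groups would still need to be checked against outcome, for example with the Cox model and Kaplan-Meier curves mentioned above.

```python
from statistics import mean, median, pstdev


def zscore_rows(expr):
    """Standardise each gene across patients. expr maps gene -> values per patient."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        # A gene with no variation carries no information, so it scores 0
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out


def signature_score(expr, up_genes, down_genes):
    """Mean z-score of correlated genes minus mean z-score of anti-correlated genes."""
    z = zscore_rows(expr)
    n_patients = len(next(iter(expr.values())))
    return [
        mean(z[g][i] for g in up_genes) - mean(z[g][i] for g in down_genes)
        for i in range(n_patients)
    ]


def dichotomise(scores):
    """1 if the score is above the median (signature 'present'), else 0."""
    cut = median(scores)
    return [1 if s > cut else 0 for s in scores]


# Made-up expression values for four patients; gene names are hypothetical
expr = {
    "geneA": [8.0, 7.5, 3.0, 2.5],  # correlated with the oncogene
    "geneB": [9.0, 8.0, 4.0, 3.0],  # correlated with the oncogene
    "geneC": [2.0, 3.0, 8.0, 9.0],  # anti-correlated
    "geneD": [1.0, 2.5, 7.0, 7.5],  # anti-correlated
}
scores = signature_score(expr, ["geneA", "geneB"], ["geneC", "geneD"])
dummy = dichotomise(scores)
print(dummy)  # [1, 1, 0, 0]
```

The same steps are straightforward to reproduce in R. Keeping the continuous score in the survival model alongside the 0/1 version is worth considering, since dichotomising discards information.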