Hello Fellow Scientists,
I have 5 microarray datasets from different platforms. Each dataset has disease and healthy samples. A few datasets have only 4 disease and 3 healthy samples, while others have more. I wanted to run ML algorithms on them, and since ML requires a large number of samples, I was trying to find a way to combine these datasets. Here is what I did, and I would like to know whether this method is correct.
- I combined (concatenated) the expression matrices (gcrma / neqc normalized) of all datasets into one, keeping only the genes measured in all of them. This gave around 8000 genes as rows and 200 samples as columns.
- I used the scale() function in R to convert the expression values into z-scores.
- I then used this z-score matrix and a few gene signatures as input for GSVA.
- The GSVA output (gene signatures as rows, samples as columns, enrichment scores between -1 and 1) was used as input for ML.
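
To make the first two steps concrete, here is a minimal sketch in Python (for illustration only; the actual work was done in R). It keeps the genes shared by all datasets and concatenates the samples. One detail worth checking: R's scale() centers and scales columns, so on a genes-by-samples matrix it standardizes each sample rather than each gene. The sketch instead z-scores each gene within each dataset before concatenating, which is one possible alternative and not necessarily what was done above. The function names and the toy gene values are hypothetical.

```python
from statistics import mean, stdev


def common_genes(datasets):
    """Return the sorted list of genes present in every dataset."""
    shared = set(datasets[0])
    for dataset in datasets[1:]:
        shared &= set(dataset)
    return sorted(shared)


def zscore(values):
    """Z-score a list of values using the sample standard deviation (n - 1)."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        # A gene with no variation carries no information; return zeros.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def combine(datasets):
    """Z-score each shared gene within each dataset, then concatenate samples."""
    genes = common_genes(datasets)
    combined = {gene: [] for gene in genes}
    for dataset in datasets:
        for gene in genes:
            combined[gene].extend(zscore(dataset[gene]))
    return combined


# Toy data (made up): each dataset maps gene -> expression values per sample.
ds_a = {"TP53": [5.0, 6.0, 7.0], "BRCA1": [2.0, 4.0, 6.0], "EGFR": [1.0, 1.5, 2.0]}
ds_b = {"TP53": [10.0, 12.0, 14.0, 16.0], "BRCA1": [8.0, 8.5, 9.0, 9.5], "MYC": [3.0, 3.0, 3.0, 3.0]}

combined = combine([ds_a, ds_b])
for gene, values in combined.items():
    print(gene, [round(v, 2) for v in values])
```

Only TP53 and BRCA1 survive here because EGFR and MYC are each missing from one dataset. Whether per-gene, per-dataset scaling is enough to remove platform differences is part of what I am asking about.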
Is this method correct? What are some other ways to run ML algorithms on gene expression data? The goal for running ML is to find genes / gene signatures that separate disease from healthy.
Thank you Kevin
You are welcome, SnehaS