Hi all,
I have RNA-seq samples from two groups (responders / non-responders). I am interested in generating a predictive gene signature which can separate the two groups. Based on a previous post, I have now decided to use lasso-penalized regression or elastic net regression.
So, now I am looking to evaluate this signature.
- First, I can do this with a training and test set.
- Second, I would like to test it in independently generated datasets: RNA-seq datasets, but also qPCR.
My question now is: how do I do this? The first one is straightforward: just split the data (80% for building the predictive model, 20% for evaluating it) and then make predictions on the test data. But how can I do this for an independently generated dataset? I assume I cannot directly use the final model on the independent datasets.
Thank you for your help/input!
Thank you so much for your answer!
Does this still work well when you have different data types, e.g. a model built on RNA-seq and applied to qPCR?
Thank you for your input on how to perform the regression. So I will now use penalised regression, trying out different alphas, and then apply stepwise regression if too many variables still remain. I have decided on 15 samples per group for the discovery/training set and 5 per group for the validation/test set. Thanks again!!!
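For what it's worth, here is a minimal sketch of the 15/5 per-group split described above, using only the Python standard library. The sample IDs, the group labels, and the `stratified_split` helper are all made up for illustration; the only assumption taken from the post is 20 samples per group with 5 per group held out.

```python
import random

def stratified_split(labels, n_test_per_group, seed=0):
    """Hold out n_test_per_group samples from every group.

    labels: dict mapping sample ID -> group name.
    Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_group = {}
    for sample_id, group in labels.items():
        by_group.setdefault(group, []).append(sample_id)

    train_ids, test_ids = [], []
    for group in sorted(by_group):
        ids = sorted(by_group[group])
        rng.shuffle(ids)
        test_ids.extend(ids[:n_test_per_group])
        train_ids.extend(ids[n_test_per_group:])
    return train_ids, test_ids

# Hypothetical sample sheet: 20 responders and 20 non-responders.
labels = {f"R{i:02d}": "responder" for i in range(1, 21)}
labels.update({f"NR{i:02d}": "non-responder" for i in range(1, 21)})

train_ids, test_ids = stratified_split(labels, n_test_per_group=5, seed=42)
print(len(train_ids), "training samples,", len(test_ids), "test samples")
```

Splitting within each group, rather than across the pooled samples, keeps both groups equally represented in the small test set.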
A universal tenet of making predictions is that the degree to which they can be trusted is dependent upon how similar the underlying data is to that used to train/fit the model. Using a model fit on one data type to make predictions on a significantly different data type is going to lead to a world of headaches.
Thanks for the input. But is applying a model built on RNA-seq data to an independent RNA-seq dataset generally accepted?
Is there anything you could suggest on how to translate such findings between data types?
Yes, that's acceptable since the model was built on similar data. If you start changing library protocols and such then the results will get less reliable, of course. I've never tried applying models fit on RNA-seq to qPCR data, so I don't know off-hand exactly what transformations would be best. Perhaps Kevin has done that, but I suspect you'll have to find some matched datasets and play around with the data to see what's reasonable.
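To make "applying the model to an independent RNA-seq dataset" concrete, here is a rough sketch of scoring a new sample with a frozen logistic model. Everything in it is hypothetical: the gene names, the coefficients, and the stored training means and SDs are placeholders, and it assumes the new data went through the same normalisation and log transformation as the training data. Reusing the training-set mean and SD for scaling, rather than re-scaling within the new cohort, is one possible choice and not a settled recommendation.

```python
import math

# Hypothetical frozen model exported from the training cohort:
# intercept, per-gene coefficient, and training-set mean/SD per gene.
MODEL = {
    "intercept": -0.2,
    "genes": {
        "GENE_A": {"coef": 1.1, "mean": 5.0, "sd": 1.5},
        "GENE_B": {"coef": -0.8, "mean": 7.2, "sd": 0.9},
    },
}

def predict_probability(sample, model=MODEL):
    """Return the predicted probability of 'responder' for one sample.

    sample: dict mapping gene name -> log-scale expression value.
    Nothing is refit here; the coefficients and scaling stay fixed.
    """
    score = model["intercept"]
    for gene, params in model["genes"].items():
        if gene not in sample:
            raise KeyError(f"signature gene missing from new dataset: {gene}")
        # Standardise with the TRAINING mean/SD (an assumption, see above).
        z = (sample[gene] - params["mean"]) / params["sd"]
        score += params["coef"] * z
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical sample from the independent dataset.
new_sample = {"GENE_A": 6.5, "GENE_B": 6.3}
print(round(predict_probability(new_sample), 3))
```

The explicit check for missing genes matters in practice, since an independent dataset may not report every gene in the signature.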
Yes, as alluded to by Devon, performing the RNA-seq model predictions on qPCR data may not be valid. The general process would be this:
In the past, what we did was take genes from RNA-seq that were differentially expressed and then test these on NanoString data. We then performed model building only on the NanoString data itself. We also did the same for RNA-seq and Fluidigm data. There is no real definitive way to do this, though.
Thank you so much for your input!