Question

How to evaluate a biomarker signature in an independent dataset

0

7.0 years ago

JJ ▴ 760

Hi all,

I have RNA-seq samples from two groups (responders / non-responders). I am interested in generating a predictive gene signature which can separate the two groups. Based on a previous post, I have now decided to use lasso-penalized regression or elastic net regression.

So, now I am looking to evaluate this signature.

First, I can do this with a training and test set.
Second, I would like to test it in independently generated datasets: RNA-seq datasets, but also qPCR.

My question now is: how do I do this? The first one is straightforward. Just split the data (80% for building a predictive model, 20% for evaluating the model) and then make predictions on the test data. But how can I do this for an independently generated dataset? I assume I cannot directly use the final model on the independent datasets.
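For reference, the 80/20 split described above could be sketched roughly as follows. This is a minimal illustration in plain Python, not a recommendation of any particular package. The helper name `stratified_split` and the 20 samples per group are made up for the example. It splits within each group, so both responders and non-responders end up in the test set.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), drawing the test set separately within each group.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_group = {}
    for i, lab in enumerate(labels):
        by_group.setdefault(lab, []).append(i)

    train_idx, test_idx = [], []
    for lab in sorted(by_group):
        idx = by_group[lab][:]
        rng.shuffle(idx)
        # hold out at least one sample per group
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up cohort: 20 responders and 20 non-responders
labels = ["responder"] * 20 + ["non-responder"] * 20
train_idx, test_idx = stratified_split(labels)
```

The model is then fitted on `train_idx` only, and the held-out `test_idx` samples are touched once, for the final evaluation.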

Thank you for your help/input!

RNA-Seq • 2.6k views

score 4 · Accepted Answer · 2018-08-27

4

7.0 years ago

Kevin Blighe 89k

You can certainly use the same model on the new data and make predictions on it - this is where the real testing of the work comes into play. It just requires the same variable names (here, gene names) and obviously your new data should be on the same scale and processed in the same way. I've done this for predicting ethnicity using SNPs and it is surprisingly 'good', in terms of sensitivity / specificity and ROC analysis.
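As a rough sketch of what "use the same model on the new data" can look like: the model is frozen after training and applied to each new sample by gene name. Everything here is hypothetical (the gene names, coefficients, and the `predict_proba` helper). Storing the training means and standard deviations with the model is one assumed way of keeping the new data "on the same scale", not the only one.

```python
import math

# Hypothetical frozen model: coefficients and scaling learned on the training set
model = {
    "intercept": -0.5,
    "coef": {"GENE_A": 1.2, "GENE_B": -0.8},
    # (mean, sd) per gene, taken from the TRAINING data (assumed values)
    "scaling": {"GENE_A": (5.0, 2.0), "GENE_B": (3.0, 1.0)},
}

def predict_proba(model, sample):
    """Apply a frozen logistic model to one sample (dict of gene name -> expression)."""
    missing = [g for g in model["coef"] if g not in sample]
    if missing:
        # the new dataset must contain every gene in the signature
        raise KeyError(f"genes missing from new data: {missing}")
    z = model["intercept"]
    for gene, beta in model["coef"].items():
        mu, sd = model["scaling"][gene]
        # standardise with the training parameters, not the new cohort's own
        z += beta * (sample[gene] - mu) / sd
    return 1.0 / (1.0 + math.exp(-z))

# A new sample may carry extra genes; only the signature genes are used
new_sample = {"GENE_A": 7.0, "GENE_B": 3.0, "GENE_C": 9.9}
p = predict_proba(model, new_sample)
```

Nothing is re-fitted on the independent dataset. If a signature gene is absent, the sketch fails loudly instead of silently dropping it.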

My experience of using lasso-penalised regression is that it's not that great for identifying a definitive model. It can certainly help to reduce a large variable load to a more manageable number, like 50-100. One can then apply stepwise regression on the reduced dataset and further test a few final models for things like R2 shrinkage and through ROC analysis.
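For the ROC and sensitivity / specificity checks mentioned above, a minimal sketch in plain Python might look like this. The labels and scores are made up, and the function names are hypothetical. The AUC is computed as the fraction of (positive, negative) pairs ranked correctly, with ties counted as half.

```python
def sensitivity_specificity(y_true, scores, threshold=0.5):
    """Sensitivity and specificity when calling 'positive' at score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(y_true, scores):
    """AUC as the proportion of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up predictions: 1 = responder, 0 = non-responder
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1, 0.05]

sens, spec = sensitivity_specificity(y_true, scores)
auc = roc_auc(y_true, scores)
```

With only a handful of samples per group these estimates move in coarse steps (with 5 per group, sensitivity can only change in increments of 0.2), so they are best read as rough indications.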

Note that lasso-penalised, elastic-net, and ridge regression differ only in the value of alpha:

The elastic-net penalty is controlled by α, and bridges the gap between lasso (α = 1, the default) and ridge (α = 0). The tuning parameter λ controls the overall strength of the penalty.

[from https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html]
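To make the role of alpha concrete, here is a small sketch of the penalty term in the form used by glmnet, λ[(1 − α)/2 · ‖β‖₂² + α · ‖β‖₁]. The function name and the coefficient vector are made up for illustration. This only evaluates the penalty; it is not a fitting routine.

```python
def elastic_net_penalty(beta, alpha, lam):
    """Elastic-net penalty: lam * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|))."""
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) / 2.0 * l2_sq + alpha * l1)

beta = [0.5, -1.0, 0.0]  # made-up coefficients

lasso = elastic_net_penalty(beta, alpha=1.0, lam=0.1)  # pure L1 penalty
ridge = elastic_net_penalty(beta, alpha=0.0, lam=0.1)  # pure (halved) squared L2 penalty
mixed = elastic_net_penalty(beta, alpha=0.5, lam=0.1)  # halfway between the two
```

At α = 1 only the absolute values contribute, which is what drives coefficients to exactly zero; at α = 0 only the squared values do.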

------------------------------

I've put some PowerPoint notes on model testing on new data on my GitHub page: https://github.com/kevinblighe/Rtutorials

Kevin

0

Thank you so much for your answer!

Does this still work well across different data types, for example a model built on RNA-seq and applied to qPCR?

Thank you for your input on how to perform the regression. I will now use penalised regression, trying out different alphas, and then apply stepwise regression if too many variables remain. I have decided on 15 samples per group for the discovery/training set and 5 per group for the validation/test set. Thanks again!

7.0 years ago by JJ ▴ 760

2

A universal tenet of making predictions is that the degree to which they can be trusted is dependent upon how similar the underlying data is to that used to train/fit the model. Using a model fit on one data type to make predictions on a significantly different data type is going to lead to a world of headaches.

7.0 years ago by Devon Ryan 105k

0

Thanks for the input. But is applying a model built on RNA-seq data to an independent RNA-seq dataset generally accepted?

Is there anything you could suggest on how to translate such findings between data types?

7.0 years ago by JJ ▴ 760

2

Yes, that's acceptable since the model was built on similar data. If you start changing library protocols and such then the results will get less reliable, of course. I've never tried applying models fit on RNA-seq to qPCR data, so I don't know off-hand exactly what transformations would be best. Perhaps Kevin has done that, but I suspect you'll have to find some matched datasets and play around with the data to see what's reasonable.

7.0 years ago by Devon Ryan 105k

1

Yes, as Devon alluded to, applying the RNA-seq model's predictions to qPCR data may not be valid. The general process would be this:

1. Build model predictor from RNA-seq training data
2. Perform model predictions on both the training and testing data from the same RNA-seq experiment
3. Perform model predictions on independent RNA-seq experiments processed in the same way (optional)
4. Put your final panel of genes to the test by independently re-performing differential analysis / model building, but, this time, using a targeted method, such as high-throughput qPCR, NanoString, etc., and usually on a higher number of samples.
5. Further refine your model based on #4

In the past, we took genes that were differentially expressed in RNA-seq and tested them on NanoString data. We then performed model building only on the NanoString data itself. We did the same for RNA-seq and Fluidigm data. There is no real definitive way to do this, though.