I presume that you know how to derive a final predictive model? 1,300 genes seems too large. I have posted various answers in relation to this.
Once you decide on your final model, you can test sensitivity, specificity, precision, accuracy, and AUC (among other metrics) by applying your model to the same data on which the model was built and also to new data, i.e., a validation cohort.
Here:
- our regression model is called bayesmodel
- our data (the data on which the model was built or the new data) is called modeling
- our outcome / phenotype is stored in a variable called Disease (0/1) in modeling
require(ROCR)
# make a model prediction
pred <- prediction(predict(bayesmodel, type="response"), modeling$Disease)
# determine sensitivity and specificity
ss <- performance(pred, "sens", "spec")
print("Predicted probability cut-off:")
ss@alpha.values[[1]][which.max(ss@x.values[[1]] + ss@y.values[[1]])]
predicted.prob <- round(ss@alpha.values[[1]][which.max(ss@x.values[[1]] + ss@y.values[[1]])], 2)
print("Sensitivity and specificity, weighed equally, are:")
max(ss@x.values[[1]] + ss@y.values[[1]]) / 2
sens.spec <- round(max(ss@x.values[[1]] + ss@y.values[[1]]) / 2, 2)
# determine cost function
cost.perf <- performance(pred, measure="cost")
cost <- pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]
# determine precision
prec.perf <- performance(pred, measure="prec")
ind <- which.max(slot(prec.perf, "y.values")[[1]])
prec <- slot(prec.perf, "y.values")[[1]][ind]
prec.cutoff <- slot(prec.perf, "x.values")[[1]][ind]
print(c(precision=prec, cutoff=prec.cutoff))
# determine accuracy
acc.perf <- performance(pred, measure="acc")
ind <- which.max(slot(acc.perf, "y.values")[[1]])
acc <- slot(acc.perf, "y.values")[[1]][ind]
acc.cutoff <- slot(acc.perf, "x.values")[[1]][ind]
print(c(accuracy=acc, cutoff=acc.cutoff))
# plot ROC 'curve' and optimum accuracy
par(cex=0.8, mfrow=c(1,2))
require(pROC)
roc <- roc(na.omit(modeling$Disease), predict(bayesmodel, type="response"), smooth=FALSE)
plot.roc(roc, add=FALSE, grid=TRUE, grid.lwd=2, col="red", main="")
text(0.25, 0.2, col="red",
paste("Variant signature:\nAUC [L95-U95]:",
paste(round(ci.auc(roc)[2], 3), " [",
round(ci.auc(roc)[1], 3), "-",
round(ci.auc(roc)[3], 3), "]", sep="", collapse=","), sep=" "),
cex=1.1, font=2)
text(0.25, 0.05, col="red",
paste("Sensitivity/Specificity: ", sens.spec*100, "%", sep=""), cex=1.1, font=2)
# accuracy
plot(acc.perf, main="Accuracy, variant signature", avg="vertical", spread.estimate="boxplot")
abline(v=acc.cutoff, lwd=1, lty=2, col="red")
Ignore the odd-looking ROC curve for now. I've also edited out key info from these plots; hence, the weird text labels.
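To do the same on a validation cohort, the calls are identical; here is a minimal sketch, assuming a hypothetical data frame called validation that contains the same predictor columns and a 0/1 Disease variable:
# apply the fitted model to a (hypothetical) validation cohort
pred.val <- prediction(
  predict(bayesmodel, newdata=validation, type="response"),
  validation$Disease)
# sensitivity and specificity, weighed equally, on the validation data
ss.val <- performance(pred.val, "sens", "spec")
max(ss.val@x.values[[1]] + ss.val@y.values[[1]]) / 2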
Hope that it helps!
Kevin
Thanks, Kevin, for the answer. Should the predictive model be applied to raw counts or to normalised expression counts? Also, in this case (RNA-seq), would a Bayesian predictive model be a good choice?
You should build your model on the normalised, logged counts, which would hopefully follow a normal / Gaussian distribution. If you use the normalised, un-logged counts, then you'd have to run a negative binomial regression via glm.nb, as this expects your data to be distributed as a negative binomial (a distribution that RNA-seq raw and normalised un-logged counts naturally follow). When you log these or scale them to Z-scores, they then more closely follow a Gaussian.
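For illustration, a minimal sketch of both options, assuming a hypothetical count variable gene1 in modeling (the names are placeholders, not from the original code):
require(MASS)
# negative binomial regression on the normalised, un-logged counts,
# with the gene counts as the response
nb.model <- glm.nb(gene1 ~ Disease, data=modeling)
# after logging, the values more closely follow a Gaussian, so a
# standard logistic regression on the logged counts is fine
modeling$gene1.log <- log2(modeling$gene1 + 1)
logit.model <- glm(Disease ~ gene1.log, data=modeling,
  family=binomial(link="logit"))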
You do not have to use a Bayesian logistic regression model. I did in the case above for a very specific reason. You could easily just use glm() (generalised linear model) in R with family=binomial(link="logit") - this, then, assumes that your y variable is encoded as 0/1, 0/1/2, et cetera, and that your data is normally distributed.
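A minimal sketch of that standard, non-Bayesian equivalent, again assuming the modeling data frame with logged expression values and a 0/1 Disease column:
# ordinary logistic regression as a drop-in replacement for bayesglm
glmmodel <- glm(Disease ~ ., data=modeling,
  family=binomial(link="logit"))
# predicted probabilities, usable with ROCR exactly as above
head(predict(glmmodel, type="response"))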
Thanks for the reply. In my case, I have no phenotype/metadata information other than the defined 3 subtypes and the age group (adult or pediatric patients). I tried an lm() model on the dataset, using the top 5 up-regulated genes from the 3 subtypes to build the lm() model. Then I used stepAIC for a prediction score, and later performed model prediction using prediction(). I am not sure whether this is the right way of doing it; it would be great if you could walk me through it a little more, thank you. Additionally, I have modified my question a little bit. The main part of the question is: in both the original and validation cohorts, unsupervised HC on the 1300 differentially expressed (DE) genes showed concordance in classifying the molecular subtypes of leukemia. Now, the question is how to calculate the sensitivity and specificity of these 1300 genes?
I see, so the 1300 genes segregate the sub-types via hierarchical clustering. What I would do after that is (see the sketch below):
- reduce the 1300 genes via cv.glmnet()
- take the genes that survive cv.glmnet() and reduce further via stepAIC()
- build your final model from the stepAIC() output
With your final models, you would then test for sensitivity, specificity, precision, accuracy, and AUC, as above.
Hopefully, your final models will be fairly compact and have high prediction capability.
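A minimal sketch of those reduction steps, assuming a hypothetical samples-by-genes matrix exprs holding the logged counts for the 1300 genes and a 3-level factor subtype (both names are placeholders):
require(glmnet)
require(MASS)
require(nnet)
# cross-validated lasso; multinomial because there are 3 sub-types
cvfit <- cv.glmnet(as.matrix(exprs), subtype, family="multinomial", alpha=1)
# keep the genes with non-zero coefficients at the chosen lambda
coefs <- coef(cvfit, s="lambda.min")
surviving <- setdiff(
  unique(unlist(lapply(coefs, function(x) rownames(x)[which(x[,1] != 0)]))),
  "(Intercept)")
# reduce further by stepwise selection; multinom() handles the
# 3-level outcome and works with stepAIC()
df <- data.frame(subtype=subtype, exprs[, surviving, drop=FALSE])
full <- multinom(subtype ~ ., data=df, trace=FALSE)
final <- stepAIC(full, direction="both", trace=FALSE)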
Alternatively, you could do something like randomForest on the 1300 genes (sketched below), or just go ahead and fit a weighted Bayesian multinomial logistic regression model (bayesglm()), where the y variable encodes the 3 sub-types. The x predictors would be your large list of genes. A standard glm will not work because there are too many variables.
There is never a fixed way of doing this, and I find that the methodology changes each time that one needs to build a model. In the approach that you describe above, you indicate that you just took 5 genes from each sub-type. Did you identify these 5 as the most differentially expressed?
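On the randomForest option, a minimal sketch using the same hypothetical exprs matrix and subtype factor:
require(randomForest)
set.seed(1)
rf <- randomForest(x=exprs, y=subtype, ntree=1000, importance=TRUE)
# the out-of-bag confusion matrix gives a first look at per-subtype
# classification error
rf$confusion
# the most discriminating genes, ranked by mean decrease in Gini impurity
imp <- importance(rf)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing=TRUE), ], 10)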
Here is another article, which seems useful: https://rstudio-pubs-static.s3.amazonaws.com/209720_d22eaa2eb5b54cb9b73aba094db8087a.html
Thank you, sir, for the detailed description. I will follow those steps. Yes, these five were the most significantly up-regulated within each subtype.