Question

A simple regression in R

0

Entering edit mode

5.6 years ago

Learner ▴ 280

I am looking for a way to calculate AUC for a drug response . The data looks like the following

df<- structure(list(Compound = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"), conc = c(0.00372, 
0.01368, 0.05004, 0.1836, 0.6732, 2.472, 9.048, 33.24, 121.2, 
446.4, 1632, 6000), Count = c(10200000, 10300000, 10200000, 10200000, 
9927000, 9400000, 8464040, 6513840, 4349320, 3071480, 5989320, 
8444560)), class = "data.frame", row.names = c(NA, -12L))

is there anyone who could help me

R • 3.2k views

ADD COMMENT • link 5.6 years ago by Learner ▴ 280

Kevin Blighe · Answer 1 · 2019-04-08

4

Entering edit mode

5.6 years ago

Kevin Blighe 88k

Your data looks like this:

df
   Compound      conc    Count conc.pred
1         A 3.720e-03 10200000  565.1263
2         A 1.368e-02 10300000  559.3499
3         A 5.004e-02 10200000  565.1263
4         A 1.836e-01 10200000  565.1263
5         A 6.732e-01  9927000  580.8959
6         A 2.472e+00  9400000  611.3376
7         A 9.048e+00  8464040  665.4025
8         A 3.324e+01  6513840  778.0542
9         A 1.212e+02  4349320  903.0858
10        A 4.464e+02  3071480  976.8991
11        A 1.632e+03  5989320  808.3526
12        A 6.000e+03  8444560  666.5278

What are you aiming to model? You cannot do anything with the compound column because it only has one value.

The only thing that we can do here is build a linear regression model between conc and Count:

model <- lm(conc ~ Count, data = df)
summary(model)

Call:
lm(formula = conc ~ Count, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-781.9 -620.7 -565.1 -552.1 5333.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.154e+03  1.836e+03   0.629    0.544
Count       -5.776e-05  2.176e-04  -0.265    0.796

Residual standard error: 1816 on 10 degrees of freedom
Multiple R-squared:  0.006999,  Adjusted R-squared:  -0.0923 
F-statistic: 0.07049 on 1 and 10 DF,  p-value: 0.796

It looks like there is no statistically significant relationship between these.

We can assess R^2 shrinkage using 3-fold cross-validation:

library(bootstrap)
theta.fit <- function(x, y) { lsfit(x,y) }
theta.predict <- function(fit, x) { cbind(1,x) %*% fit$coef }
x <- as.matrix(df[,which(colnames(df) %in% names(summary(model)$coefficients[,1]))])
y <- as.matrix(df[,c("conc")]) 
results <- crossval(x, y, theta.fit, theta.predict, ngroup = 3)

# Raw R^2 
cor(y, model$fitted.values)**2
            [,1]
[1,] 0.006999459

# cross-validated R^2
cor(y, results$cv.fit)**2
           [,1]
[1,] 0.07150013

It does not look like a good model, in fairness.

We can also plot the model-predicted conc values versus the real ones:

require(ggplot2)
df$conc.pred = predict(model)
ggplot(df, aes(x = conc, y = conc.pred)) +
  geom_point() +
  geom_smooth(method='lm',formula=y~x) +
  stat_smooth(method="lm", fullrange=TRUE) +
  ggtitle("My linear model")

The confidence intervals (grey shade) are extraordinary.

Finally, we can plot the ROC curve:

require(pROC)
roc <- roc(df$conc, fitted(model), smooth=FALSE)
plot.roc(
  roc,
  grid=TRUE,
  grid.lwd=2,
  col="royalblue",
  main="My linear model")
text(0.3, 0.45,
  col="royalblue",
  paste("AUC (L95, AUC, U95):",
    paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "),
  cex=1.0)

Hopefully this can 'foment' some ideas for you.

If you have other compounds in your data and want to model them as the y variables in the model, then you can switch to using a generalised linear model (glm()) in place of the linear model (lm()).

Kevin

ADD COMMENT • link 5.6 years ago by Kevin Blighe 88k

0

Entering edit mode

@Kevin Blighe Thanks Kevin, I thought that the concentration-response curves is based on the percent-viability scores which is often fit using smooth cubic splines for multivariate data . what I am mainly looking for is to calculate the AUC which should be only a value, right?

ADD REPLY • link 5.6 years ago by Learner ▴ 280

1

Entering edit mode

I wrote and re-wrote the response to your comment a few times and decided on this: you should have added much more information to your original question (or perhaps I should have requested it). In my very brief example that I give (above), if the model assumptions were correct, then you would have a single value for AUC.

You are talking about multivariate data, but what do you mean? You just have 3 columns in your data, and the Compound column is just all 'A' values.

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

0

Entering edit mode

@Kevin Blighe, So, let me give more information and see your comments, I will absolutely like and accept your answer even if it does not lead me to a correct answer. So, lets assume that I have a cell line which I treat with a compound in several concentration and I obtain the sensitivity values. If I have a replicate, should I treat them as a multivariate calibration or just do it over each replicate separately?

ADD REPLY • link 5.6 years ago by Learner ▴ 280

0

Entering edit mode

There is no right or wrong, I think. You can try them separate and then together (in a multivariate analysis). In the multivariate analysis, you could probably add cell-line as a covariate to see if there are cell-line specific effects in relation to drug sensitivity, too.

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

0

Entering edit mode

@Kevin Blighe could you please use the following data in your example? I am just wondering how you could do that

df<-structure(list(Drug_name = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "DrugT", class = "factor"), 
        Cell_line = structure(c(2L, 3L, 1L, 4L, 5L), .Label = c("Celline3", 
        "Cellline1", "Cellline2", "Cellline4", "Cellline5"), class = "factor"), 
        Concentrations_nM = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "3.1e-05,0.000114,0.000417,0.00153,0.00561,0.0206,0.0754,0.277,1.01,3.72", class = "factor"), 
        Responses = structure(c(4L, 2L, 3L, 5L, 1L), .Label = c("1,0.948118156866697,0.920303987764611,1.03610743904536,1.08332987533419,0.960086785898477,0.765642506120658,0.572520170014998,0.375835106792894,0.254180720963181", 
        "1,0.953703703703704,1,0.972222222222222,0.972222222222222,0.911407407407407,0.835722222222222,0.651211111111111,0.454325925925926,0.293437037037037", 
        "1,0.97679357161345,0.923286430268948,0.933051365914492,0.938592672768771,0.834813760323772,0.738458551965929,0.56534271394621,0.428992446881103,0.287483234899645", 
        "1,1.02970297029703,1.01980198019802,1.00990099009901,1.03960396039604,0.945219801980198,0.887081188118812,0.657615841584158,0.454776237623762,0.319762376237624", 
        "1,1.03893565587121,0.967905086838,1.06062943976775,0.886153353792697,0.897337425206165,0.719798149513376,0.553963001697327,0.372291848829301,0.301689897760328"
        ), class = "factor")), class = "data.frame", row.names = c(NA, 
    -5L))

ADD REPLY • link updated 5.6 years ago by Kevin Blighe 88k • written 5.6 years ago by Learner ▴ 280

3

Entering edit mode

Hey, can you not simply repeat my same code with your new data?

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

0

Entering edit mode

@Kevin Blighe do you know the cell to cell interaction

ADD REPLY • link 5.6 years ago by Learner ▴ 280

0

Entering edit mode

@Kevin Blighe you wrote me this to see if there are cell-line specific effects in relation to drug sensitivity actually I have been reading a lot about this. no one uses lm for such a analysis. the worse one can do is liner because basically it is not linear but nonlinear and sigmoidal ! however, thanks for all your time

ADD REPLY • link 5.6 years ago by Learner ▴ 280