Question

Calculating ROC50 in R

0

Entering edit mode

13 months ago

siu ▴ 160

Hi everyone,

I have a question regarding the ROC50 calculation for protein remote homology detection.

I have gone through different papers

"Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching"
"https://www.nature.com/articles/srep32333#Sec6"
"https://www.pnas.org/doi/10.1073/pnas.0308067101#sec-1"
"https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1842-2"

I have done homology search of my queries and I have the table (almost 50000 hits)

Query   Query_Fam   Target  Target_Fam  eValue  bitScore
d1i50a  PF04997 d1i50a  PF04997 0   2889
d1i50b  PF04565 d1i50b  PF04565 0   2420
d1i6vd  PF00623 d1i6vd  PF00623 0   2327
d1i6vc  PF04563 d1i6vc  PF04563 0   2194
d1htya  PF09261 d1htya  PF09261 0   2098
d1eula  PF00689 d1eula  PF00689 0   1974
d1qbkb  PF03810 d1qbkb  PF03810 0   1801
d1ygpa  PF00343 d1ygpa  PF00343 0   1774
d1ceza  PF14700 d1ceza  PF14700 0   1774
d1fiy   PF00311 d1fiy   PF00311 0   1749
d1qgra  PF13513 d1qgra  PF13513 0   1730
d2btva  PF01700 d2btva  PF01700 0   1693
d1a8i   PF00343 d1a8i   PF00343 0   1683
d1em6a  PF00343 d1em6a  PF00343 0   1637
d1qm5a  PF00343 d1qm5a  PF00343 0   1623
d2mysa2 PF00063 d2mysa2 PF00063 0   1603
g1gk9.1 PF01804 g1gk9.1 PF01804 0   1585
d1b7ta4 PF00063 d1b7ta4 PF00063 0   1584

The true positives and False positives labels will be based on the protein families to which the Query and Target proteins belong to.

I am not getting that how they are plotting the "Proportion of protein with given performance vs ROC50 values". Please let me know, If anyone aware of this problem and how to do it in R .

Thank you so much

ROC Protein R Homology • 946 views

ADD COMMENT • link 13 months ago by siu ▴ 160

0

Entering edit mode

I think you'd first need to know which are true positives and false positives

ADD REPLY • link 13 months ago by Jeremy Leipzig 22k

0

Entering edit mode

I will label a protein (Target) as a true positive if it belongs to the same protein family (Query), as indicated in the 'Query Fam' and 'Target Fam' columns. If the proteins do not belong to the same family, I will label them as false positives."

ADD REPLY • link 13 months ago by siu ▴ 160

score 0 · Answer 1 · 2023-11-02

tab <- read.table(text = "Query   Query_Fam   Target  Target_Fam  eValue  bitScore  hit
d1i50a  PF04997 d1i50a  PF04997 0   2889  TRUE
d1i50b  PF04565 d1i50b  PF04565 0   2420  TRUE
d1i6vd  PF00623 d1i6vd  PF00623 0   2327  TRUE
d1i6vc  PF04563 d1i6vc  PF04563 0   2194  TRUE
d1htya  PF09261 d1htya  PF09261 0   2098  FALSE
d1eula  PF00689 d1eula  PF00689 0   1974  TRUE
d1qbkb  PF03810 d1qbkb  PF03810 0   1801  TRUE
d1ygpa  PF00343 d1ygpa  PF00343 0   1774  TRUE
d1ceza  PF14700 d1ceza  PF14700 0   1774  FALSE
d1fiy   PF00311 d1fiy   PF00311 0   1749  TRUE
d1qgra  PF13513 d1qgra  PF13513 0   1730  FALSE
d2btva  PF01700 d2btva  PF01700 0   1693  TRUE
d1a8i   PF00343 d1a8i   PF00343 0   1683  TRUE
d1em6a  PF00343 d1em6a  PF00343 0   1637  TRUE
d1qm5a  PF00343 d1qm5a  PF00343 0   1623  TRUE
d2mysa2 PF00063 d2mysa2 PF00063 0   1603  TRUE
g1gk9.1 PF01804 g1gk9.1 PF01804 0   1585  TRUE
d1b7ta4 PF00063 d1b7ta4 PF00063 0   1584  FALSE",header=TRUE,stringsAsFactors=FALSE)

# draw an ROC curve
library(ROCR)
pred <- prediction(tab$bitScore, tab$hit)
perf <- performance(pred, "tpr", "fpr")
plot(perf, main = "ROC curve", colorize = TRUE)

# calculate the AUC
auc <- performance(pred, "auc")

#total number of FALSE
total_false <- length(tab$hit[tab$hit==FALSE])

#fpr to get to 50 FP's
fpr.stop <- min(50/total_false,1)

#what is the AUC at the 50th FALSE
AUC50 <- performance(pred, "auc", fpr.stop=fpr.stop)