Hi everyone,
I have a question regarding the ROC50 calculation for protein remote homology detection.
I have gone through several papers:
"Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching"
"https://www.nature.com/articles/srep32333#Sec6"
"https://www.pnas.org/doi/10.1073/pnas.0308067101#sec-1"
"https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1842-2"
I ran a homology search with my queries and have the following table of hits (almost 50,000 rows):
Query Query_Fam Target Target_Fam eValue bitScore
d1i50a PF04997 d1i50a PF04997 0 2889
d1i50b PF04565 d1i50b PF04565 0 2420
d1i6vd PF00623 d1i6vd PF00623 0 2327
d1i6vc PF04563 d1i6vc PF04563 0 2194
d1htya PF09261 d1htya PF09261 0 2098
d1eula PF00689 d1eula PF00689 0 1974
d1qbkb PF03810 d1qbkb PF03810 0 1801
d1ygpa PF00343 d1ygpa PF00343 0 1774
d1ceza PF14700 d1ceza PF14700 0 1774
d1fiy PF00311 d1fiy PF00311 0 1749
d1qgra PF13513 d1qgra PF13513 0 1730
d2btva PF01700 d2btva PF01700 0 1693
d1a8i PF00343 d1a8i PF00343 0 1683
d1em6a PF00343 d1em6a PF00343 0 1637
d1qm5a PF00343 d1qm5a PF00343 0 1623
d2mysa2 PF00063 d2mysa2 PF00063 0 1603
g1gk9.1 PF01804 g1gk9.1 PF01804 0 1585
d1b7ta4 PF00063 d1b7ta4 PF00063 0 1584
The true-positive and false-positive labels will be based on the protein families to which the query and target proteins belong.
I do not understand how they plot the "proportion of proteins with given performance vs. ROC50 values". Please let me know if anyone is familiar with this problem and knows how to do it in R.
Thank you so much
I think you'd first need to know which hits are true positives and which are false positives.
I will label a target protein as a true positive if it belongs to the same protein family as the query, as indicated in the 'Query_Fam' and 'Target_Fam' columns. If the two proteins do not belong to the same family, I will label the target as a false positive.
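With that labelling, one common way to get the plot is: rank each query's hits by score, compute ROC50 per query as (1 / (50 * T)) * sum of t_i, where t_i is the number of true positives ranked above the i-th false positive and T is the total number of true positives, and then plot, for each threshold x between 0 and 1, the proportion of queries whose ROC50 is at least x. Below is a minimal sketch of those steps. It is in Python rather than R, but each step maps onto ordinary R data-frame operations. The function names are made up for illustration, and several choices are assumptions rather than anything taken from the papers: self-hits (Query equal to Target) are dropped, hits are ranked by bitScore, T is counted from the hit list only, and if a query has fewer than 50 false positives the missing ones are treated as ranking below every reported hit.

```python
from collections import defaultdict


def roc_n(labels, n_positives, n=50):
    """ROC_n for one query.

    labels: hit labels sorted best-first (True = same family as the query).
    n_positives: total number of true positives T used for normalisation.
    """
    if n_positives == 0:
        return 0.0
    tp_seen = 0  # true positives ranked above the current hit
    fp_seen = 0  # false positives seen so far
    area = 0     # running sum of t_i over the first n false positives
    for is_tp in labels:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen
            if fp_seen == n:
                break
    # Assumption: false positives that were never reported rank below
    # every reported hit, so each contributes the final true-positive count.
    area += (n - fp_seen) * tp_seen
    return area / (n * n_positives)


def roc_n_per_query(hits, n=50):
    """hits: iterable of (query, query_fam, target, target_fam, bit_score)."""
    by_query = defaultdict(list)
    for query, q_fam, target, t_fam, score in hits:
        if query == target:  # assumption: self-hits are not counted
            continue
        by_query[query].append((score, q_fam == t_fam))
    scores = {}
    for query, rows in by_query.items():
        rows.sort(key=lambda r: r[0], reverse=True)  # highest bit score first
        labels = [is_tp for _, is_tp in rows]
        # Assumption: T is the number of positives in the hit list. Counting
        # family members in the whole search database would be stricter.
        scores[query] = roc_n(labels, sum(labels), n)
    return scores


def proportion_curve(scores, steps=100):
    """(threshold, proportion of queries with ROC_n >= threshold) pairs."""
    values = list(scores.values())
    thresholds = [i / steps for i in range(steps + 1)]
    return [(t, sum(v >= t for v in values) / len(values)) for t in thresholds]
```

Calling `roc_n_per_query` on the rows of the table and passing the result to `proportion_curve` gives the (x, y) pairs for the curve, which can then be drawn with any plotting library. Note that a query whose only hit is itself drops out entirely under the self-hit assumption, so you may want to decide whether such queries should count as ROC50 = 0 instead.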