To my knowledge, protein interactions identified in experiments are usually compared with a reference dataset that is considered a gold standard, and that gold standard interaction dataset is usually assumed to contain true positive and true negative interactions. But how do we create such a gold standard dataset in the first place?
Hi Dr. Dlakic, thank you for answering. I was thinking more about the process of rigorously defining the term 'gold standard dataset'. I am currently using machine learning algorithms to predict de novo interactions from PPI data, and in every subfield of machine learning there is a proper set of rules that is followed to arrive at a gold standard (e.g., in natural language processing, the British National Corpus is considered a gold standard, and its creators followed a protocol to build the dataset). Since the performance of a machine learning algorithm depends heavily on the quality of the data, I was trying to create the gold standard dataset for my project from scratch. That made me wonder what the usual procedure is for creating such datasets for PPI data. I have been looking for relevant literature and found this paper from 2010: https://www.researchgate.net/publication/220173162_From_Experimental_Approaches_to_Computational_Techniques_A_Review_on_the_Prediction_of_Protein-Protein_Interactions . But I am still not sure how the process works.
My answer is the same after reading your added explanation.
I suggest you find a dataset that has already been used in quality publications, and test it with your own methodology. You are not the first to predict PPIs, and it would be helpful to others if your method could be compared with existing ones on the same dataset. If you come up with your own gold standard, it will be debatable whether your performance is truly an improvement or simply a side effect of bias in the dataset you created.
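As a rough illustration of what testing against an existing dataset could look like, here is a minimal Python sketch. It assumes the reference set is available as lists of positive and negative protein pairs and that interactions are undirected; the function names (`normalize`, `evaluate`) and the toy protein identifiers are hypothetical, not taken from any particular resource.

```python
def normalize(pairs):
    """Treat interactions as undirected: (A, B) and (B, A) are the same pair."""
    return {frozenset(pair) for pair in pairs}


def evaluate(predicted, gold_positive, gold_negative):
    """Score predicted interactions against a reference set.

    Predictions that appear in neither reference list are left unscored,
    because the reference set says nothing about them (an assumption of
    this sketch, not a universal convention).
    """
    pred = normalize(predicted)
    pos = normalize(gold_positive)
    neg = normalize(gold_negative)

    tp = len(pred & pos)   # predicted and known to interact
    fp = len(pred & neg)   # predicted but known not to interact
    fn = len(pos - pred)   # known to interact but not predicted

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Toy data with made-up identifiers, for illustration only.
gold_positive = [("A", "B"), ("C", "D"), ("E", "F")]
gold_negative = [("A", "C"), ("B", "D")]
predicted = [("B", "A"), ("C", "D"), ("A", "C"), ("X", "Y")]

print(evaluate(predicted, gold_positive, gold_negative))
```

Reporting these numbers on a dataset that others have already used is what makes a new method directly comparable with published ones.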
So you are suggesting that there are no set rules that all manually curated gold standard datasets follow, and that it's a process of trial and error?
I am suggesting no such thing. Please read carefully what I wrote, and also read through the papers that describe the creation of gold standard datasets.
https://pubmed.ncbi.nlm.nih.gov/14564010/