I'm working on a project where I'm trying to classify stretches of amino acids as phosphosites or non-phosphosites. I'm getting the phosphosites from PhosphoBase, but I'm not sure how to build a negative training set.
When I randomly sample S/T/Y sites from the PhosphoBase proteins as negatives, my baseline classifies about the same percentage of sites as tyrosine/serine/threonine phosphosites in the positive and negative datasets. The baseline uses regular expressions from PROSITE (the general patterns for the serine/threonine/tyrosine kinases).
Any thoughts on how I could generate a better negative dataset?
See this paper. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4917084/
The authors tried to do something similar to your task.
It's an age-old problem: is a site not phosphorylated, or just not annotated as phosphorylated? In the past, one approach was simply to assume that these two statements were equivalent. Now there is so much high-throughput proteomic data for some organisms (e.g. yeast) that you can probably assume the statements are truly equivalent in some cases.
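If you go with that assumption, a minimal sketch of building the negative set could look like the code below. It treats every S/T/Y residue that is not annotated as a phosphosite as a negative and cuts a fixed-width window around it. The function name `sample_negative_windows`, the `flank` width of 7, the `-` padding character, and the input layout (a dict of sequences plus a dict of 0-based annotated positions) are all my own assumptions for illustration, not anything from PhosphoBase or the paper linked above.

```python
import random

def sample_negative_windows(proteins, positives, flank=7, n=None, seed=0):
    """Return (protein_id, position, window) for unannotated S/T/Y sites.

    proteins:  dict mapping protein id -> amino acid sequence (hypothetical layout)
    positives: dict mapping protein id -> set of 0-based annotated phosphosite positions
    flank:     residues kept on each side of the central S/T/Y
    n:         optional number of sites to subsample at random
    """
    rng = random.Random(seed)

    # Collect every S/T/Y that is NOT annotated as phosphorylated.
    candidates = []
    for pid, seq in proteins.items():
        known = positives.get(pid, set())
        for i, aa in enumerate(seq):
            if aa in "STY" and i not in known:
                candidates.append((pid, i))

    # Optionally subsample, e.g. to match the size of the positive set.
    if n is not None and n < len(candidates):
        candidates = rng.sample(candidates, n)

    # Cut a window around each site, padding with '-' at the sequence ends.
    windows = []
    for pid, i in candidates:
        seq = proteins[pid]
        left = seq[max(0, i - flank):i].rjust(flank, "-")
        right = seq[i + 1:i + 1 + flank].ljust(flank, "-")
        windows.append((pid, i, left + seq[i] + right))
    return windows


if __name__ == "__main__":
    toy_proteins = {"P1": "MASTKYLLS"}   # made-up sequence
    toy_positives = {"P1": {2}}          # pretend the S at index 2 is annotated
    for row in sample_negative_windows(toy_proteins, toy_positives, flank=2):
        print(row)
```

This only makes sense for proteins from an organism with deep phosphoproteomic coverage. Otherwise many of the "negatives" will be unannotated true sites, which may be why random sampling gave you the same regular-expression hit rate in both sets.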