Negative Phosphosites Dataset Generation/Gathering
8.1 years ago
asdfxzsdf2 ▴ 10

I'm working on a project where I am trying to classify stretches of amino acids as phosphosites or non-phosphosites. I'm getting the phosphosites from PhosphoBase, but I'm not sure how to build a negative training set.

When I randomly sample S/T/Y sites from the PhosphoBase proteins as negatives, regular expressions classify about the same percentage of sites as tyrosine, serine, or threonine phosphosites in the negative set as in the positive set. The regular expressions are my baseline for comparison; they are the general PROSITE patterns for the serine/threonine/tyrosine kinases.
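To make the baseline comparison concrete, here is a minimal sketch of how the match rate could be computed on each set. The motif `[ST].[RK]`, the helper names, and the flank size are all assumptions for illustration, not the actual PROSITE patterns or code used in this project.

```python
import re

# Illustrative motif only (an assumption): S or T, any residue, then R or K.
# Substitute the real PROSITE patterns translated to regex syntax.
MOTIF = re.compile(r"[ST].[RK]")

def window(seq, pos, flank=7):
    """Return the residues within `flank` of 0-based `pos`, padded with '-'."""
    left = seq[max(0, pos - flank):pos]
    right = seq[pos + 1:pos + 1 + flank]
    return left.rjust(flank, "-") + seq[pos] + right.ljust(flank, "-")

def site_matches(win, pattern, flank=7):
    """True if the pattern matches starting exactly at the central residue."""
    return pattern.match(win, flank) is not None

def match_rate(windows, pattern, flank=7):
    """Fraction of windows whose central residue starts a pattern match."""
    if not windows:
        return 0.0
    hits = sum(site_matches(w, pattern, flank) for w in windows)
    return hits / len(windows)
```

Calling `match_rate` on the positive windows and on the negative windows gives the two percentages being compared. If they are nearly equal, either the pattern is not discriminative or the negative set contains many unannotated true sites.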

Any thoughts on how I could generate a better negative dataset?

protein dataset sequence • 1.7k views

See this paper. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4917084/

The authors tried to do something similar to your task.


It's an age-old problem: is a site not phosphorylated, or just not annotated as phosphorylated? In the past, one approach was to assume that these two statements were equivalent. Now there is so much high-throughput proteomic data for some organisms (e.g. yeast) that you can probably assume the statements are truly equivalent in some cases.
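Under that assumption (unannotated means not phosphorylated), drawing negatives amounts to sampling S/T/Y positions that carry no annotation. A rough sketch follows; the function name and input layout are hypothetical, and it makes most sense for proteins from a heavily profiled organism.

```python
import random

def sample_negative_sites(proteins, annotated, n, seed=0):
    """Sample up to n unannotated S/T/Y positions.

    proteins:  dict mapping protein id -> amino acid sequence
    annotated: dict mapping protein id -> set of 0-based phosphosite positions
    Returns a list of (protein_id, position, residue) tuples.
    """
    candidates = [
        (pid, i, aa)
        for pid, seq in sorted(proteins.items())
        for i, aa in enumerate(seq)
        if aa in "STY" and i not in annotated.get(pid, set())
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, min(n, len(candidates)))
```

The sampled sites are only presumed negatives: any of them may be a real phosphosite that has not been observed or curated yet.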

8.0 years ago

From a UniProt FAQ http://www.uniprot.org/help/negative_datasets :

Curating a negative data set requires about as much manual curation as building a positive data set. The absence of an annotation does not mean absence of a function (a true negative). Lack of annotation may simply be due to false negatives: incompleteness either in the state of experiment-derived knowledge of a particular protein's function, or incompleteness in representing that knowledge as annotations, i.e. an entry may not be up-to-date and therefore does not have the positive annotation (yet).

In order to obtain a reliable predictor, we recommend to be extremely conservative when trying to build your set, and in case of doubt contact the UniProt helpdesk about the function or modification you are trying to predict.
