Question

Negative Phosphosites Dataset Generation/Gathering

8.0 years ago

asdfxzsdf2 ▴ 10

I'm working on a project where I am trying to classify stretches of amino acids as phosphosites or non-phosphosites. I'm getting the phosphosites from PhosphoBase, but I'm not sure how to build a negative training set.

When I randomly sample S/T/Y sites from the PhosphoBase proteins to use as negatives, my baseline classifies about the same percentage of sites as serine, threonine, or tyrosine phosphosites in the positive and negative datasets. The baseline is a set of regular expressions from PROSITE (the general patterns for the serine/threonine/tyrosine kinases), and it is what I'm comparing my classifier against.
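For readers who want to reproduce this kind of check, here is a minimal sketch of comparing regex match rates between a positive and a negative set of sequence windows. The two patterns are illustrative placeholders written in the style of PROSITE site patterns, not necessarily the ones used in the question, and `match_rate` and the toy windows are hypothetical names and data made up for this example.

```python
import re

# Placeholder patterns (assumption): swap in the PROSITE-derived
# regular expressions that are actually used as the baseline.
PATTERNS = {
    "S/T": re.compile(r"[ST].[RK]"),
    "Y": re.compile(r"[RK].{2,3}[DE].{2,3}Y"),
}


def match_rate(windows, patterns=PATTERNS):
    """Return the fraction of windows matched by at least one pattern."""
    if not windows:
        return 0.0
    hits = sum(
        1 for w in windows if any(p.search(w) for p in patterns.values())
    )
    return hits / len(windows)


# Toy windows (made up), each centred on an S/T/Y residue.
positives = ["AAKRSARKA", "GGGGSGGGG"]
negatives = ["LLLLTLRLL", "PPPPYPPPP"]

print("positive match rate:", match_rate(positives))
print("negative match rate:", match_rate(negatives))
```

If the two rates come out close, as described above, the regex baseline is not separating the two sets. That can mean the negative set contains many real but unannotated sites, or that the patterns are too permissive to be informative.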

Any thoughts on how I could generate a better negative dataset?

updated 8.0 years ago by Elisabeth Gasteiger ★ 2.4k • written 8.0 years ago by asdfxzsdf2 ▴ 10

See this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4917084/

The authors tried to do something similar to your task.

8.0 years ago by natasha.sernova ★ 4.0k

It's an age-old problem: is a site not phosphorylated, or just not annotated as phosphorylated? In the past, one approach was simply to assume that these two statements were equivalent. Now there is so much high-throughput proteomic data for some organisms (e.g. yeast) that you can probably assume they really are equivalent in some cases.
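Under that assumption, a negative set can be sketched as every S/T/Y residue that is not annotated as phosphorylated, taken only from proteins you believe are well covered by proteomic data. The code below is a rough illustration, not a vetted protocol: the function names, the `proteins` layout, the 0-based positions, the `-` padding and the flank size of 4 are all assumptions made for the example.

```python
import random


def candidate_negatives(sequence, annotated_positions, flank=4):
    """Yield (position, window) for each S/T/Y not annotated as phosphorylated.

    Positions are 0-based. Windows are 2 * flank + 1 long, centred on the
    site, and padded with '-' at the protein termini.
    """
    padded = "-" * flank + sequence + "-" * flank
    for i, aa in enumerate(sequence):
        if aa in "STY" and i not in annotated_positions:
            yield i, padded[i:i + 2 * flank + 1]


def sample_negatives(proteins, n, seed=0):
    """Randomly sample up to n negative windows.

    proteins: dict mapping name -> (sequence, set of annotated positions).
    Only include proteins you consider well covered experimentally.
    """
    pool = [
        (name, pos, window)
        for name, (seq, annotated) in proteins.items()
        for pos, window in candidate_negatives(seq, annotated)
    ]
    rng = random.Random(seed)
    return rng.sample(pool, min(n, len(pool)))


# Toy input (made up): one protein with a known phosphosite at position 1.
proteins = {"P1": ("MSTAYKSG", {1})}
for name, pos, window in sample_negatives(proteins, n=2):
    print(name, pos, window)
```

The quality of such a set depends entirely on how complete the annotation is for the chosen proteins, so the sampling step is the easy part and the choice of proteins is where the care goes.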

8.0 years ago by Neilfws 49k

score 0 · Answer 1 · 2016-11-22

From a UniProt FAQ (http://www.uniprot.org/help/negative_datasets):

Curating a negative data set requires about as much manual curation as building a positive data set. The absence of an annotation does not mean absence of a function (a true negative). Lack of annotation may simply be due to false negatives: incompleteness either in the state of experiment-derived knowledge of a particular protein's function, or incompleteness in representing that knowledge as annotations, i.e. an entry may not be up-to-date and therefore does not have the positive annotation (yet).

In order to obtain a reliable predictor, we recommend to be extremely conservative when trying to build your set, and in case of doubt contact the UniProt helpdesk about the function or modification you are trying to predict.