I want to build a dataset of all known, validated cleavage sites in prepro-hormone/protein precursors (e.g. insulin, neuropeptide precursors) that are cleaved by proprotein convertases (PCs, e.g. furin).
I'm looking for only annotated or verified sites, so I can't just extract a window from any sequence containing a [KR][KR].
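To make clear what I am trying to avoid, the naive approach would look roughly like the sketch below. The function name `dibasic_windows` and the flank size are my own placeholders, not anything from an existing tool. It reports every [KR][KR] pair, whether or not it is a real cleavage site.

```python
import re

def dibasic_windows(seq, flank=5):
    """Yield (position, window) for every [KR][KR] pair in seq.

    position is the 1-based index of the first basic residue.
    flank is a hypothetical window size on each side of the pair.
    """
    # A lookahead is used so that overlapping pairs (e.g. in "KRR") are all found.
    for m in re.finditer(r"(?=[KR][KR])", seq):
        start = m.start()
        lo = max(0, start - flank)
        hi = min(len(seq), start + 2 + flank)
        yield start + 1, seq[lo:hi]
```

For example, `list(dibasic_windows("AAAKRAAA", flank=2))` gives one hit at position 4, with no indication of whether that site is ever cleaved.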
UniProt has a keyword for "dibasic cleavage", but I don't know whether it is applied to pro-hormones with slightly different (non-canonical) cleavage patterns, which are what I'm interested in.
I thought of using the search criterion for "polypeptides" and looking in the sequence annotations for "gaps", but that approach is problematic. (Some sequences have the degraded cleavage site sitting in between the polypeptide and a peptide or chain, but not always. I don't mind filtering them out in advance, but I don't know what to filter for.)
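The "gaps" idea could be sketched as follows. This is only a minimal illustration: it assumes the annotated features have already been pulled out as `(kind, start, end)` tuples with 1-based inclusive coordinates, and the function name `gaps_between_features` is hypothetical. It does not solve the problem described above, since entries without such a gap are simply missed.

```python
def gaps_between_features(seq, features):
    """Return (gap_start, gap_end, residues) for uncovered stretches
    lying between consecutive annotated features.

    features: iterable of (kind, start, end), 1-based inclusive (assumed layout).
    """
    ordered = sorted(features, key=lambda f: (f[1], f[2]))
    gaps = []
    covered_to = None  # last residue covered by any feature seen so far
    for _kind, start, end in ordered:
        if covered_to is not None and start > covered_to + 1:
            # Residues covered_to+1 .. start-1 are not part of any feature.
            gaps.append((covered_to + 1, start - 1, seq[covered_to:start - 1]))
        covered_to = end if covered_to is None else max(covered_to, end)
    return gaps
```

With a toy sequence `"SSSSPPPPKRCCCC"` and features `("SIGNAL", 1, 4)`, `("PROPEP", 5, 8)`, `("PEPTIDE", 11, 14)`, the only gap reported is residues 9-10, `"KR"`.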
So, how can I get a good, large dataset of dibasic cleavage locations on prepropolypeptides?
(I am aware of CutDB and MEROPS, but I've never worked with them before and don't know how to download them and extract cleavage sites. The ProP and NeuroPred datasets are out of date, or very small and buggy.)
Tips on how to easily get the cleavage sites (and their locations) would also be great. What's the easiest format to use when downloading from UniProt, and how do I parse it for the cleavage location on the sequence?
Thanks!