Question

Will the real kinases please stand up?

2

10.3 years ago

cdsouthan ★ 1.9k

In 2008 http://www.ncbi.nlm.nih.gov/pubmed/18436524 reported 480 classical and 24 atypical human protein kinases. Good paper but no supp data listing.

So if I query GO "protein serine/threonine/tyrosine kinase activity" (GO:0004712) today, I get 577:

http://www.uniprot.org/uniprot/?query=go%3A0004672+AND+reviewed%3Ayes+AND+organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22&sort=score

Some obvious false positives come in, such as neuropilin: http://www.uniprot.org/uniprot/O14786

As an alternative, the current Swiss-Prot human kinase list has 522 entries (http://www.uniprot.org/docs/pkinfam.txt)

And if I select the protein kinase domain (IPR000719), I get 481:

http://www.uniprot.org/uniprot/?query=reviewed%3Ayes+AND+organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22+AND+database%3A%28type%3Ainterpro+IPR000719%29&sort=score

So why are these numbers so different? (OK so I guess IPR000719 is just the "classical" set)

kinase Swiss-Prot • 2.9k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by cdsouthan ★ 1.9k

1

Have you looked at the differences between the sets? Also make sure you're counting properly. For example, the UniProt file has 522 lines with kinases, but some of these lines list only a human or only a mouse kinase. So there are 522 kinases that exist in either the human or the mouse genome, but some human kinases have no mouse homolog and, conversely, some mouse kinases have no human homolog.
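As a rough illustration of what looking at the differences between sets could involve, here is a minimal Python sketch. The list names (`go`, `pkinfam`, `interpro`) and the placeholder IDs are assumptions made up for the example, not real accessions; in practice each set would be loaded from the corresponding UniProt download.

```python
from itertools import combinations


def compare_sets(named_sets):
    """Return shared and list-specific IDs for every pair of named ID sets."""
    report = {}
    for (name_a, a), (name_b, b) in combinations(named_sets.items(), 2):
        report[(name_a, name_b)] = {
            "shared": a & b,            # IDs present in both lists
            "only_" + name_a: a - b,    # IDs unique to the first list
            "only_" + name_b: b - a,    # IDs unique to the second list
        }
    return report


# Placeholder IDs (hypothetical); replace with accessions from each source.
lists = {
    "go": {"ID1", "ID2", "ID3", "ID4"},
    "pkinfam": {"ID2", "ID3", "ID5"},
    "interpro": {"ID2", "ID3"},
}

report = compare_sets(lists)
for pair, parts in report.items():
    print(pair, {key: sorted(ids) for key, ids in parts.items()})
```

Inspecting the list-specific IDs, rather than only the counts, is what would show whether a discrepancy comes from false positives, atypical kinases, or a counting artifact.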

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Jean-Karim Heriche 27k

0

I can do the intersects and diffs of course, but was hoping for some explanations from UniProt first. I politely posit that one of their jobs is to facilitate clean retrievals of key sets like this. Yes, I spotted the odd formatting in the 600 lines of pkinfam.txt pasted into Excel, so I sliced out just the 522 human IDs (why not offer a CSV download?), but it would be useful if there were a clean select for these in the interface.
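For what it's worth, slicing out the human IDs need not go through a spreadsheet. The sketch below assumes each kinase line in pkinfam.txt carries a gene name followed by optional `NAME_HUMAN (accession)` and `NAME_MOUSE (accession)` columns; that layout is an assumption and should be checked against the actual file. The sample lines and accessions are hypothetical.

```python
import re

# Assumed column pattern: an entry name ending in _HUMAN or _MOUSE,
# followed by the accession in parentheses.
ENTRY = re.compile(r"(\w+)_(HUMAN|MOUSE)\s+\(\s*([A-Z0-9]+)\s*\)")


def accessions_by_species(text):
    """Collect accessions per species from text in the assumed layout."""
    found = {"HUMAN": [], "MOUSE": []}
    for line in text.splitlines():
        for _name, species, accession in ENTRY.findall(line):
            found[species].append(accession)
    return found


# Hypothetical lines in the assumed layout (not real entries).
sample = """\
DEMO1       DEMO1_HUMAN (X00001)    DEMO1_MOUSE (X00002)
DEMO2       DEMO2_HUMAN (X00003)
DEMO3                               DEMO3_MOUSE (X00004)
"""

result = accessions_by_species(sample)
print(len(result["HUMAN"]), "human;", len(result["MOUSE"]), "mouse")
```

Counting per species this way also sidesteps the issue of lines that list only a human or only a mouse kinase.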

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by cdsouthan ★ 1.9k

Ram · Answer 1 · 2015-02-20

0

10.3 years ago

Malachi Griffith 20k

DGIdb has aggregated a couple of lists of kinases:

Kinases

Tyrosine Kinases

You can slice and dice them a few different ways. You can also export as a simple TSV file or access them via the DGIdb API.

The list from the dGene publication has been vetted more than some of the other lists out there.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Malachi Griffith 20k

0

Notification appreciated, thanks, but this list also exemplifies the problem rather than the solution. The intersect of 821 gene symbols from above with 513 from GO:0004712 was only 109 (i.e. there is no provenanced vetting of any lists out there?)

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by cdsouthan ★ 1.9k

1

To my knowledge, no, there is no provenanced vetting of kinase lists out there. There are various kinase databases and efforts such as that described in the dGene publication. These arrive at different solutions for various reasons and, for the most part, do not have much of what you would consider formal provenance/tracking. In general, this kind of activity is very difficult to get funded, despite its high value and the efficiencies it could promote. This is perhaps very slowly starting to change. For example, the NIH released this RFA: http://grants.nih.gov/grants/guide/rfa-files/RFA-RM-13-011.html "Development of a Knowledge Management Center for Illuminating the Druggable Genome (U54)".

All that aside, one reason these lists differ so much is that their composition is still being debated. Some of the shorter lists are full of kinases that are very well established as having kinase activity. The longer lists, such as those from GO, contain kinases with much more speculative evidence (e.g. based on sequence homology). Some of these kinases are at this point a hypothesis.

The RFA I list above was accompanied by another (http://grants1.nih.gov/grants/guide/rfa-files/RFA-RM-13-010.html) that had the goal of "unveiling of the functions of the poorly characterized and/or un-annotated members in four protein classes of the Druggable Genome". Yes, this gene family has been heavily studied for many years, but we have mostly been proposing conservative (i.e. fundable) research on the same handful of kinases over and over. This RFA was aimed at getting into the dark matter of the Druggable Genome (where kinases are a major player). Together these RFAs seem like a great idea. Only time will tell what comes of them.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Malachi Griffith 20k

0

Don't rely on gene symbols; these tend to change over time. They certainly changed between 2008 and now. On top of this, the same symbol has been used for different kinases: for example, STK36 was used for what is now officially AURKB, and STK36 is now the official symbol of the fused homolog previously known as FU. Depending on when a list was compiled, one symbol or another may be used for the same gene. In general, I don't advise working directly with gene identifiers from disparate sources. The strategy I use to deal with this is to always start by mapping genes from all my data sources to a common reference (e.g. a specific Ensembl version) and proceed from there. Not all genes make it through the procedure, but at least I get traceable data. Occasionally, I "rescue" genes/proteins by mapping their sequences to Ensembl because the identifier used is not traceable.
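A minimal sketch of that mapping step, assuming an alias table that translates every known symbol to one stable reference ID. The symbols (`KINA`, `OLDKINA`, ...) and IDs (`REF0001`, ...) are hypothetical placeholders; a real table would be built from a specific release of the chosen reference.

```python
def map_to_reference(symbols, alias_to_ref):
    """Map symbols to reference IDs; return (mapped, unmapped)."""
    mapped, unmapped = {}, []
    for symbol in symbols:
        ref = alias_to_ref.get(symbol.upper())
        if ref is None:
            unmapped.append(symbol)  # keep for manual or sequence-based rescue
        else:
            mapped.setdefault(ref, []).append(symbol)
    return mapped, unmapped


# Hypothetical alias table: two symbols point at the same reference gene.
alias_to_ref = {"KINA": "REF0001", "OLDKINA": "REF0001", "KINB": "REF0002"}

old_list = ["OLDKINA", "KINB", "KINX"]  # hypothetical older list
new_list = ["KINA", "KINB"]             # hypothetical current list

mapped_old, unmapped_old = map_to_reference(old_list, alias_to_ref)
mapped_new, unmapped_new = map_to_reference(new_list, alias_to_ref)

shared = set(mapped_old) & set(mapped_new)
print("shared reference IDs:", sorted(shared))
print("unmapped from old list:", unmapped_old)
```

Comparing by raw symbol, the two toy lists share only one entry; after mapping they share two reference IDs. A flat alias table like this cannot handle a symbol that referred to different genes at different times, so the table would have to match the date at which each list was compiled.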

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Jean-Karim Heriche 27k

0

Hmmm, I agree on the problems, but the point is that, at least between UniProt, HGNC and Ensembl (i.e. Hinxton), these aspects for major human gene families (10 years post-completion) should be getting locked down by now, i.e. we shouldn't need to continually go round in x-mapping circles, e.g.

question on genes with ensembl gene ID, but without associated gene name and corresponding Entrez ID

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by cdsouthan ★ 1.9k