Question

For virtual screening, how to find bioactivity data for protein targets with no bioactivity data available?

5.0 years ago

entropy ▴ 50

Hi,

Using ChEMBL bioactivity data (mainly SMILES), I trained my deep learning (DL) model for hit detection against my target proteins via virtual screening. However, ChEMBL does not have compound bioactivity data for all the proteins. Some of my proteins of interest are not available in either ChEMBL or PubChem.

I need some bioactivity data for those proteins to fine-tune my DL model. Is there a way I can still use my DL model to predict hits for those proteins? Essentially, can you suggest how I can get alternative bioactivity data for them? I was thinking of using bioactivity data from proteins similar to my target proteins of interest. If this is the right approach, can you suggest tutorials that demonstrate protein similarity search, if possible programmatically?

thanks

compound protein drug target hit detection ML • 1.4k views

updated 5.0 years ago by matteoferla ▴ 30 • written 5.0 years ago by entropy ▴ 50

score 2 · Accepted Answer · 2019-12-11

Okay, I'll split my answer in two, as you are asking two things. (I am assuming that by "all the proteins" you mean all the proteins you care about, not, say, the whole human proteome.)

Find homologues

To find homologues of a protein, BLAST is the best tool. It is usually used through its web interface, but there is also a command-line version that runs locally, provided you download the protein database (which is large), and a RESTful API with many wrappers, including one in Biopython. Proteins diverge most on the surface, so similarity above 70% is still okay for your hits, but lower might be iffy. A short match with less than 20% similarity is likely junk. Proteins are also grouped into families: Pfam IDs are a great resource for this and appear in the UniProt entry for a protein, UniProt being one of the best general-purpose protein databases.
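As a rough sketch of how the thresholds above could be applied programmatically: the snippet below parses BLAST+ tabular output (as produced with `-outfmt 6`) and labels each hit. The function name `classify_hits`, the 50-residue cut-off for a "short" match and the example hits are all my own assumptions for illustration, and the code uses percent identity (`pident`) as a stand-in for "similarity".

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def classify_hits(tabular_text, good_identity=70.0, junk_identity=20.0,
                  short_length=50):
    """Label each hit as 'usable', 'uncertain' or 'likely junk'.

    short_length is an assumed cut-off for what counts as a short match;
    the answer above does not give a number.
    """
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=BLAST_COLUMNS, delimiter="\t")
    results = []
    for row in reader:
        identity = float(row["pident"])
        length = int(row["length"])
        if identity >= good_identity:
            label = "usable"
        elif identity < junk_identity and length < short_length:
            label = "likely junk"
        else:
            label = "uncertain"
        results.append((row["sseqid"], identity, length, label))
    return results


# Hypothetical hits, made up purely for illustration.
example = (
    "query1\thomolog_A\t85.2\t310\t46\t0\t1\t310\t5\t314\t1e-150\t520\n"
    "query1\thomolog_B\t42.0\t280\t150\t4\t10\t289\t3\t280\t2e-60\t210\n"
    "query1\thomolog_C\t18.5\t35\t28\t1\t100\t134\t40\t74\t3.1\t25.0\n"
)

for hit in classify_hits(example):
    print(hit)
```

Hits labelled "usable" would be the ones whose ChEMBL bioactivity data you might borrow; anything "uncertain" deserves a manual look at the alignment, ideally around the binding site.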

Alt data

If neither ChEMBL nor PubChem is working for you, you might need to get creative.

When you say you did a virtual screen, did you dock a library and potentially cross-validate the scores? Or did you instead split up your molecules by functional group and properties (a QSAR kind of thing) and use that data to train a model to predict the bioactivity outcome where available? In the latter case, for a target protein where you don't have bioactivity data, you could do docking to see whether the docking scores correlate with your model's predictions. To do that, you might need to make a structural model of your protein if a 3D structure is not available in the PDB. You might find a ready-made homology model in SWISS-MODEL (Expasy); if not, be aware that a model generated with I-TASSER or Phyre (threading/ab initio) will be no good for docking.
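One way to check whether docking scores and model predictions correlate is a rank correlation, since the two are on different scales. Below is a minimal sketch of Spearman's coefficient in plain Python. The function names and the numbers are hypothetical, and I am assuming the usual docking convention that more negative scores mean better predicted binding, so a useful agreement with predicted activity (e.g. pIC50) shows up as a strongly negative value.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation, i.e. Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for five compounds, for illustration only.
docking_scores = [-9.1, -7.4, -8.2, -5.0, -6.3]   # more negative = better
predicted_pic50 = [7.8, 6.1, 6.9, 4.5, 5.2]       # higher = more active

print(spearman(docking_scores, predicted_pic50))
```

With only a handful of compounds the coefficient is very noisy, so treat it as a sanity check rather than a validation.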

Alternatively, and this is a total long shot: if your protein is an enzyme and you just need a few targets, BRENDA is a good resource for finding the parameters of native metabolites and promiscuous activities. These can be used as positive controls for any virtual screen, although metabolites aren't very drug-like (sensu Lipinski's rule of five). That is, whereas UniProt, MetaCyc and other sites just give the physiological substrate, BRENDA will tell you about similar compounds that still bind, but badly.
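To make the "not very drug-like" point concrete, here is a small sketch of a rule-of-five check. It assumes you have already computed the four descriptors with a cheminformatics toolkit of your choice; the `Descriptors` class, the function names and the example values are hypothetical, and the "at most one violation" tolerance is the common reading of the rule rather than anything specific to this thread.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d):
    """Return the list of rule-of-five criteria that the compound breaks."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if d.logp > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("more than 5 H-bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


def is_drug_like(d, max_violations=1):
    """Common reading of the rule: tolerate at most one violation."""
    return len(lipinski_violations(d)) <= max_violations


# Illustrative descriptor values, not measured data.
polar_metabolite = Descriptors(mol_weight=507.0, logp=-5.7,
                               h_bond_donors=7, h_bond_acceptors=17)
small_ligand = Descriptors(mol_weight=320.0, logp=2.5,
                           h_bond_donors=2, h_bond_acceptors=5)

print(lipinski_violations(polar_metabolite), is_drug_like(polar_metabolite))
print(lipinski_violations(small_ligand), is_drug_like(small_ligand))
```

A metabolite failing this check can still serve as a positive control for the screen; it just tells you not to expect it to look like the rest of your library.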