Question

Understanding Kernel Ridge Regression for Drug-Target Interaction

8.6 years ago

' ▴ 330

I am trying to understand how kernel ridge regression (KRR) works for drug-protein interaction prediction, and many aspects of it confuse me.

Suppose I have the following data set of drug-protein interactions. The values show how tightly a drug binds to a target; some of the interactions are missing (NaN), and those are the ones I am trying to predict. The numbers shown here are made up purely for the sake of explanation, since I cannot copy the entire data set (it contains 100 drugs and 100 proteins).

           [,Protein1] [,Protein2] [,Protein3] [,Protein4] [,Protein5] [,Protein6]
[Drug1,]      6.763232     8.97455       5.655   3.3245454         NaN   3.9232321
[Drug2,]      1.211123     2.34343       9.344         NaN      5.6445       4.343
[Drug3,]     1.3429286   2.8805642   6.1998635         NaN    2.328635     9.34343
[Drug4,]     6.5210577   7.1228635         NaN   4.1228635   4.9998635    6.002805
[Drug5,]           NaN   0.9230754     8.34343     9.09098     7.66575      3.9900
[Drug6,]     1.2167197   0.6700215       0.999         NaN       5.553     1.34343

The approach used in drug discovery is then to compute similarities between proteins and similarities between drugs.

Therefore, there is a Drug Kernel computed to show similarities between all drugs (e.g. from online databases).

          [,Drug1]  [,Drug2]  [,Drug3]  [,Drug4]  [,Drug5]  [,Drug6]
[Drug1,]     6.454     8.788     5.655 3.3245454   3.32233 3.9232321
[Drug2,]  6.211123   7.34343     9.344    1.2121    5.6445     4.343
[Drug3,] 5.3429286 2.8805642 6.1998635    6.7765  2.328635   9.34343
[Drug4,] 4.5210577 1.1228635      7.34 2.1228635 3.9998635  5.002805
[Drug5,]      9.34 0.9230754   1.34343   9.09098   7.66575    3.9900
[Drug6,] 1.2167197 0.6700215     1.999      1.23     5.553   1.34343

And then protein similarities are computed based on some approach. This matrix will be the Protein Kernel.

            [,Protein1] [,Protein2] [,Protein3] [,Protein4] [,Protein5] [,Protein6]
[Protein1,]          50          80          90          10          20          30
[Protein2,]          60          70          10          10          35          75
[Protein3,]          99          89          51          69          48          10
[Protein4,]          10          54          68          97          64          17
[Protein5,]          60          58          95          64          10          16
[Protein6,]          88          14          97          63          63          10

Then the Kronecker product of the Drug Kernel and the Protein Kernel is computed, which directly relates drug-protein pairs to each other.

Here K is the resulting Kronecker product matrix. It is a much bigger matrix: in this case, with 6 proteins and 6 drugs, K is a 36 x 36 matrix, with one row and one column per drug-protein pair.
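As a small illustration of how the Kronecker product lines up with drug-protein pairs, here is a minimal base-R sketch. The kernels are made-up toy values (2 drugs, 3 proteins), and `pair_index` is a hypothetical helper introduced only for this example.

```r
# Toy symmetric kernels (made-up values): 2 drugs, 3 proteins.
Kd <- matrix(c(1.0, 0.4,
               0.4, 1.0), nrow = 2, byrow = TRUE,
             dimnames = list(c("Drug1", "Drug2"), c("Drug1", "Drug2")))
Kp <- matrix(c(1.0, 0.2, 0.5,
               0.2, 1.0, 0.3,
               0.5, 0.3, 1.0), nrow = 3, byrow = TRUE,
             dimnames = list(paste0("Protein", 1:3), paste0("Protein", 1:3)))

# Pairwise kernel: one row and one column per (drug, protein) pair.
K <- kronecker(Kd, Kp)
dim(K)  # 6 x 6

# With kronecker(Kd, Kp) the protein index varies fastest, so the pair
# (drug i, protein k) sits at position (i - 1) * n_proteins + k.
# pair_index is a hypothetical helper, not part of any package.
pair_index <- function(i, k, n_proteins = nrow(Kp)) (i - 1) * n_proteins + k

# The entry relating (Drug1, Protein2) to (Drug2, Protein3) is the product
# of the drug similarity and the protein similarity:
K[pair_index(1, 2), pair_index(2, 3)]
Kd[1, 2] * Kp[2, 3]
```

The ordering matters later: a label vector built row-wise from the drugs x proteins affinity matrix, e.g. `as.vector(t(Y))`, follows the same pair order as `kronecker(Kd, Kp)`, whereas `as.vector(Y)` would match `kronecker(Kp, Kd)` instead.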

Now the alpha coefficients for kernel ridge regression are computed with the following formula.

K is the kernel matrix that relates drug-target pairs (therefore, the Kronecker product). y is the vector of labels (binding affinities); I assume it is just the vectorized version of the very first matrix in this post, i.e. the drug-protein interaction matrix. Is this correct? I is the identity matrix (of the same size as the kernel matrix), and lambda is the regularization parameter, preferably set to 0.1.
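For reference, the standard closed-form solution for the KRR dual coefficients, written with the symbols just described, is given below. The symbol n (the number of training drug-protein pairs) is added here for clarity.

```latex
\alpha = (K + \lambda I)^{-1} y,
\qquad K \in \mathbb{R}^{n \times n},\quad y \in \mathbb{R}^{n}
```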

Up to here, I have been able to do everything in R. But my problem starts when I have to do the actual prediction. I do not understand the idea behind KRR, or how to predict those NaN values from the values in the Kronecker product matrix K.

The KRR prediction g(x) for a test point is computed with the following formula,

where x is a test point and x_i’s are training points
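In its standard form, the KRR prediction rule matching this description is a sum over the n training points, weighted by the alpha coefficients:

```latex
g(x) = \sum_{i=1}^{n} \alpha_i \, K(x_i, x)
```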

My biggest confusion here is: WHAT should I actually put in place of x and x_i? Out of all the matrices I have, which one is x in the formula above, and which one contains the x_i values? And how can the values in the K matrix be the basis for predicting the values in the very first matrix here?

Any help and guidance will be greatly appreciated, as I am very confused about how KRR works, especially for drug-target interaction with Kronecker products. Any input here is really welcome.

(http://arxiv.org/pdf/1601.01507.pdf is a paper analyzing what I am trying to do, i.e. relating drugs to proteins by Kronecker products and then applying KRR; reading the whole paper didn't really clear anything up for me.)

=#############################=

EDIT #1

My drug-target interaction data looks like this (it contains 100 proteins [columns; UniProt IDs] and 100 drugs [rows; ChEMBL IDs]). About 18% of the binding affinities are missing, and I am trying to predict those.

[Screenshot: drug-target interaction data frame]

I calculated the similarity between every pair of drugs and collected these into one matrix, and did the same for every pair of proteins to get another matrix. The drug similarities are OpenBabel-fingerprint-based Tanimoto kernels (I take this matrix as my Drug Kernel), and the protein similarities are pairwise Smith-Waterman scores (I take this matrix as my Protein Kernel). Then I took the Kronecker product of these two base kernels, which gave a 10,000 x 10,000 matrix that looks like this:

[Screenshot: Kronecker product matrix]
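To make the drug-kernel step concrete, here is a minimal base-R sketch of a Tanimoto similarity matrix computed from binary fingerprints. The fingerprint bits are made up, and `tanimoto_kernel` is a hypothetical helper written for this example; it is not the ChemmineR or OpenBabel routine used for the real data.

```r
# Hypothetical binary fingerprints: one row per drug, one column per bit.
fp <- matrix(c(1, 1, 0, 1, 0,
               1, 0, 0, 1, 1,
               0, 1, 1, 0, 0), nrow = 3, byrow = TRUE,
             dimnames = list(paste0("Drug", 1:3), NULL))

# Tanimoto(a, b) = |a AND b| / |a OR b|, computed for every pair of rows.
# Assumes every fingerprint has at least one bit set (no division by zero).
tanimoto_kernel <- function(fp) {
  shared <- fp %*% t(fp)   # number of bits set in both fingerprints
  n_set  <- rowSums(fp)    # number of bits set in each fingerprint
  shared / (outer(n_set, n_set, "+") - shared)
}

Kd <- tanimoto_kernel(fp)
isSymmetric(Kd)  # a kernel matrix should be symmetric
diag(Kd)         # each drug has similarity 1 with itself
```

The result is a drugs x drugs matrix that can be used directly as the first argument of `kronecker()`.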

kernel regression r ridge kernel trick • 3.0k views

ADD COMMENT • link updated 4.1 years ago by casey ▴ 20 • written 8.6 years ago by ' ▴ 330

@Lazarus How did you come up with this data table? How were the values measured? What are the relations between those drugs and those proteins?

ADD REPLY • link 8.5 years ago by Learner ▴ 280

@Learner: Which of the data tables are you specifically referring to? Do you mean the first one, containing drugs and proteins and their binding affinities? If so, that one was taken directly from an experimental study's dataset (Metz et al. (2013)). But if you are referring to the protein-protein similarity matrix, that data table was computed in R based on Smith-Waterman alignment. Drug-drug similarities were also computed in R, using fingerprintOB and fpSim (from ChemmineR).

ADD REPLY • link 8.5 years ago by ' ▴ 330

@Lazarus This seems very interesting to me! I have an idea for using this data. Can you please let me know whether the original data was taken from the "Navigating the kinome" paper? I am talking about the first data table.

ADD REPLY • link 8.5 years ago by Learner ▴ 280

4.1 years ago

casey ▴ 20

I know it's been a while since this was posted, but just in case it is helpful: here is a tool for drug-protein binding-affinity prediction that takes in a SMILES string and an amino acid sequence and returns an integer that represents the binding affinity. Here is the link to the model overview: https://model.modelforest.ai/binding-model

ADD COMMENT • link 4.1 years ago by casey ▴ 20

score 5 · Accepted Answer · 2016-05-09

8.6 years ago

Jean-Karim Heriche 27k

In your case, y is indeed the vector of affinities. Your training set should be composed only of those drug-protein pairs for which you have an entry in y, i.e. when computing alpha, K should not have entries for the drug-protein pairs you want to predict.

What you want to predict is a new entry in y, so g(x) is this new entry and x corresponds to the drug-protein pair you're trying to make a prediction for. x_i is any of the drug-protein pairs from the training set. You can view your matrix K as a matrix of similarities between drug-protein pairs. So to predict, you compute the weighted sum (with the alpha_i as weights) of the similarities between your drug-protein pair of interest and all the others from the training set. For a given i, K(x_i,x) is the entry in K corresponding to x_i (the drug-protein pair of index i) and x (the drug-protein pair you're making a prediction for).
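The weighted sum described here can be written out term by term for a single unknown pair. This is a minimal base-R sketch on made-up toy data (2 drugs, 2 proteins, one NaN); the value lambda = 0.1 is the one mentioned in the question, and all variable names are chosen for the example.

```r
# Made-up toy data: 2 drugs x 2 proteins, one unknown affinity.
Y  <- matrix(c(6.8, 9.0,
               1.2, NaN), nrow = 2, byrow = TRUE)
Kd <- matrix(c(1.0, 0.3, 0.3, 1.0), nrow = 2)  # drug kernel (made up)
Kp <- matrix(c(1.0, 0.5, 0.5, 1.0), nrow = 2)  # protein kernel (made up)

K <- kronecker(Kd, Kp)     # 4 x 4, one row/column per (drug, protein) pair
y <- as.vector(t(Y))       # row-wise, to match the kronecker(Kd, Kp) ordering

train <- which(!is.na(y))  # pairs with a measured affinity (the x_i)
test  <- which(is.na(y))   # the pair to predict (x)

lambda <- 0.1              # regularization value suggested in the question
alpha  <- solve(K[train, train] + lambda * diag(length(train)), y[train])

# g(x) = sum_i alpha_i * K(x_i, x), accumulated one training pair at a time
g <- 0
for (i in seq_along(train)) {
  g <- g + alpha[i] * K[train[i], test]
}
g  # predicted affinity for (Drug2, Protein2)
```

The loop is only there to mirror the formula; `sum(alpha * K[train, test])` gives the same number.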

ADD COMMENT • link 8.6 years ago by Jean-Karim Heriche 27k

This really cleared up many things! Thanks a bunch. I am still quite stuck on the programming side of it. Most importantly, I am not certain how to format my data, or maybe the KRR implementations I have found are simply not designed for this purpose (i.e. DTI prediction based on Kronecker products). To give a better overview of my dataset and what I have so far, I have included screenshots of the data frames under section EDIT #1 of my original post. There are a few more questions I wish to ask:

1- As you have also confirmed, y needs to be an n x 1 vector of the data shown in my first screenshot under section EDIT #1, and the entries with NaN values should NOT be included. But on the programming side, is it necessary that the n x 1 vector includes the protein and drug IDs as its row headers? (My 4th question highlights my confusion here.)
2- What should the format of the test set (the set including those NaNs I am trying to predict) look like? E.g. should it be a data frame containing only the protein IDs and drug IDs for which the binding affinities are NaN? (Again, question 4 shows why I'm asking this in particular.)
3- Are there perhaps any implementations out there that can actually be used for this particular problem with Kronecker products? I have found KRRLearner from the CVST package, and this Matlab implementation. But it looks like these do not work for this particular problem.
4- Most of the implementations I have come across are quite similar, but with all of them I have problems with the data input, and I am not sure which subset of my datasets I should provide for each part. For instance, taking the Matlab implementation, is it correct to say:
in_data = the entire K matrix containing the Kronecker products, perhaps?
out_data = y, an n x 1 vector made out of the DTI data (first screenshot) without the NaN values
test_data = ? This is the most problematic part. I have absolutely no idea what is required here.
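One possible way to lay out the data for questions 1 and 2 is a long table with one row per drug-protein pair, so the IDs travel with the labels. This is only a sketch in base R with made-up IDs and affinities; the names `pairs`, `train_pairs` and `test_pairs` are hypothetical.

```r
# Made-up affinity matrix with drug and protein IDs as dimnames.
Y <- matrix(c(6.8, 9.0, NaN,
              1.2, NaN, 4.3), nrow = 2, byrow = TRUE,
            dimnames = list(c("DrugA", "DrugB"), c("ProtX", "ProtY", "ProtZ")))

# One row per (drug, protein) pair. Listing protein first in expand.grid makes
# the protein vary fastest, the same order as the rows of kronecker(Kd, Kp).
pairs <- expand.grid(protein = colnames(Y), drug = rownames(Y),
                     stringsAsFactors = FALSE)
pairs$affinity <- as.vector(t(Y))  # row-wise vectorization of Y

# Pairs with a measured affinity form the training labels;
# pairs with NaN are the ones to predict.
train_pairs <- pairs[!is.na(pairs$affinity), ]
test_pairs  <- pairs[is.na(pairs$affinity), c("drug", "protein")]

y_train <- train_pairs$affinity
test_pairs
```

The row numbers of `pairs` double as indices into a K built with `kronecker(Kd, Kp)`, so the IDs are not needed inside the linear algebra itself, only for reading the predictions back.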

I am sorry for my elementary questions, I am extremely new to Machine Learning, in fact this is my first Machine Learning project.

ADD REPLY • link 8.6 years ago by ' ▴ 330

1- Whether to keep headers or not depends on the implementation. However, the computation relies on items having the same index in all vectors/matrices, i.e. y1 should be the same drug-protein pair as the first row/column of K.
2- Once you've computed K for all drug-protein pairs, the training set K_train is formed by removing the rows and columns corresponding to elements that have NaN in your target vector. The test set K_test, i.e. the K(x_i,x) in the equation above, corresponds to the columns of K for items with NaN in y, keeping only the rows of the training pairs.
3- The Matlab code you linked to expects as input a matrix of data points where each point is a vector, and computes a Gaussian kernel from these vectors. In your case, you already have the kernel. As far as I can tell, the CVST package also computes a kernel from the input data. You don't need a package. It's just linear algebra.
4- Once you have K_train as above, you can get the parameters simply with (in R): alpha<-solve(K_train,y_train) and the predictions with g<-alpha%*%K_test
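Putting the four points together, a minimal end-to-end sketch in base R could look like the following. The kernels and affinities are randomly generated stand-ins, not real data. Note that `solve(K_train, y_train)` as written in the reply corresponds to lambda = 0; the sketch adds the lambda * I ridge term described in the question, with lambda = 0.1, which is an assumption about the intended setup.

```r
set.seed(1)
n_d <- 4  # number of drugs (toy size)
n_p <- 3  # number of proteins (toy size)

# Made-up symmetric positive semi-definite kernels, built as Gram matrices
# of random feature vectors.
Fd <- matrix(rnorm(n_d * 5), n_d); Kd <- Fd %*% t(Fd)
Fp <- matrix(rnorm(n_p * 5), n_p); Kp <- Fp %*% t(Fp)

# Made-up affinities (drugs in rows, proteins in columns), two of them missing.
Y <- matrix(runif(n_d * n_p, 1, 10), n_d, n_p)
Y[2, 3] <- NaN
Y[4, 1] <- NaN

K <- kronecker(Kd, Kp)  # pairwise kernel, protein index varies fastest
y <- as.vector(t(Y))    # row-wise vectorization, same pair order as K
is_test <- is.na(y)

K_train <- K[!is_test, !is_test, drop = FALSE]  # training pairs only
K_test  <- K[!is_test,  is_test, drop = FALSE]  # training rows, test columns
y_train <- y[!is_test]

lambda <- 0.1
alpha  <- solve(K_train + lambda * diag(nrow(K_train)), y_train)
g      <- as.vector(alpha %*% K_test)  # one prediction per missing pair

# Write the predictions back into the drugs x proteins layout.
y_filled <- y
y_filled[is_test] <- g
Y_filled <- matrix(y_filled, n_d, n_p, byrow = TRUE)
Y_filled
```

For the full 100 x 100 data set, K_train has roughly 8,200 rows and columns, so forming and solving it directly is feasible but memory-hungry; the toy sizes above are only for readability.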

ADD REPLY • link 8.6 years ago by Jean-Karim Heriche 27k