I want to run Kernel Ridge Regression on a set of kernels I have computed, but I do not know how to do this in R. I found the constructKRRLearner function from the CVST package, but the manual is not clear at all, especially for me as a complete beginner in Machine Learning. The function needs an x and a y, but I have no idea what to input there, as I only have a data frame containing the pairwise kernel computed as a Kronecker product between drugs and proteins.
How can I do a Kernel Ridge Regression task in R?
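For reference, the usage pattern in the CVST documentation looks roughly like the sketch below (with toy placeholder data, not my DTI data); there x is a plain feature matrix and y a response vector, and the package builds the kernel itself from a kernlab kernel name, so I do not see where my precomputed kernel matrix would go:

    library(CVST)

    # Toy data: x is an n x d feature matrix, y the numeric response vector.
    x <- matrix(rnorm(200), ncol = 2)
    y <- x[, 1]^2 + rnorm(100, sd = 0.1)

    d      <- constructData(x = x, y = y)    # CVST's data container
    krr    <- constructKRRLearner()          # the KRR learner
    params <- list(kernel = "rbfdot", sigma = 1, lambda = 0.1 / getN(d))

    model <- krr$learn(d, params)
    preds <- krr$predict(model, d)           # predict on the training data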
Ideally I would also like to visualize my data points and overlay the regression curve on the plot, for instance like this:
http://scikit-learn.org/stable/_images/plot_kernel_ridge_regression_0011.png
MORE INFO ON MY DATASET
I have a drug-target interaction (DTI) data set. It comprises 100 drug compounds (rows) and 100 protein kinase targets (columns). There are some NaNs (missing values) in this data set. The values reflect how tightly a compound binds to a target.
I have the drugs' SMILES and ChEMBL IDs.
I have the proteins' (targets') sequences and UniProt IDs.
For the drugs [100 drugs]: I converted the drug SMILES to an SDFset and then computed a fingerprint for each drug using OpenBabel. Based on these fingerprints I computed Tanimoto kernels for all possible drug pairs (using the fpSim function), e.g. Drug 1 with Drugs 2, 3, 4, ..., 100, then Drug 2 with Drugs 1, 3, 4, ..., 100, and so on up to Drug 99 with Drug 100. I named this BASE_DRUG_KERNELS.
For the proteins: I had the protein sequences, so I computed Smith-Waterman scores for all protein pairs, e.g. Protein 1 with Proteins 2, 3, ..., 100, then Protein 2 with Proteins 1, 3, 4, ..., 100, and so on up to Protein 99 with Protein 100. I named this BASE_PROTEIN_KERNELS.
Then I computed the Kronecker product of BASE_DRUG_KERNELS and BASE_PROTEIN_KERNELS, which gave me a 10,000 x 10,000 matrix of 100,000,000 elements. I named this matrix KRONECKER_PRODUCTS.
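Roughly, this step looks like the following sketch (assuming the two base kernels are stored as 100 x 100 numeric matrices):

    # Kronecker product of the two base kernels: each (drug, protein) pair
    # becomes one row/column of the combined pairwise kernel matrix.
    KRONECKER_PRODUCTS <- kronecker(as.matrix(BASE_DRUG_KERNELS),
                                    as.matrix(BASE_PROTEIN_KERNELS))
    dim(KRONECKER_PRODUCTS)   # 10000 x 10000, i.e. 100,000,000 entries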
I wish to run Kernel Ridge Regression on the matrix KRONECKER_PRODUCTS.
Thanks a lot for your answer. I am quite lost, so I have to ask one more question: do you mean that I basically have to abandon the packaged R functions and simply compute the kernel ridge regression with the code you provided? Also, what does "y" refer to? Sorry for the elementary questions.
y is your target vector or matrix, e.g. the response variables or classes of the samples you are trying to model, just as in linear regression.
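In your DTI setting that means the binding values themselves, one per drug-protein pair, vectorized so that their order matches the rows of KRONECKER_PRODUCTS. A sketch, assuming a hypothetical affinity matrix with drugs as rows and proteins as columns, and the Kronecker product taken as kronecker(BASE_DRUG_KERNELS, BASE_PROTEIN_KERNELS):

    # affinity: 100 x 100 matrix of binding values (drugs in rows, proteins in columns).
    # kronecker(A, B) enumerates pairs as (drug 1, protein 1), (drug 1, protein 2), ...,
    # so vectorizing the transposed affinity matrix keeps y aligned with the kernel rows.
    y <- as.vector(t(affinity))

    # Pairs with missing (NaN) binding values must be dropped (or imputed)
    # from both y and the kernel matrix before fitting.
    keep <- !is.na(y)
    K    <- KRONECKER_PRODUCTS[keep, keep]
    y    <- y[keep]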
EDIT: Forgot to answer the part about abandoning R: the α parameters are obtained by applying the (pseudo)inverse of the kernel matrix to y. You can use any linear algebra routine for this: solve() in base R, ginv() from the MASS package, or the (pseudo)inverse computed from the SVD.
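For instance, a minimal sketch with a precomputed kernel matrix K and a matching response vector y (the names are placeholders):

    library(MASS)

    # alpha via the Moore-Penrose pseudoinverse of the kernel matrix
    alpha <- ginv(K) %*% y

    # ...or the same pseudoinverse built explicitly from the SVD: K = U D V', K^+ = V D^+ U'
    s     <- svd(K)
    tol   <- max(dim(K)) * max(s$d) * .Machine$double.eps
    d_inv <- ifelse(s$d > tol, 1 / s$d, 0)
    alpha_svd <- s$v %*% (d_inv * t(s$u)) %*% y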
Great! Then would it still make sense to alternatively find alpha by simply writing out the original formula in R, i.e.

    alpha <- (K + lambda*I)^-1 * y

Would this be as correct as alpha <- solve(K, y)?

This is the same thing, except that you'd need to optimize lambda, which in the end gives you the same solution (the purpose of lambda is just to make the matrix invertible). solve(Z, y) finds x as the solution of the equation y = Zx, i.e. x = Z^-1 y.
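One practical caveat (a sketch with a hypothetical lambda value): in R, ^-1 and * act elementwise on matrices, so the formula cannot be typed literally; it has to go through solve() or ginv():

    lambda <- 0.1            # hypothetical value; in practice tuned, e.g. by cross-validation
    I      <- diag(nrow(K))

    # Literal translation of the formula: invert (K + lambda*I), then multiply by y.
    alpha_inverse <- solve(K + lambda * I) %*% y

    # Preferable in practice: solve the linear system (K + lambda*I) %*% alpha = y directly,
    # without forming the inverse explicitly.
    alpha <- solve(K + lambda * I, y)

    all.equal(as.vector(alpha_inverse), as.vector(alpha))   # TRUE up to rounding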
Fantastic, thank you once again for your very detailed response. I have been stuck on this for weeks, so I really appreciate your input. One last thing remains: can the prediction function g(x) for the actual KRR prediction step also be implemented just as simply in R, or does that require more advanced programming? I am referring to this function: g(x) = sum_i alpha_i * K(x, x_i).
To see what to do, put the equation into words: If we call K() a similarity function, the prediction for x is the weighted sum of the similarities of x to the training set elements.
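In R that weighted sum is just a matrix product, so no advanced programming is needed. A sketch, assuming K_new holds the kernel values between the new points (rows) and the training points (columns), and alpha comes from the fit above:

    # g(x) = sum_i alpha_i * K(x, x_i); for a whole set of points this is one matrix product.
    predict_krr <- function(K_new, alpha) {
      as.vector(K_new %*% alpha)
    }

    # Fitted values on the training data itself (K is the training kernel matrix):
    y_hat <- predict_krr(K, alpha)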