Question

SVM for classified gene expression data

1

Entering edit mode

9.3 years ago

Shaurya Jauhari ▴ 50

Hello.

This is a naive attempt. I am working on a sample-gene microarray dataset with 12 samples (6 normal and corresponding 6 tumor) and 356 EST values. How can I move forward with building an SVM classifier? The data is made available to me as a data frame in R. Also, is it a one-one classification or a one-all classification problem? Should I proceed with building a classifier for each (normal-tumor) pair columns and continue with all 6 such pairs to finally average out a result?

Thanks.

R svm microarray • 8.5k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.3 years ago by Shaurya Jauhari ▴ 50

4

Entering edit mode

Assuming you want to train a classifier to distinguish between normal and tumor, there are many tutorials on how to do this in R such as here, here and here.

However, I'd worry about the small size of your training set.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

This subset is filtered from a larger dataset. Generally, SVM takes a three column input: 2 classes and a response value. With reference to the problem statement, it seems my putative model has multiple covariates. How do I condense such a situation into a SVM compatible format?

ADD REPLY • link 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

I am not sure what you mean by input being "two classes and a response value". Typically classifiers like SVM work with a training a set of vectors (here this would be your EST values) with associated class labels (here normal/tumor). The real input to the SVM is actually a kernel matrix representing samples pairwise similarities. Using the kernlab package in R, you could do something like this:

# Assume data frame called EST_train has training samples in rows, 
# one column is named class and contains class labels,
# the other columns are EST values
# SVM using Gaussian kernel with parameter sigma=0.1,
# with 5-fold cross validation
classifier<-ksvm(class~.,data=EST_train,kernel="rbfdot",kpar=list(sigma=0.1),cross=5)
# Classify new samples in data frame called EST_test and
# output probability of class membership
predict(classifier,EST_test,type="probabilities")

ADD REPLY • link 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Thanks for the input. I do have class labels and EST values as columns. I've however included a factor array of -1 and 1, for distinguishing negative and positive examples (stratifying classes on the basis of being normal/ tumor). I tried the following:, but it wouldn't work. Note that I have 356 EST columns.

> install.packages("e1071")
> library(e1071)

> y <- c(rep(1,6),rep(-1,6))
> d <- data.frame(x=temp,y=as.factor(y))
> svmfit <- svm(y~., data=d, kernel="linear", cost=10, scale = FALSE)
> plot(svmfit,temp)

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

I don't know how much viable it is but with the use of kernlab, as suggested, the following was discerned:

> classifier
Support Vector Machine object of class "ksvm" 

SV type: eps-svr  (regression) 
 parameter : epsilon = 0.1  cost C = 1 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.1 

Number of Support Vectors : 12 

Objective Function Value : -4.4107 
Training error : 0.009992 
Cross validation error : 1.459338

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

In which way does it not work? Do you get an error message? Or is it not giving a good model e.g. high error? In the later case, there are a few things to try: first use scaling, second use another kernel (typically Gaussian/RBF seem to work well on many problems), third use cross-validation. Finally, keep in mind that your EST values might not be useful for this task and if so you'll never get a good model.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

There is no error message. But if I consider visualizing the results, it's not happening.

> plot(<svm model>, <data>)

There is some "missing formula" issue. Surely, I can try disparate kernels and consider playing with cost function to arrive at the optima with tune().

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

It seems you're using a regression SVM (SV type: eps-svr in the output above) so I think there's a problem with your data structure. First, it looks like y is not taken as factors then make sure that your temp structure only contains the EST values with samples as rows ordered as in y. The missing formula bit is because you have multiple variables but can only make a 2D plot so you need to select which variable to use in the plot. Try

labels<-as.factor(c(rep("normal",6),rep("tumor",6)))
temp[,"class"]<-labels
svmfit<-svm(class~., data=temp, kernel="linear")
plot(svmfit,temp,<y axis variable> ~ <x axis variable>)

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Factors have already been marked. Maybe you missed observing the statement: d <- data.frame(x=temp, y=as.factor(y))

temp was previously a EST matrix which later clubbed with a last column y (factors) and together converted as a data frame d. So, now the columns are all genes (356 in all) and rows are samples( first 6 are normal and bottom 6 are cancer, as defined by factors y). This is the current format I have.

The objective is to ascertain a threshold (quantitative) that stratifies "normal" and "cancer" expression space. The idea to adopt SVM deemed handy, so I went forth. Should I bother about linear separability? A plot will be a welcome addition as visualization brings noticeable articulation. However, I do have a cognizance of the fact that since multiple variables are involved therefore there cannot be a 2D plot. I tried your suggested code above, but what's <y axis variable> and <x axis variable>? Are they any (row, column) pair out of 356 columns(excluding a class label/ factor column) and 12 rows?

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

I didn't miss the as.factor part but the fact that your svm used regression by default indicated that it didn't see y as factors. <y axis variable> means select the column you want in y, same for x, e.g. if your columns have names like EST1, EST2 ... then do

plot(svmfit,temp, EST3 ~ EST1)

Given that you have multiple variables, what kind of threshold are you talking about? One for each variable? Or do you want to find which variables are best predictors of the classes?

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Thanks for your constant supervision. I am kind of amateur to data analysis and SVM in particular.

This is the present state of my data. My data frame is (13*356) in dimension. (13: 12 samples + 1 class label/factors). Usually, a column is picked by <data frame>$<column name>. While trying so for plotting, an error occurs:

> temp <- WorkFinalPDEG
> temp$ID_REF <- NULL
> temp <- log2(temp)
> temp <- t(temp)
> d <- data.frame(temp)
> colnames(d)<- WorkFinalPDEG$ID_REF
> labels<-as.factor(c(rep("normal",6),rep("tumor",6)))
> d[ ,"class"]<-labels
> svmfit <- svm(class~., data=d, kernel="linear", cost=10, scale = FALSE)
> plot(svmfit,d, d$x.1552343_s_at ~ d$x.1552509_a_at)
Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' and 'y' lengths differ

Also, let me clarify on the threshold part. There are expression values of 356 genes (columns) tabulated across 12 samples (rows). First 6 rows have normal values and last 6 have tumor values. Thus, the factor values are set so. Now, I wish to find the equation of a line/plane that separates the normal and cancer expression spaces. To add, I am looking for a general plot for the whole data frame and not for individual gene measurements.

Thanks in advance.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

Am I trying regression or classification, here?

ADD REPLY • link 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

I don't understand why/how the class labels can be a row unless they indicate classes for the column variables. So first thing to do is to remove that extra row so that you have a 12 x 356 data frame. Then if it already has samples as rows, why do you transpose it ?

ADD REPLY • link 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

This is a classification problem. However, if you want a linear separation in terms of the original features, then I think it would be simpler to use a linear discriminant analysis approach.

ADD REPLY • link 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

The following code finally worked:

> library(e1071)
> temp <- WorkFinalPDEG
> temp$ID_REF <- NULL
> temp <- log2(temp)
> temp <- t(temp)
> y <- c(rep(1,6),rep(-1,6))
> d <- data.frame(x=temp,y=as.factor(y))
> svmfit <- svm(y~., data=d, kernel="linear", cost=10, scale = FALSE)

To test a plot, I included first two genes(columns):

> plot(svmfit,d, x.1 ~ x.2)

The plot however shows no separation of spaces. How do I interpret the results here?

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

The two classes can't be distinguished with these two genes. Since you're using the linear kernel, the feature weights correspond directly to the gene contributions. You could try plotting with the two genes with highest weights. You can get them with:

W<-t(svmfit$coefs)%*%svmfit$SV

which are also the parameters of the hyperplane you're looking for. For completeness, the bias vector is svmfit$rho.

Note that you should use the tune() function and most likely also scale the input for good results.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Thank you Dr. Heriche for the support. I'll take it from here.

ADD REPLY • link 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

I tried again. Still shows an error.

> library(e1071)
> temp <- WorkFinalPDEG
> temp$ID_REF <- NULL
> temp <- log2(temp)
> temp <- t(temp)
> d <- data.frame(temp)
> colnames(d)<- WorkFinalPDEG$ID_REF
> labels<-as.factor(c(rep("normal",6),rep("tumor",6)))
> d[ ,"class"]<-labels
> svmfit <- svm(class~., data=d, kernel="linear", cost=10, scale = FALSE)
> plot(svmfit,d, '1552343_s_at' ~ '1552509_a_at')
Error in terms.formula(formula, data = data) : 
  invalid term in model formula

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

Don't quote the variable names. Just write them as they appear when you do colnames(d).

ADD REPLY • link 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

What is your goal in building the classifier? Are you going to apply to new samples?

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Sean Davis 27k

0

Entering edit mode

For now, I aim to stratify the gene expression space (feature space) as normal and cancer subspaces.

ADD REPLY • link 9.2 years ago by Shaurya Jauhari ▴ 50

0

Entering edit mode

There are 6 normal samples(columns) and corresponding 6 tumor samples, 12 in total. Broadly the classes are 2, viz. tumor and normal.

ADD REPLY • link 9.2 years ago by Shaurya Jauhari ▴ 50