Entering edit mode
8.9 years ago
Shaurya Jauhari
▴
50
Hello.
This is a naive attempt. I am working on a sample-gene microarray dataset with 12 samples (6 normal and corresponding 6 tumor) and 356 EST values. How can I move forward with building an SVM classifier? The data is made available to me as a data frame in R. Also, is it a one-one classification or a one-all classification problem? Should I proceed with building a classifier for each (normal-tumor) pair columns and continue with all 6 such pairs to finally average out a result?
Thanks.
Assuming you want to train a classifier to distinguish between normal and tumor, there are many tutorials on how to do this in R such as here, here and here.
However, I'd worry about the small size of your training set.
This subset is filtered from a larger dataset. Generally, SVM takes a three column input: 2 classes and a response value. With reference to the problem statement, it seems my putative model has multiple covariates. How do I condense such a situation into a SVM compatible format?
I am not sure what you mean by input being "two classes and a response value". Typically classifiers like SVM work with a training a set of vectors (here this would be your EST values) with associated class labels (here normal/tumor). The real input to the SVM is actually a kernel matrix representing samples pairwise similarities. Using the kernlab package in R, you could do something like this:
Thanks for the input. I do have class labels and EST values as columns. I've however included a factor array of -1 and 1, for distinguishing negative and positive examples (stratifying classes on the basis of being normal/ tumor). I tried the following:, but it wouldn't work. Note that I have 356 EST columns.
I don't know how much viable it is but with the use of kernlab, as suggested, the following was discerned:
In which way does it not work? Do you get an error message? Or is it not giving a good model e.g. high error? In the later case, there are a few things to try: first use scaling, second use another kernel (typically Gaussian/RBF seem to work well on many problems), third use cross-validation. Finally, keep in mind that your EST values might not be useful for this task and if so you'll never get a good model.
There is no error message. But if I consider visualizing the results, it's not happening.
There is some "missing formula" issue. Surely, I can try disparate kernels and consider playing with cost function to arrive at the optima with tune().
It seems you're using a regression SVM (SV type: eps-svr in the output above) so I think there's a problem with your data structure. First, it looks like y is not taken as factors then make sure that your temp structure only contains the EST values with samples as rows ordered as in y. The missing formula bit is because you have multiple variables but can only make a 2D plot so you need to select which variable to use in the plot. Try
Factors have already been marked. Maybe you missed observing the statement:
d <- data.frame(x=temp, y=as.factor(y))
temp was previously a EST matrix which later clubbed with a last column y (factors) and together converted as a data frame d. So, now the columns are all genes (356 in all) and rows are samples( first 6 are normal and bottom 6 are cancer, as defined by factors y). This is the current format I have.
The objective is to ascertain a threshold (quantitative) that stratifies "normal" and "cancer" expression space. The idea to adopt SVM deemed handy, so I went forth. Should I bother about linear separability? A plot will be a welcome addition as visualization brings noticeable articulation. However, I do have a cognizance of the fact that since multiple variables are involved therefore there cannot be a 2D plot. I tried your suggested code above, but what's
<y axis variable>
and<x axis variable>
? Are they any (row, column) pair out of 356 columns(excluding a class label/ factor column) and 12 rows?I didn't miss the as.factor part but the fact that your svm used regression by default indicated that it didn't see y as factors.
<y axis variable>
means select the column you want in y, same for x, e.g. if your columns have names like EST1, EST2 ... then doGiven that you have multiple variables, what kind of threshold are you talking about? One for each variable? Or do you want to find which variables are best predictors of the classes?
Thanks for your constant supervision. I am kind of amateur to data analysis and SVM in particular.
This is the present state of my data. My data frame is (13*356) in dimension. (13: 12 samples + 1 class label/factors). Usually, a column is picked by
<data frame>$<column name>
. While trying so for plotting, an error occurs:Also, let me clarify on the threshold part. There are expression values of 356 genes (columns) tabulated across 12 samples (rows). First 6 rows have normal values and last 6 have tumor values. Thus, the factor values are set so. Now, I wish to find the equation of a line/plane that separates the normal and cancer expression spaces. To add, I am looking for a general plot for the whole data frame and not for individual gene measurements.
Thanks in advance.
Am I trying regression or classification, here?
I don't understand why/how the class labels can be a row unless they indicate classes for the column variables. So first thing to do is to remove that extra row so that you have a 12 x 356 data frame. Then if it already has samples as rows, why do you transpose it ?
This is a classification problem. However, if you want a linear separation in terms of the original features, then I think it would be simpler to use a linear discriminant analysis approach.
The following code finally worked:
To test a plot, I included first two genes(columns):
The plot however shows no separation of spaces. How do I interpret the results here?
The two classes can't be distinguished with these two genes. Since you're using the linear kernel, the feature weights correspond directly to the gene contributions. You could try plotting with the two genes with highest weights. You can get them with:
which are also the parameters of the hyperplane you're looking for. For completeness, the bias vector is
svmfit$rho
.Note that you should use the
tune()
function and most likely also scale the input for good results.Thank you Dr. Heriche for the support. I'll take it from here.
I tried again. Still shows an error.
Don't quote the variable names. Just write them as they appear when you do colnames(d).
What is your goal in building the classifier? Are you going to apply to new samples?
For now, I aim to stratify the gene expression space (feature space) as normal and cancer subspaces.
There are 6 normal samples(columns) and corresponding 6 tumor samples, 12 in total. Broadly the classes are 2, viz. tumor and normal.