Hi.
I'm analyzing a microarray dataset that consists of 920 variables (genes) and 6 samples (treatments), with 10 replicates per treatment.
I ran a kNN classification on these data with the knn() function in R, for 100 different samples each time, and I got an error rate of about 30-35%. So I thought to somehow filter out the variables that are least significant for the model and then try again, to see if the error rate decreases.
Is that thought right? How could I approach such an idea? Is running lm() and getting the F-statistic for each gene a good idea, or are there other ways to do it?
Any idea/paper/tutorial related to such procedures is welcome.
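To make the filtering idea concrete, here is roughly what I have in mind (a sketch only; the toy data and the names `expr` and `treatment` are placeholders for my real objects):

```r
# Sketch: rank genes by the one-way ANOVA F-statistic of expression ~ treatment,
# then keep only the top-ranked genes before running kNN again.
set.seed(1)
expr <- matrix(rnorm(920 * 60), nrow = 920, ncol = 60)   # toy stand-in for my 920 x 60 matrix
treatment <- factor(rep(paste0("T", 1:6), each = 10))    # 6 treatments x 10 replicates

f_stat <- apply(expr, 1, function(gene) {
  fit <- lm(gene ~ treatment)
  summary(fit)$fstatistic[1]   # overall F-statistic of the per-gene fit
})

# Keep, say, the 100 genes with the largest F-statistics
top_genes <- order(f_stat, decreasing = TRUE)[1:100]
expr_filtered <- expr[top_genes, ]
```

(If I understand correctly, the genefilter package on Bioconductor has a rowFtests() function that does the per-gene F-tests much faster than a loop of lm() fits, but the sketch above is the plain base-R version of the idea.)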
Thank you in advance.
Hi theobroma22
My initial matrix is 920 x 60. I used 70% of it as the training set and the remaining 30% as the test set. I did this 100 different times using the sample() function, for k values 1:20. I counted the error rate as below:
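In code, the procedure looks roughly like this (toy data in place of my real matrix; `expr` and `treatment` are placeholder names):

```r
library(class)  # for knn()

set.seed(42)
expr <- matrix(rnorm(920 * 60), nrow = 920)            # toy 920 genes x 60 samples
treatment <- factor(rep(paste0("T", 1:6), each = 10))  # class label per sample

# knn() expects rows = observations, so transpose to samples x genes
X <- t(expr)

n <- nrow(X)                              # 60 samples
err <- matrix(NA, nrow = 100, ncol = 20)  # 100 resamples x k = 1:20

for (i in 1:100) {
  train_idx <- sample(n, size = round(0.7 * n))   # 70% training split
  for (k in 1:20) {
    pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
                cl = treatment[train_idx], k = k)
    err[i, k] <- mean(pred != treatment[-train_idx])  # misclassification rate
  }
}

mean_err_per_k <- colMeans(err)  # average error rate for each k
```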
Ok. Can you do it again using just k = 31? I got this number by taking the closest odd number to the square root of your 920 genes. You can use 100 samples again.
Thanks. My point is that you may need a larger k value than the ones you used (1:20).
Bad news. It has an 80% error rate with k = 31. Here is the image.
Also, I just noticed a mistake that I made in my initial post. I didn't mean 30-35%; I meant a 50-55% error rate.
I can get 30-35% only if I delete some replicate arrays that have a low correlation with the rest of their replicates, and as you can imagine, that is not a wise way to go.
It's clearer to me now. So, based on what you just said, you can take the average of the reps and use that as your input data, so it would be a 960 x 6 matrix.
To be honest, I didn't exactly understand what you just said :-)
Sorry, 920 genes x 6 averaged reps (10 reps per treatment or contrast). Is your data from a single experiment, or pooled together from multiple experiments? It seems the 10 reps for each treatment are unwarranted, per se, since variability is "inherent", or there is a lot of human technical error among some or all treatment reps; hence the large error rate. In base R, you can easily average the reps by giving replicates identical treatment names and then using colMeans().
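For example (toy data; this version averages replicate columns with rowMeans() on the genes-x-samples orientation, which is equivalent to calling colMeans() on the transposed matrix):

```r
set.seed(7)
expr <- matrix(rnorm(920 * 60), nrow = 920)    # genes x samples
treatment <- rep(paste0("T", 1:6), each = 10)  # identical name for each treatment's replicates

# Average the 10 replicate columns per treatment: result is 920 genes x 6 treatments
avg <- sapply(unique(treatment), function(tr) rowMeans(expr[, treatment == tr]))
dim(avg)  # 920 x 6
```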
If I understood correctly, after colMeans() I'm going to have a new matrix of 6 (treatments) x 920 (genes). Isn't this matrix too small to draw training and test data from? To train the kNN I need at least one instance of each treatment, which means I would use all 6 rows to train the algorithm. If that's the case, which row am I going to use to test the training?
A large training set helps to give a good model, but a large validation set increases the significance of the result you report. You need to strike a balance. The choice of what fraction to use for training may depend on how messy the data is, or how complex the modelling method is. A reasonable balance is 2/3 training and 1/3 validation, so 612 x 6 training and 308 x 6 validation. Furthermore, in splitting the data you want the training set (and implicitly the validation set) to be representative of the real world. That is, you want the same fraction of functional genes in each set, but it seems this may be beyond your setup.

Have you ever tried the Mfuzz package available on Bioconductor? I would recommend it for your analysis.
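A sketch of that random 2/3 vs 1/3 split on the averaged matrix (toy data; the object names and the exact 612/308 sizes follow the numbers above):

```r
set.seed(3)
expr <- matrix(rnorm(920 * 6), nrow = 920)  # toy averaged matrix, 920 genes x 6 treatments

n_genes <- 920
n_train <- 612                              # split sizes: 612 training / 308 validation

idx <- sample(n_genes)                      # random permutation of gene indices
train_idx <- idx[1:n_train]
valid_idx <- idx[(n_train + 1):n_genes]

train_set <- expr[train_idx, ]              # 612 x 6
valid_set <- expr[valid_idx, ]              # 308 x 6
```

Note this is a simple random split; it does not by itself guarantee the same fraction of functional genes in each set, which is the representativeness caveat above.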