Hi.
I'm analyzing a microarray dataset that consists of 920 variables (genes) and 6 samples (treatments), with 10 replicates per treatment.
I ran a kNN classification on these data with the knn() function in R, for 100 different samples each time, and I got an error rate of about 30-35%. So I thought to somehow filter out the variables that are least significant for the model and then try again, to see if the error rate decreases.
Is that thought right? How could I approach such an idea? Is running lm() and getting the F-statistic for each gene a good idea, or are there other ways to do it?
Any idea/paper/tutorial related to such procedures is welcome.
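To make the filtering idea concrete, here is roughly what I have in mind (a sketch only; the toy data and the names `expr` and `treatment` are placeholders for my real objects):

```r
# Sketch: rank genes by the one-way ANOVA F-statistic of expression ~ treatment,
# then keep only the top-ranked genes before running kNN again.
set.seed(1)
expr <- matrix(rnorm(920 * 60), nrow = 920, ncol = 60)   # toy stand-in for my 920 x 60 matrix
treatment <- factor(rep(paste0("T", 1:6), each = 10))    # 6 treatments x 10 replicates

f_stat <- apply(expr, 1, function(gene) {
  fit <- lm(gene ~ treatment)
  summary(fit)$fstatistic[1]   # overall F-statistic of the per-gene fit
})

# Keep, say, the 100 genes with the largest F-statistics
top_genes <- order(f_stat, decreasing = TRUE)[1:100]
expr_filtered <- expr[top_genes, ]
```

(If I understand correctly, the genefilter package on Bioconductor has a rowFtests() function that does the per-gene F-tests much faster than a loop of lm() fits, but the sketch above is the plain base-R version of the idea.)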
Thank you in advance.
Hi theobroma22
My initial matrix is 920 x 60. I used 70% of it as the training set and the remaining 30% as the test set. I did this 100 different times using the sample() function, for k values 1:20. I counted the error rate as below:
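In code, the procedure looks roughly like this (toy data in place of my real matrix; `expr` and `treatment` are placeholder names):

```r
library(class)  # for knn()

set.seed(42)
expr <- matrix(rnorm(920 * 60), nrow = 920)            # toy 920 genes x 60 samples
treatment <- factor(rep(paste0("T", 1:6), each = 10))  # class label per sample

# knn() expects rows = observations, so transpose to samples x genes
X <- t(expr)

n <- nrow(X)                              # 60 samples
err <- matrix(NA, nrow = 100, ncol = 20)  # 100 resamples x k = 1:20

for (i in 1:100) {
  train_idx <- sample(n, size = round(0.7 * n))   # 70% training split
  for (k in 1:20) {
    pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
                cl = treatment[train_idx], k = k)
    err[i, k] <- mean(pred != treatment[-train_idx])  # misclassification rate
  }
}

mean_err_per_k <- colMeans(err)  # average error rate for each k
```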
Ok. Can you do it again using just k = 31? I got this number by taking the closest odd number to the square root of your 920 genes. You can use 100 samples again.
Thanks. My point is that you may need a larger k value than the ones you used (1:20).
Bad news. It has an 80% error rate with k = 31. Here is the image.
Also, I just noticed a mistake that I made in my initial post. I didn't mean 30-35%; I meant a 50-55% error rate.
I can get 30-35% only if I delete some replicate arrays that have a low correlation with the rest of their replicates, and as you can imagine, that is not a wise way to go.
It's clearer to me now. So, based on what you just said, you can take the average of the reps and use that as your input data, so it would be a 960 x 6 matrix.
To be honest, I didn't exactly understand what you just said :-)
Sorry, 920 genes x 6 averaged reps (10 reps per treatment or contrast). Is your data from a single experiment, or pooled together from multiple experiments? It seems the 10 reps for each treatment are unwarranted, per se, since variability is "inherent", or there is a lot of human technical error among some or all treatment reps; hence the large error rate. In base R, you can easily average the reps by giving replicates identical treatment names and then using colMeans().
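For example (toy data; this version averages replicate columns with rowMeans() on the genes-x-samples orientation, which is equivalent to calling colMeans() on the transposed matrix):

```r
set.seed(7)
expr <- matrix(rnorm(920 * 60), nrow = 920)    # genes x samples
treatment <- rep(paste0("T", 1:6), each = 10)  # identical name for each treatment's replicates

# Average the 10 replicate columns per treatment: result is 920 genes x 6 treatments
avg <- sapply(unique(treatment), function(tr) rowMeans(expr[, treatment == tr]))
dim(avg)  # 920 x 6
```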
If I understood correctly, after colMeans() I'm going to have a new matrix of 6 (treatments) x 920 (genes). Isn't this matrix too small to draw training and test data from? To train the kNN I need at least one instance of each treatment, which means I would use all 6 rows to train the algorithm. If that's the case, which row am I going to use to test the training?
A large training set helps to give a good model, but a large validation set increases the significance of the result you report. You need to strike a balance. The choice of what fraction to use for training may depend on how messy the data is, or how complex the modelling method is. A reasonable balance is 2/3 training and 1/3 validation, so 612 x 6 training and 308 x 6 validation. Furthermore, in splitting the data you want the training set (and implicitly the validation set) to be representative of the real world. That is, you want the same fraction of functional genes in each set, but it seems this may be beyond your setup.

Have you ever tried the Mfuzz package available on Bioconductor? I would recommend it for your analysis.
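A sketch of that random 2/3 vs 1/3 split on the averaged matrix (toy data; the object names and the exact 612/308 sizes follow the numbers above):

```r
set.seed(3)
expr <- matrix(rnorm(920 * 6), nrow = 920)  # toy averaged matrix, 920 genes x 6 treatments

n_genes <- 920
n_train <- 612                              # split sizes: 612 training / 308 validation

idx <- sample(n_genes)                      # random permutation of gene indices
train_idx <- idx[1:n_train]
valid_idx <- idx[(n_train + 1):n_genes]

train_set <- expr[train_idx, ]              # 612 x 6
valid_set <- expr[valid_idx, ]              # 308 x 6
```

Note this is a simple random split; it does not by itself guarantee the same fraction of functional genes in each set, which is the representativeness caveat above.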