Suppose you have a new algorithm that you want to publish. Are there any best practices and methodologies you usually consider for testing the robustness and performance of new methods?
The case is a novel classification algorithm for gene expression samples.
At the very least, you should do cross-validation (e.g. leave-one-out cross-validation) on a dataset. You can also apply the algorithm to other publicly available datasets (if they have metadata for the characteristic that you are trying to predict), which I think is a better test.
In both cases, you can use something like the ROCR package to create an ROC plot showing the tradeoff between sensitivity and specificity. Creating a table with statistics like the positive predictive value and negative predictive value would also be nice. However, these are all relevant for binary outcomes - I'm not sure if that is what you are trying to predict.
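ROCR is an R package; the same workflow (leave-one-out cross-validation, an ROC summary, and PPV/NPV) can be sketched in Python with scikit-learn. The data set below is synthetic, standing in for a real expression matrix, and the classifier choice is just a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score, confusion_matrix

# Synthetic stand-in for an expression matrix: 100 samples x 50 genes
X, y = make_classification(n_samples=100, n_features=50, n_informative=10,
                           random_state=0)

clf = LogisticRegression(max_iter=1000)

# Leave-one-out CV: each sample is scored by a model trained on all the others
scores = cross_val_predict(clf, X, y, cv=LeaveOneOut(),
                           method="predict_proba")[:, 1]

print("LOOCV AUC:", roc_auc_score(y, scores))

# PPV and NPV at a 0.5 probability threshold
tn, fp, fn, tp = confusion_matrix(y, scores > 0.5).ravel()
print("PPV:", tp / (tp + fp))
print("NPV:", tn / (tn + fn))
```

The LOOCV scores can also be fed to `sklearn.metrics.roc_curve` to draw the actual ROC plot, analogous to ROCR's `performance(pred, "tpr", "fpr")`.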
If you mean classification in the strict sense, i.e. supervised classification of samples based on gene expression, then the most basic things to do are:
Train your classifier with relatively large negative and positive sets. Report precision and recall using cross-validation.
Select positive and negative validation sets (more is better), ensure that those samples were not used during training, and report the precision and recall of the trained classifier on those sets.
The second step is really critical to show that you're not over-fitting the data.
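A minimal sketch of that two-step protocol with scikit-learn, on synthetic data (the classifier, split sizes, and data set are all illustrative): set aside the validation samples first, cross-validate on the training portion only, then report precision and recall on the untouched hold-out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in: 300 samples x 50 genes, binary labels
X, y = make_classification(n_samples=300, n_features=50, random_state=1)

# Set the validation samples aside before any training happens
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

clf = RandomForestClassifier(random_state=1)

# Step 1: precision/recall via cross-validation on the training set only
cv = cross_validate(clf, X_train, y_train, cv=5,
                    scoring=["precision", "recall"])
print("CV precision:", cv["test_precision"].mean())
print("CV recall:   ", cv["test_recall"].mean())

# Step 2: fit on the full training set, report on the held-out validation set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_val)
print("Validation precision:", precision_score(y_val, y_pred))
print("Validation recall:   ", recall_score(y_val, y_pred))
```

A large gap between the cross-validation numbers and the validation-set numbers is exactly the over-fitting signal the second step is meant to expose.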
I mean, if you have a binary classifier that tells you, e.g., whether a sample comes from tumor or normal tissue, then the positive set will be tumor expression datasets and the negative set will be normal expression datasets. Of course, it all depends on what your classifier is meant to do.
The simplest way is to split the problem into several binary classification problems. So the positive set will be one sample type and the negative set will be comprised of the other types. Note that positive sets should have a sufficient number of associated samples. For sample types characterized by few samples, it is better to leave them aside and then manually check whether they are classified to a reasonable cluster. For accuracy measures for the n-class classification problem, have a look at http://rali.iro.umontreal.ca/rali/sites/default/files/publis/SokolovaLapalme-JIPM09.pdf
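That one-vs-rest decomposition is built into scikit-learn, so a sketch of it is short (again on synthetic data; the four "sample types" and the logistic regression base classifier are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in: 4 sample types, 400 samples x 50 genes
X, y = make_classification(n_samples=400, n_features=50, n_informative=12,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=2)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

# One binary classifier per sample type: that type vs. all others pooled
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_tr, y_tr)

# Per-class precision/recall plus the macro/micro averages discussed
# in the Sokolova & Lapalme paper linked above
print(classification_report(y_te, ovr.predict(X_te)))
```

`ovr.estimators_` holds the individual binary classifiers, so the rare sample types you leave aside can later be scored against each of them by hand.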
I stopped believing results from any classifier built from high-dimensional input data (like gene expression data sets) unless the results are shown to replicate on a completely independent data set, ideally produced by another research group. Cross-validation is a minimum must-have, but even with it there is just too much data massaging and overfitting going on.
So if you have access to an independent data set, use it to assess the performance of your classifier before publishing, but be honest: don't cheat and tune your classifier afterwards to improve the results. I know this sounds harsh, but the field has been plagued by unreproducible and non-replicable results for too long.
Thanks for the suggestions. I will try them.