Determining Most Diagnostic Differentially Expressed Genes
11.0 years ago

I am working on a differential expression analysis and would like to explore the genes that are most informative for the contrast I'm studying. I have sorted the genes by magnitude of fold change (using 1/fold change where the fold change is < 1) and looked at the top 10, but the highest fold change does not necessarily mean the most diagnostic between the two conditions. I'm reading an older paper that used leave-one-out cross-validation class prediction to identify the most predictive genes in a differential expression analysis. However, the software they cite requires an expensive paid license and has no doubt changed significantly in 10 years.
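For concreteness, the fold-change ranking described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the gene names and fold-change values are hypothetical, and the function names are made up for this example. It assumes fold changes are positive ratios, so a ratio below 1 is replaced by its reciprocal before sorting.

```python
def fold_change_magnitude(fc):
    """Symmetric magnitude of a ratio: 4.0 and 0.25 both map to 4.0."""
    if fc <= 0:
        raise ValueError("fold change must be a positive ratio")
    return fc if fc >= 1 else 1.0 / fc


def top_genes_by_fold_change(fold_changes, n=10):
    """Return the n gene names with the largest symmetric fold change."""
    ranked = sorted(
        fold_changes,
        key=lambda gene: fold_change_magnitude(fold_changes[gene]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical fold changes (condition B / condition A), for illustration only.
example_fc = {"geneA": 8.0, "geneB": 0.1, "geneC": 1.5, "geneD": 0.5}
top2 = top_genes_by_fold_change(example_fc, n=2)
```

A ranking like this only reflects effect size; it says nothing about how consistently a gene separates the samples, which is why a cross-validated classifier is of interest.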

What software is available for leave-one-out cross-validation class prediction these days? Can you provide a simple example usage? My preference would be an R-based solution, but I'm flexible on that point.

rna-seq differential-expression • 4.1k views
11.0 years ago
B. Arman Aksoy ★ 1.2k

I'm not aware of an out-of-the-box solution, but if I were you, I would try a basic machine learning method, such as LASSO or Elastic Net (glmnet) or Random Forest (randomForest), and extract the most important features for classification using fold-based cross-validation:

  1. Using cvTools in R, partition your data into training and test sets
  2. Train your model on the training set (X = expression matrix, y = classification vector)
  3. Use the trained model to predict the classes of the test data and see how it performs
  4. Repeat this for different folds (i.e. different partitions)

If you think the predictions look reasonable, you can extract the features easily from your fit and use them for new classifications.
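The four steps above can be sketched in a few lines. This is not the glmnet or randomForest workflow itself: to keep the example self-contained it uses only the Python standard library and swaps in a simple nearest-centroid classifier as a stand-in for the model. The function names and the toy data are assumptions made for illustration. Setting the number of folds equal to the number of samples gives leave-one-out cross-validation.

```python
import math


def k_fold_indices(n_samples, k):
    """Split sample indices 0..n-1 into k folds; k == n_samples is leave-one-out."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def centroid(rows):
    """Mean expression profile of a list of samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_nearest_centroid(X, y):
    """Stand-in model: one mean profile per class label."""
    return {
        label: centroid([x for x, lab in zip(X, y) if lab == label])
        for label in set(y)
    }


def predict(model, x):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], x))


def cross_validate(X, y, k):
    """Return the fraction of held-out samples classified correctly."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [lab for i, lab in enumerate(y) if i not in held_out]
        model = train_nearest_centroid(train_X, train_y)  # step 2
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)  # step 3
    return correct / len(X)


# Toy data (hypothetical): rows are samples, columns are genes.
X = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1], [3.0, 5.0], [3.1, 4.9], [2.9, 5.2]]
y = ["ctrl", "ctrl", "ctrl", "case", "case", "case"]
loo_accuracy = cross_validate(X, y, k=len(X))
```

With a sparse model such as LASSO in place of the stand-in, the genes with non-zero coefficients in each fold would be the candidate features to extract.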


I agree. The R randomForest package also has built-in feature importance estimation, which is computed from the out-of-bag samples (a form of internal cross-validation). You might also want to try a random GLM (http://www.biomedcentral.com/1471-2105/14/5), a promising mix of GLMs and RFs.

11.0 years ago

I ended up using the Weka data mining software. I wrote this script to convert the data to ARFF format, and then I did feature selection using information gain. As I have 11 samples, I set it to do 11-fold cross-validation, which is leave-one-out cross-validation. To confirm, I extracted only the top 25 features (genes) and loaded only these into Weka and got near-perfect classification using several classification methods.
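For readers who want to see what information-gain ranking does, here is a rough sketch using only the Python standard library. It is not Weka's implementation: it assumes each gene is split into "high" and "low" at its median, which is a simplification, and Weka's own discretization will differ. The function names and toy expression values are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy after splitting one gene at its median."""
    cut = median(values)
    high = [lab for v, lab in zip(values, labels) if v > cut]
    low = [lab for v, lab in zip(values, labels) if v <= cut]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (high, low) if part)
    return entropy(labels) - remainder


def rank_genes(expression, labels, top=25):
    """expression maps gene name -> list of values, one per sample."""
    scores = {gene: information_gain(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy data (hypothetical): three genes measured in six samples.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene1": [1, 2, 1, 9, 8, 9],  # separates the classes perfectly
    "gene2": [5, 4, 6, 5, 4, 6],  # uninformative
    "gene3": [1, 9, 2, 8, 1, 9],  # weakly informative
}
ranked = rank_genes(expression, labels, top=2)
```

If the selected genes are then used to estimate classification accuracy, repeating the ranking inside each cross-validation fold avoids scoring the classifier on samples that already influenced the gene selection.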
