Question

Finding Class-Discriminant Genes In A Microarray Experiment

6

Entering edit mode

14.5 years ago

Michael Schubert ★ 7.1k

I have a microarray expriment (Affymetrix) with 50 samples (each belonging to one of 5 disease subtypes) and about 5000 genes after filtering. Now, I would like to find a list of, let's say, 20 genes whose expression can be used to correctly classify the disease subtypes.

For now, I have created a design matrix in R that defines which subtype a sample belongs to. I then create a contrast matrix to distinguish between class A and the other classes, use eBayes to get the p-values, adjust them for multiple testing:

contrast.matrix = makeContrasts(A-(B+C+D+E)/4, levels=design)
fit = eBayes(contrasts.fit(lmFit(myFilteredGenes, design), contrast.matrix))

This is repeated for each class, so I end up with a 5000x5 matrix of p-values (where one column corresponds to how significant the current class expression compared to the average of all others is). How can I use this matrix to extract my class-discriminant genes? Or [preferred] how can I apply statistical testing on all classes instead of just 2 at a time?

Note: this is (1) for all classes at once (I'm not searching for pairwise comparison) and (2) taking the 4 highest p-values for each class might not be the smartest approach (or is it?).

microarray bioconductor r statistics • 4.0k views

ADD COMMENT • link updated 14.5 years ago by Laurent Gautier ▴ 810 • written 14.5 years ago by Michael Schubert ★ 7.1k

score 5 · Answer 1 · 2010-11-24

Finding "class discriminant genes" is sometimes referred to as "feature selection". A search engine will point you to a number of references.

RandomForests (recommended by Sean) are rather well suited for the task with the note that when building a classifier it can be really tricky to infer a mechanistic/biological explanation to it (you can also use it to perform a feature selection). For a reference in the context of microarrays, there is http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1363357/

Prior to all that, and in order to further reduce the number of features, you can also study the correlation structure for your 5000 genes and decide to group the strongly correlated ones together (the resulting groups are sometimes called "meta-genes").

Ram · Answer 2 · 2010-11-23

3

Entering edit mode

14.5 years ago

David Quigley 11k

If you're willing to consider non-parametric approaches, there is a large literature on this topic in machine learning. If you're not familiar with them, you might consider applying k-nearest-neighbor approaches (e.g. the class package in CRAN) or PAM. KNN would be particularly applicable if you know a priori that you have 5 phenotype groups. Most of these packages have cross-validation built in, as well as tools to identify features with the greatest contribution to the predictor.

As an aside, it is unlikely that you are sufficiently powered to build a classifier that will stand up to testing with an independent test data set, but that depends on the signal present in your data.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.5 years ago by David Quigley 11k

3

Entering edit mode

I often use randomForests for this type of thing, though I cannot say that I have objective evidence that it is better than any other. It gives ranked gene lists as output (importance) and can also give sample weights for each class.

ADD REPLY • link 14.5 years ago by Sean Davis 27k

0

Entering edit mode

I'm currently classifying with SVMs on the whole filtered set, if there is a way to read the predictor strengths from the model this would already be great - I'll look into it, thanks.

ADD REPLY • link 14.5 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

SVM weight extraction: http://stackoverflow.com/questions/1899008/weights-from-linear-svm-model-in-r

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 14.5 years ago by Michael Schubert ★ 7.1k