Question

Clustering With Point-Wise Error Estimates

12.1 years ago

Andrew Su 4.9k

I have a typical clustering problem with a twist. Imagine a standard data matrix with samples along one axis, features along the second axis, and numeric values in each cell. This is pretty much a data matrix one would generate from a gene expression profiling experiment. I would like to cluster both samples and features by similarity, except I also want to incorporate a second matrix that reflects the error estimates for each individual value in my data matrix. Basically my data matrix has high confidence measurements and low confidence measurements, and I want them to be weighted appropriately in the clustering.

I can imagine an algorithm that weights the calculation of the distance metric by the combined error values, and I can also imagine an approach to pool error estimates when combining nodes. My question is whether software to do this calculation already exists. Leads appreciated...
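To make the weighted-distance idea concrete, here is a minimal sketch of one possible choice, not an established method: each squared difference is divided by the combined variance of that difference, assuming the two errors are independent and given as standard deviations. The function name `error_weighted_distance` and the normalization by the number of features are illustrative assumptions.

```python
import math


def error_weighted_distance(x, y, sx, sy):
    """Distance between two profiles with per-value error estimates.

    x, y   : measured values for two samples (or two features)
    sx, sy : matching standard-deviation error estimates (assumed independent)

    Each squared difference is scaled by the variance of the difference,
    sx_j**2 + sy_j**2, so low-confidence values contribute less.
    """
    if not (len(x) == len(y) == len(sx) == len(sy)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for xj, yj, sxj, syj in zip(x, y, sx, sy):
        combined_var = sxj ** 2 + syj ** 2
        if combined_var <= 0.0:
            # A pair of zero errors would mean infinite weight; reject it here.
            raise ValueError("combined error variance must be positive")
        total += (xj - yj) ** 2 / combined_var
    # Root mean of the scaled squared differences.
    return math.sqrt(total / len(x))
```

A pairwise matrix of these distances could then be passed to any hierarchical clustering routine that accepts precomputed distances. How to pool error estimates when merging nodes is left open here.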

clustering error statistics • 2.8k views

updated 12.1 years ago by matted 7.8k • written 12.1 years ago by Andrew Su 4.9k

score 1 · Answer 1 · 2012-11-30

There are many ways to do this, depending on what assumptions you'd like to make or what other structure you impose on the problem.

The method that jumps to my mind is extending a Gaussian mixture model to handle your additional modelling layer. A principled probabilistic approach like this can handle new observed data types easily, whereas it is less clear how to extend heuristic clustering approaches (e.g. various agglomerative techniques).

In your problem, you augment each observation with an error estimate. If we assume the measurement errors are normal with zero mean, everything works out neatly, because the sum of independent normal variables is itself normal: the noise variance simply adds to the class variance. The E-step will increase the variance in the class-conditional density for each data point by its known measurement-error variance. The M-step will weight data points by precision (from the measurement error) when computing new class-conditional means and variances. This has the nice property of downweighting noisy measurements in the proper way (under the model assumptions).
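As a rough illustration, here is a minimal one-dimensional sketch of such an EM loop in plain Python. It is one way to realize the idea, not a reference implementation: it treats the noise-free value of each point as latent, so noisy points are shrunk toward the class mean instead of being weighted explicitly. At convergence the class mean satisfies the same condition as a precision-weighted mean with weights 1 / (class variance + noise variance). The function names, the starting values, and the `min_var` floor are all assumptions made for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_noisy_gmm(x, s, means, variances, weights, n_iter=200, min_var=1e-6):
    """EM for a 1-D Gaussian mixture where point i has known noise std s[i].

    Assumed model: true value ~ N(mean_k, var_k), observed = true + noise,
    noise ~ N(0, s[i]**2). Returns (means, variances, weights, resp), where
    resp[i][k] is the responsibility of class k for point i from the final
    E-step. No guard against empty classes or underflow is included.
    """
    n, k = len(x), len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: the class-conditional variance is inflated by s[i]**2.
        resp = []
        for xi, si in zip(x, s):
            dens = [weights[j] * normal_pdf(xi, means[j], variances[j] + si ** 2)
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: use the posterior of each point's noise-free value.
        for j in range(k):
            rj = [resp[i][j] for i in range(n)]
            nj = sum(rj)
            # gain is near 1 for precise points and near 0 for noisy ones.
            gain = [variances[j] / (variances[j] + si ** 2) for si in s]
            post_mean = [means[j] + g * (xi - means[j]) for g, xi in zip(gain, x)]
            post_var = [variances[j] * (1.0 - g) for g in gain]
            new_mean = sum(r * m for r, m in zip(rj, post_mean)) / nj
            new_var = sum(r * ((m - new_mean) ** 2 + c)
                          for r, m, c in zip(rj, post_mean, post_var)) / nj
            means[j] = new_mean
            variances[j] = max(new_var, min_var)
            weights[j] = nj / n
    return means, variances, weights, resp
```

Calling `fit_noisy_gmm(x, s, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])` on a list of values `x` with matching error estimates `s` would fit two classes. For a real expression matrix this would need to be extended to multiple dimensions, with a per-feature error for every sample.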

If your question was specifically about existing software, then I don't know of any prepackaged tools that do exactly this. But hopefully this is helpful if you'd like to roll your own or extend an existing GMM EM library (which I think should be straightforward).

score 0 · Answer 2 · 2012-11-30

12.1 years ago

Josh Herr 5.8k

I like USEARCH. Using UCLUST/UCHIME, you can tweak parameters to allow for sequencing errors (see parameter tuning). I'm not sure if it will work for you, but it's worth a look.

Sorry, I was assuming you have sequence data in your matrix, but on second read, perhaps you don't. Do you have an analysis preference, say in R?


Apologies, I was referring to clustering in a numeric data matrix, not sequencing data. I've edited my question above to hopefully make that clear. Analysis platform can be anything -- R would probably be ideal though...

12.1 years ago by Andrew Su 4.9k