Clustering With Point-Wise Error Estimates
12.0 years ago
Andrew Su 4.9k

I have a typical clustering problem with a twist. Imagine a standard data matrix with samples along one axis, features along the second axis, and numeric values in each cell. This is pretty much a data matrix one would generate from a gene expression profiling experiment. I would like to cluster both samples and features by similarity, except I also want to incorporate a second matrix that reflects the error estimates for each individual value in my data matrix. Basically my data matrix has high confidence measurements and low confidence measurements, and I want them to be weighted appropriately in the clustering.

I can imagine an algorithm that weights the calculation of the distance metric by the combined error values, and I can also imagine an approach to pool error estimates when combining nodes. My question is whether software to do this calculation already exists. Leads appreciated...

clustering error statistics • 2.8k views
12.0 years ago
matted 7.8k

There are many ways to do this, depending on what assumptions you'd like to make or what other structure you impose on the problem.

The method that jumps to my mind is extending a Gaussian mixture model to handle your additional modelling layer. A principled probabilistic approach like this can absorb new kinds of observed data easily, whereas heuristic clustering approaches (e.g. various agglomerative techniques) would be harder to extend in a consistent way.

In your problem, you augment each observation with an error estimate. If we assume the measurement errors are normal with zero mean, everything is great because of the nice properties of normal distributions: the sum of independent normals is normal, so the variances simply add. The E-step will increase the variance of the class-conditional density for each data point by that point's known measurement-error variance. The M-step will weight data points by precision (from the measurement error) when computing new class-conditional means and variances. This has the nice property of downweighting noisy measurements in the proper way (under the model assumptions).
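To make the idea concrete, here is a minimal one-dimensional sketch in Python with NumPy. It is only an illustration under stated assumptions, not a tested package: the function name `fit_gmm_with_errors` is made up, the error variances `s2` are taken as known and strictly positive, and the M-step treats the noise-free value of each point as a latent variable (the formulation sometimes called extreme deconvolution), which is one way to realise the precision weighting described above.

```python
import numpy as np

def fit_gmm_with_errors(x, s2, k=2, n_iter=200):
    """EM for a 1-D Gaussian mixture where point i has known noise variance s2[i].

    Hypothetical helper for illustration; returns (weights, means, variances,
    responsibilities). The variances are the intrinsic (noise-free) ones.
    """
    x = np.asarray(x, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    # Deterministic start: spread the means over the quantiles of the data.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: point i is scored under N(mu_k, var_k + s2_i),
        # i.e. the component variance inflated by the point's own noise.
        total = var[None, :] + s2[:, None]                  # shape (n, k)
        resid = x[:, None] - mu[None, :]
        logp = np.log(pi)[None, :] - 0.5 * (
            np.log(2.0 * np.pi * total) + resid ** 2 / total
        )
        logp -= logp.max(axis=1, keepdims=True)             # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # Posterior mean and variance of the noise-free value of each point.
        # Noisy points (large s2) are shrunk strongly towards the cluster mean.
        shrink = var[None, :] / total
        b = mu[None, :] + shrink * resid
        B = var[None, :] * (1.0 - shrink)
        # M-step: standard GMM updates applied to the denoised values.
        nk = r.sum(axis=0) + 1e-12
        pi = nk / nk.sum()
        mu = (r * b).sum(axis=0) / nk
        var = (r * ((b - mu[None, :]) ** 2 + B)).sum(axis=0) / nk
        var = np.maximum(var, 1e-8)                         # avoid collapse to zero
    return pi, mu, var, r
```

Calling `fit_gmm_with_errors(x, s2, k=2)` returns soft cluster assignments in the last element. For a full expression matrix the same updates would need to be written with mean vectors and covariance matrices, which this sketch does not attempt.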

If your question was specifically about existing software, then I don't know any prepackaged things that do exactly this. But hopefully this is helpful if you'd like to roll your own or extend an existing GMM EM library (which I think should be straightforward).


Thank you, yes, helpful. Though admittedly I was looking for a lighter-weight solution that would essentially be a hack on simple hierarchical clustering. But I can see how a GMM EM solution would be a more elegant way of handling errors...
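For reference, one way such a hack might look is sketched below in Python with NumPy and SciPy. This is an assumption-laden illustration rather than an established method: the function name `error_weighted_distances` is made up, the weight `1 / (s2_i + s2_j)` per feature is just one plausible way to combine the two error estimates, and all error variances must be strictly positive. It only changes the distance metric; pooling error estimates when nodes are merged is not handled.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def error_weighted_distances(X, S2):
    """Pairwise sample distances where each feature is weighted by precision.

    X  : (n_samples, n_features) data matrix
    S2 : matching matrix of error variances (assumed > 0)
    Hypothetical helper for illustration.
    """
    X = np.asarray(X, dtype=float)
    S2 = np.asarray(S2, dtype=float)
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2        # (n, n, p) squared differences
    w = 1.0 / (S2[:, None, :] + S2[None, :, :])         # precision of each difference
    # Weighted root-mean-square difference over features.
    D = np.sqrt((w * diff2).sum(axis=2) / w.sum(axis=2))
    np.fill_diagonal(D, 0.0)
    return D

# Toy example: the third feature differs a lot but is very noisy.
X = np.array([[0.0, 0.0, 0.0],
              [0.1, 0.0, 10.0],
              [5.0, 5.0, 0.0],
              [5.1, 5.0, 10.0]])
S2 = np.array([[0.01, 0.01, 100.0]] * 4)

D = error_weighted_distances(X, S2)
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Passing `X.T` and `S2.T` clusters features instead of samples. In the toy data, plain Euclidean distance would be dominated by the noisy third feature, whereas the weighted distance largely ignores it.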

12.0 years ago
Josh Herr 5.8k

I like USEARCH. Using UCLUST/UCHIME, you can tweak parameters to allow for sequencing errors (see parameter tuning). I'm not sure if it will work for you, but it's worth a look.

Sorry, I was assuming you have sequence data in your matrix, but on second read, perhaps you don't. Do you have an analysis preference, say in R?


Apologies, I was referring to clustering in a numeric data matrix, not sequencing data. I've edited my question above to hopefully make that clear. Analysis platform can be anything -- R would probably be ideal though...
