Informational and computational limits of clustering
Nati Srebro
Many clustering formulations can be shown to be NP-hard; that is,
finding the "optimal" clustering of a given data set is
computationally hard.
However, when there is no "real" clustering in the data, the optimal
clustering is not of great interest. When there is a very distinct
clustering in the data and enough data is available, it is often
rather easy to identify this clustering (as both empirical and some
theoretical results suggest). The amount of data needed to identify
the clustering efficiently can, however, be greater than the amount of
data needed to identify it statistically, i.e., when computational
resources are not a concern.
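The "distinct clustering is easy to find" phenomenon can be illustrated with a toy sketch (not the authors' construction): on a mixture of two well-separated Gaussians, even a few iterations of Lloyd's k-means recover the cluster structure. The data, separation, and iteration count below are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated 1-D Gaussian clusters.
x = np.concatenate([rng.normal(-5.0, 1.0, 200),
                    rng.normal(5.0, 1.0, 200)])

# A few iterations of Lloyd's k-means with k = 2,
# initialized at the extremes of the data.
centers = np.array([x.min(), x.max()])
for _ in range(10):
    # Assign each point to its nearest center.
    labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    # Recompute each center as the mean of its assigned points.
    centers = np.array([x[labels == k].mean() for k in range(2)])

print(centers)  # close to the true means -5 and +5
```

With such large separation the recovered centers land near the true component means; as the separation shrinks, or the sample size drops, recovery becomes statistically and computationally harder, which is the regime the talk addresses.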
In this talk I will briefly describe work (with Sam Roweis and Gregory
Shakhnarovich) on identifying the statistical and computational limits
of clustering, i.e., how much data is needed to statistically identify
clusters in a source distribution, and how much extra data is needed
to do so efficiently. I will discuss how these questions relate to the
question of deciding when an observed clustering is "real", or at
least statistically significant.