Informational and computational limits of clustering
Nati Srebro

Many clustering formulations can be shown to be NP-hard: it is hard to find the "optimal" clustering of a given data set. However, when there is no "real" clustering in the data, the optimal clustering is not of great interest. When the data contain a very distinct clustering and enough data is available, identifying this clustering is often rather easy, as both empirical experience and some theoretical results suggest. The amount of data needed to identify the clustering efficiently can, however, exceed the amount of data that suffices statistically, i.e. when computational resources are not a concern. In this talk I will briefly describe work with Sam Roweis and Gregory Shakhnarovich on identifying the statistical and computational limits of clustering: how much data is needed to statistically identify clusters in a source distribution, and how much extra data is needed to do so efficiently. I will discuss how these questions relate to deciding when an observed clustering is "real", or at least statistically significant.
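
The following is a minimal illustrative sketch, not the authors' experimental setup: it plants a two-component Gaussian mixture (the dimension d = 20, separation 3.0, and the use of scikit-learn's k-means as a stand-in heuristic are all assumptions made here for illustration) and checks how well the planted clustering is recovered as the sample size grows, loosely echoing the point that with a distinct clustering and enough data, recovery becomes easy.

```python
# Illustrative sketch (assumed parameters, not from the talk): a planted
# two-Gaussian mixture in d dimensions. With a distinct separation and
# enough samples, a standard heuristic (k-means) tends to recover the
# planted clustering well; with few samples, recovery is typically worse.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d, sep = 20, 3.0                      # dimension and center separation (assumed values)
mu = np.zeros(d)
mu[0] = sep                           # the two centers differ only in the first coordinate

def recovery_accuracy(n):
    """Fraction of points whose k-means label matches the planted label."""
    labels = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + np.outer(labels, mu)
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    agree = np.mean(pred == labels)
    return max(agree, 1 - agree)      # account for the arbitrary label permutation

for n in (10, 50, 200, 1000):
    print(f"n = {n:5d}   recovery accuracy ~ {recovery_accuracy(n):.2f}")
```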