NIPS 2005 workshop on

Theoretical Foundations of Clustering


Organizers: Shai Ben-David, Ulrike von Luxburg, John Shawe-Taylor and Naftali Tishby


Clustering is one of the most widely used techniques for exploratory data analysis. Across all disciplines, from social sciences over biology to computer science, people try to get a first intuition about their data by identifying meaningful groups among the data points. In the past five decades, a wide variety of clustering algorithms have been developed and applied to a wide range of practical problems. Despite this large number of algorithms and applications, the goal of clustering and its proper interpretation remains fuzzy and vague. There are in fact many different problems that are clustered together under this single term, from quantization with low distortion for compression, through various techniques for graph partitioning whose goals are not fully specified, to methods for revealing hidden structure and unobserved features in complex data. We are clearly not talking about a single well defined problem. Moreover, the theoretical foundations of clustering seem to be distressingly meager, covering only some sub-domains and failing to address some of the most basic general aspects of the area. There is not even an agreement among the researchers on the correct questions to pose, let alone which tools and analysis techniques should be used to answer those questions. In our opinion there is an urgent need to initiate a concerted discussion on these issues, in order to move towards a consolidation of the theoretical basis for - at least some of the aspects of - clustering. One prospective benefit of building a theoretical framework for clustering may come from enabling the transfer of tools developed in other related domains, such as machine learning and information theory, where the usefulness of having a general mathematical framework have been impressively demonstrated. We have the impression that recently many researchers have become aware of this need and agree on the importance of these issues.

Questions we wish to address:

  1. What is clustering? How can it be defined and how can we sort the different types of clustering and their goals? In particular:
  2. How should prior knowledge be encoded? As a pair-wise similarity/distance function over domain points? As a set of relevant features? Should data be embedded in some richer structure (Hilbert space, topology) ?
  3. Is there a principled way to measure the quality of a clustering on particular data set?
  4. Is there a principled way to measure the quality of a clustering algorithm?
  5. What are principled and meaningful ways of measuring the similarity (or degree of agreement) between different clusterings?
  6. Can one distinguish “clusterable” data from “structureless” data?
  7. What are the tools we should try to import from other areas such as classification prediction, density estimation, data compression, computational geometry, other relevant areas?

Structure of the workshop:

Our one-day workshop will consist of invited talks, contributed talks, and extensive discussions.

Confirmed invited speakers:

Call for contributions:

We invite presentations addressing one or several of the questions raised above. To keep the workshop lively we intend to keep the individual presentations short, at most 15 minutes.We welcome presentations about work in all "stages of completion", ranging from completed work over work in progress to discussing potential directions of future research. In particular we encourage position papers. We would like to stress that the focus of this workshop is on foundations of clustering. We are not interested in contributions about "yet another ad-hoc clustering algorithm".

Please submit an extended abstract (at most two pages) summarizing your potential contribution to

*** The deadline is the 30th October 2005. ***

The organizers will review all submissions. You will be notified by November 11 whether your contribution is accepted.