NIPS 2005 workshop on

Theoretical Foundations of Clustering

Organizers: Shai Ben-David, Ulrike von Luxburg, John Shawe-Taylor and Naftali Tishby

Background:
Clustering is one of the most widely used techniques for exploratory data analysis. Across all disciplines, from social sciences over biology to computer science, people try to get a first intuition about their data by identifying meaningful groups among the data points. In the past five decades, a wide variety of clustering algorithms have been developed and applied to a wide range of practical problems. Despite this large number of algorithms and applications, the goal of clustering and its proper interpretation remains fuzzy and vague. There are in fact many different problems that are clustered together under this single term, from quantization with low distortion for compression, through various techniques for graph partitioning whose goals are not fully specified, to methods for revealing hidden structure and unobserved features in complex data. We are clearly not talking about a single well defined problem. Moreover, the theoretical foundations of clustering seem to be distressingly meager, covering only some sub-domains and failing to address some of the most basic general aspects of the area. There is not even an agreement among the researchers on the correct questions to pose, let alone which tools and analysis techniques should be used to answer those questions. In our opinion there is an urgent need to initiate a concerted discussion on these issues, in order to move towards a consolidation of the theoretical basis for - at least some of the aspects of - clustering. One prospective benefit of building a theoretical framework for clustering may come from enabling the transfer of tools developed in other related domains, such as machine learning and information theory, where the usefulness of having a general mathematical framework have been impressively demonstrated. We have the impression that recently many researchers have become aware of this need and agree on the importance of these issues.

Questions we wish to address:

What is clustering? How can it be defined and how can we sort the different types of clustering and their goals? In particular:

Is the main purpose to use the partition to discover new features in the data?
Or the other way around, is the main purpose to simplify our data by building groups, thus getting rid of unimportant information?
Is clustering just data compression?
Is clustering just estimating modes of a density?
Is clustering related to human perception?
Can one come up with a meaningful taxonomy of clustering tasks?
Can we formulate the intuitive notion of "revealing hidden structure and properties"?

How should prior knowledge be encoded? As a pair-wise similarity/distance function over domain points? As a set of relevant features? Should data be embedded in some richer structure (Hilbert space, topology) ?
Is there a principled way to measure the quality of a clustering on particular data set?

Can every clustering task be expressed as an optimization of some explicit readily-computable associated objective cost function?
Can stability be considered a first principle for meaningful clustering?

Is there a principled way to measure the quality of a clustering algorithm?

Necessary conditions
Can we come up with sufficient conditions for reasonable clustering?
Stability conditions
Richness conditions
What type of performance guarantees can one hope to provide?

What are principled and meaningful ways of measuring the similarity (or degree of agreement) between different clusterings?
Can one distinguish “clusterable” data from “structureless” data?
What are the tools we should try to import from other areas such as classification prediction, density estimation, data compression, computational geometry, other relevant areas?

Structure of the workshop:

Our one-day workshop will consist of invited talks, contributed talks, and extensive discussions.

Confirmed invited speakers:

Avrim Blum
Joachim Buhmann
Joydeep Gosh
Marina Meila

Call for contributions:

We invite presentations addressing one or several of the questions raised above. To keep the workshop lively we intend to keep the individual presentations short, at most 15 minutes.We welcome presentations about work in all "stages of completion", ranging from completed work over work in progress to discussing potential directions of future research. In particular we encourage position papers. We would like to stress that the focus of this workshop is on foundations of clustering. We are not interested in contributions about "yet another ad-hoc clustering algorithm".

Please submit an extended abstract (at most two pages) summarizing your potential contribution to clustering_workshop@ipsi.fraunhofer.de.

*** The deadline is the 30th October 2005. ***

The organizers will review all submissions. You will be notified by November 11 whether your contribution is accepted.