An MDL Framework for Data Clustering

Petri Myllymaki
Complex Systems Computation Group (CoSCo)
Helsinki Institute for Information Technology (HIIT)
University of Helsinki & Helsinki University of Technology
P.O. Box 9800, FIN-02015 HUT, Finland
petri.myllymaki@cs.helsinki.fi
http://cosco.hiit.fi/

Clustering is one of the central concepts in the field of unsupervised data analysis. Unfortunately, it is also a controversial concept, and its precise meaning varies a great deal between different scientific disciplines. In all cases, however, the common objective is to find a structural representation of the data by grouping (in some sense) similar data items together. In this work we distinguish the actual process of grouping the data items from the more fundamental issue of defining a criterion for deciding which data items belong together and which do not. To study the latter problem, we regard clustering as a data assignment problem in which the goal is to partition the data into several non-hierarchical groups of items. To solve this problem, we suggest an information-theoretic framework based on Rissanen's minimum description length (MDL) principle. Intuitively, the idea is to group together those data items that compress well together, so that the total code length over all the data groups is minimized. Since efficient compression is possible only when one has discovered regularities that are common to all the members of a group, this approach yields an implicitly defined similarity metric between the data items. Formally, the global code-length criterion to be minimized is defined using the universal normalized maximum likelihood (NML) code, which has been shown to produce an optimal compression rate in an explicitly defined sense.
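To make the code-length criterion concrete, the following sketch computes the NML code length of a single discrete data column under a multinomial model: the negative maximized log-likelihood plus the logarithm of the normalizing sum taken over all possible data sets of the same size. This is an illustration only (the function names are ours, not the paper's), and the brute-force normalizer is feasible only for very small alphabets and sample sizes.

```python
import math
from itertools import product  # brute-force enumeration of count vectors


def multinomial_nml_normalizer(K, n):
    """Parametric complexity C(K, n) of a K-valued multinomial on n items,
    computed by brute force over all count vectors (h1, ..., hK) summing
    to n. Feasible only for tiny K and n; an efficient recursive formula
    is the subject of the paper."""
    total = 0.0
    for counts in product(range(n + 1), repeat=K):
        if sum(counts) != n:
            continue
        # multinomial coefficient n! / (h1! ... hK!)
        coef = math.factorial(n)
        for h in counts:
            coef //= math.factorial(h)
        # maximized likelihood prod_k (h_k / n)^{h_k}, with 0^0 = 1
        lik = 1.0
        for h in counts:
            if h > 0:
                lik *= (h / n) ** h
        total += coef * lik
    return total


def nml_code_length(counts):
    """NML code length (in nats) of a discrete column with the given value
    counts: -log(maximized likelihood) + log C(K, n)."""
    n = sum(counts)
    K = len(counts)
    neg_log_ml = -sum(h * math.log(h / n) for h in counts if h > 0)
    return neg_log_ml + math.log(multinomial_nml_normalizer(K, n))
```

In the clustering setting, the quantity to be minimized is the sum of such code lengths over the data groups, so items end up in the same group exactly when they compress well together.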
The number of groups is not assumed to be known in advance, and the problem of deciding the optimal number is formalized as part of the same theoretical framework. As a concrete example of the suggested framework, we instantiate the general data clustering approach for the case of discrete variables with a local independence assumption between them, and present a recursive formula for efficient computation of the NML code length in this case. The result is of practical importance, as the amount of discrete data is increasing rapidly (in the form of WWW pages, WWW log data, questionnaires, and so on). Although the approach can easily be extended to more complex cases, we argue that the local independence model is important because the resulting clusters are then easy to analyze. The local independence model is also expressive enough in itself, as arbitrarily complex distributions can be modeled by adding more clusters. Finally, we present empirical results that demonstrate the validity of the suggested approach.
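The recursive formula itself is given in the paper; as an illustration of the kind of recursion involved (a sketch, not necessarily the paper's exact formula), the multinomial NML normalizer C(K, n) satisfies C_K(n) = sum_{r=0}^{n} binom(n, r) (r/n)^r ((n-r)/n)^{n-r} C_{K-1}(r), with C_1(n) = 1, obtained by splitting the K values into the first K-1 values and the last one. A memoized Python sketch:

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def nml_normalizer(K, n):
    """Multinomial NML parametric complexity C(K, n) via the recursion
    C_K(n) = sum_r binom(n, r) (r/n)^r ((n-r)/n)^(n-r) C_{K-1}(r),
    which follows from splitting the K values into two groups.
    Convention 0^0 = 1, so C(K, 0) = 1 for all K."""
    if n == 0 or K == 1:
        return 1.0
    total = 0.0
    for r in range(n + 1):  # r items take one of the first K-1 values
        term = math.comb(n, r)
        if r > 0:
            term *= (r / n) ** r
        if n - r > 0:
            term *= ((n - r) / n) ** (n - r)
        total += term * nml_normalizer(K - 1, r)
    return total
```

With memoization this touches O(K n) subproblems and O(K n^2) arithmetic operations in total, compared to the exponential cost of enumerating all count vectors directly.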