Lecture 2: Similarity and Dissimilarity Functions, Embeddings, and Multidimensional Scaling
Keywords and background reading (note that some of the references are quite advanced!):
General definitions:
- Similarities, dissimilarities, and their properties (summary pp. 13-16 in Luxburg04, and references therein)
- Metrics, norms, scalar products, kernels (see any functional analysis book)
- Metric spaces, Banach spaces, Euclidean spaces, Hilbert spaces (see any functional analysis book)
Examples of similarity and dissimilarity functions:
[Here are some more examples I did not talk about in the lecture:
- distances between graphs or other structured objects: convolution kernels (Haussler, link to follow), label sequence kernels (KashimaTsudaInokuchi04)
- distances between probability distributions: Kullback-Leibler divergence (any information theory book; see the sketch after this list), correlation coefficient (any statistics book)
]
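As a concrete instance of a dissimilarity that is not a metric, here is a rough matlab sketch of the Kullback-Leibler divergence between two discrete distributions (the function name is mine, and I assume q > 0 wherever p > 0):

    % Kullback-Leibler divergence KL(p || q) of two discrete probability
    % vectors p and q. Note it is not a metric: it is not symmetric and
    % does not satisfy the triangle inequality.
    function d = kl_divergence(p, q)
        idx = p > 0;                               % terms with p(i) = 0 contribute 0
        d = sum(p(idx) .* log(p(idx) ./ q(idx)));
    end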
Multidimensional Scaling: embedding of data points in a Euclidean space.
There are at least two complete books on this topic: Cox and Cox: Multidimensional Scaling. Chapman & Hall, 2001; Borg and Groenen: Modern Multidimensional Scaling. Springer, 2005.
Here are some matlab demos to see how it works:
- a very simple implementation: demo_mds.m
- matlab's own implementation (subject to copyright!!!): cmdscale.m
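If you cannot run the demo files, here is a minimal sketch of the classical MDS computation (my own toy version, not the demo_mds.m above; save it as mds_sketch.m):

    % Classical MDS sketch: embed n points in R^k, given an n x n
    % (Euclidean) distance matrix D.
    function X = mds_sketch(D, k)
        n = size(D, 1);
        J = eye(n) - ones(n) / n;              % centering matrix
        S = -0.5 * J * (D .^ 2) * J;           % recover the scalar product matrix
        [V, L] = eig((S + S') / 2);            % symmetrize against round-off
        [lambda, idx] = sort(diag(L), 'descend');
        V = V(:, idx);
        % scale the top k eigenvectors; the eigenvalues are nonnegative
        % exactly when D is Euclidean (cf. the puzzles below)
        X = V(:, 1:k) * diag(sqrt(max(lambda(1:k), 0)));
    end

Calling X = mds_sketch(D, 2), for example, gives a 2-dimensional configuration whose pairwise distances approximate the entries of D.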
[Other embeddings we did not talk about in the lecture:
- Embedding finite metric spaces into L_p spaces: check out the papers on the homepage of Jiri Matousek
- an infinite metric space into a Hilbert space: Schönberg's results (the original papers by Schönberg: On certain metric spaces arising from Euclidean spaces by a change of metric and their embedding in Hilbert space. Ann. of Math. 38(4), 787-793, 1937. And: Metric Spaces and Positive Definite Functions. TAMS 44, 522-536, 1938.)
- an arbitrary metric space into a Banach space: Kuratowski embedding (see HeinBousquet03) and Arens-Eells embedding (see LuxburgBousquet03)
- a theorem similar to Johnson-Lindenstrauss, but for arbitrary metrics: Bourgain's theorem
]
Puzzles and exercises:
- We have seen that MDS only works if the distance matrix is Euclidean. How can we test whether a distance matrix is Euclidean? (A numerical starting point is sketched after this list.)
- What goes wrong in the MDS algorithm if the distance matrix is not Euclidean?
- What can we do to approximate MDS in case the matrix is not Euclidean?
- Prove that in general we embed into an (n-1)-dimensional space (where n is the number of data points). Why not n-dimensional?
- If the original data lies in a k-dimensional space (k < n), the image will also be a k-dimensional subspace of R^n. Why? How can we see this in the algorithm?
- We have seen that we are free to choose the origin of the coordinate system, and many people choose the center of the data as the origin, that is, 0 = (1/n) sum_i x_i. Let S = (<x_i, x_j>)_{ij} be the scalar product matrix of n points x_1, ..., x_n in R^n. Now shift the coordinate system so that the center of the data is the origin. What does S look like now?
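For experimenting with the first puzzle, here is a rough matlab sketch of one possible test (the function name and the tolerance handling are my own choices):

    % Test whether an n x n distance matrix D is Euclidean. This holds
    % if and only if S = -0.5 * J * (D.^2) * J is positive semidefinite,
    % where J is the centering matrix from the MDS algorithm.
    function isEuc = is_euclidean(D, tol)
        if nargin < 2, tol = 1e-10; end    % slack for round-off errors
        n = size(D, 1);
        J = eye(n) - ones(n) / n;
        S = -0.5 * J * (D .^ 2) * J;
        lambda = eig((S + S') / 2);        % eigenvalues of the symmetrized S
        isEuc = all(lambda >= -tol);       % positive semidefinite up to tolerance
    end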