Course on Mathematical Foundations of Machine Learning

Winter 2007/08, Mathematisches Institut, Uni Tübingen

Ulrike von Luxburg



Overview

English announcement, list of topics and references: pdf file.
Deutsche Ankündigung und Themenübersicht: hier.

Literature

There is not a single book which covers all of the course. Here are some books which might be helpful, more details will be given in the individual lectures. As a backup on probability theory:

Script

Protokolle von allen Vorlesungen gibt es hier.

Summary of all lectures

Lecture 1 (17.10.07): Introduction
Literature: Devroye/Györfi/Lugosi: Probabilistic theory of pattern recognition (first pages of Sec. 2)

Lecture 2 (24.10.07): Formalization of binary classification and consistency in a probabilistic setting
Literature:
Devroye/Györfi/Lugosi: Probabilistic theory of pattern recognition (Sec. 2; Beginning of Sec. 6)
Vapnik: Nature of statistical learning theory. (Beginning of Sec. 2)

For background reading in probability theory:
Shiryaev: Probability
Jacod/Protter: Probability essentials


Lecture 3 (31.10.07): Consistency of plug-in classifiers
Literature: Devroye/Györfi/Lugosi: Probabilistic theory of pattern recognition (Sec. 6)

Lecture 4 (7.11.07): kNN classifier and the theorem of Stone
Literature: Devroye/Györfi/Lugosi: Probabilistic theory of pattern recognition (Sec. 6)

Lecture 5 (14.11.07): Consistency of kNN classifiers
Literature: Devroye/Györfi/Lugosi: Probabilistic theory of pattern recognition (Sec. 6)

Lecture 6 (28.11.07): Negative results: No free lunch theorems
Literature:
For the NFL theorems: Ho, Pepyne: Simple explanation of the No Free Lunch Theorem and its implications. 2002. Here is the pdf.
For the slow rates of convergence: Devroye/Györfi/Lugosi: Probabilistic theory of pattern recognition (Sec. 7)

Lecture 7 (5.12.07): Empirical risk minimization and concentration inequalities
At this day, I will be on a conference in Canada. Instead, Petra Philips will jump in and give the lecture.
Literature:
A nice overview paper is: Bousquet, Boucheron, Lugosi: Introduction to statistical learning theory. Here is the pdf.
Devroye/Györfi/Lugosi: Probabilistic theory of pattern recognition (Sec. 12)
Vapnik: Nature of statistical learning theory. (Sec. 2)

Lecture 8 (9.1.2008): The standard generalization bounds using shattering coefficients and VC dimension
Literature: As last lecture.

Lecture 9 (16.1.2008): Generalization bounds using VC dimension, sample compression bounds; Support Vector Machines
Literature:
A nice proof of the Sauer-Shelah lemma can be found in Shai Ben-David's lecture notes (Section 5).
The sample compression bound has been published in Warmuth/Littlestone: Relating data compression and learnability , a very short summary of the 0-error case can be found in Section 4.4 of Cristianini/Shawe-Taylor: Support vector machines.
The PAC-compression bound is due to Ralf Herbrich, here.
Support vector machines are treated in several books, for example: Schölkopf and Smola: Learning with Kernels.

Lecture 10 (23.1.2008): Basics of convex optimization and the dual SVM formulations
Literature:
For convex optimization: Boyd/Vandenberghe: Convex Optimziation, mainly Chapter 5. Book available online as pdf.
For SVMs: Schölkopf and Smola: Learning with Kernels, Chapter 6 (Optimization theory) and Chapter 7 (SVMs)

Lecture 11 (6.2.2008): Kernels and non-linear SVMs
Literature:
A gentle introduction to RKHS can be found in Chapter 2 of Schölkopf and Smola: Learning with Kernels.
An extensive overview over properties of RKHS used in machine learning is in M. Hein and O. Bousquet: Kernels, Associated Structures and Generalizations. 2004
Examples for the construction of kernels on different types of input data see Chapters 9 - 12 of Shawe-Taylor, Cristianini: Kernel methods for pattern analysis and Chapter 13 of Schölkopf and Smola: Learning with Kernels.

Software: A nice online demo of SVMs can be found here or here. here.
A very general and good SVM implementation with interfaces from many common languages (matlab, python, R, C, ...) is libSVM by Chih Jen Lin. This is the one I recommend to use in the beginning.

Lecture 12 (13.2.2008): Generalization bounds and consistency of SVMs
At the end of this lecture, I will also give a short overview over the MPI in Tuebingen and what is going on in machine learning.

Literature:
For the large margin and sparsity bound: e.g. Bartlett, Shawe-Taylor: Generalization performance of Support Vector Machines and other Pattern Classifiers. 1998
For the trace bound: Bartlett and Mendelson: Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. 2001.
For universal kernels and consistency of the SVM:
I. Steinwart, On the Influence of the Kernel on the Consistency of Support Vector Machines. Journal of Machine Learning Research, Vol. 2, pp. 67-93, 2001
For consistency of regularization in general:
I. Steinwart, Consistency of Support Vector Machines and other Regularized Kernel Machines. IEEE Transactions on Information Theory, Vol. 51, pp. 128-142, 2005.