Course on Mathematical Foundations of Machine Learning
Winter 2007/08, Mathematisches Institut, Uni Tübingen
Ulrike von Luxburg
Overview
English announcement, list of topics and references:
pdf file.
Deutsche Ankündigung und Themenübersicht:
hier.
Literature
There is not a single book which covers all of the course. Here are
some books which might be helpful, more details will be given in the
individual lectures.
- L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recog-
nition. Springer, New York, 1996.
- T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning.
Springer, New York, 2001.
- V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New
York, 1995.
- B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge,
MA, 2002.
- A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical
Processes. Springer, New York, 1996.
- R. Dudley. Uniform central limit theorems. Cambridge
University Press, 1999.
-
- L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A distribution-free theory of
non-parametric regression. Springer, New York, 2002.
- L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation.
Springer, New York, 2001.
As a backup on probability theory:
- Shiryaev: Probability. Springer, 1996.
- Jacod, Protter: Probability essentials. Springer, 2004.
Script
Protokolle von allen Vorlesungen gibt es hier.
Summary of all lectures
Lecture 1 (17.10.07): Introduction
- Overview: what is machine learning? Examples, terms and
conditions, ideas, ...
- Formal setup of binary classification
- Bayes classifier: definition and examples
Literature: Devroye/Györfi/Lugosi: Probabilistic theory of pattern
recognition (first pages of Sec. 2)
Lecture 2 (24.10.07): Formalization of binary classification and
consistency in a probabilistic setting
- Recap: joint distributions, marginal distributions, conditional
distributions, conditional expectations
- Definition and properties of the regression function
- Can decompose joint distribution of (X,Y) in marginal P_X and
regression function
- Theorem: The Bayes classifier is optimal.
- Definition: different notions of consistency of classifiers
Literature:
Devroye/Györfi/Lugosi: Probabilistic theory of pattern
recognition (Sec. 2; Beginning of Sec. 6)
Vapnik: Nature of statistical learning theory. (Beginning of Sec. 2)
For background reading in probability theory:
Shiryaev: Probability
Jacod/Protter: Probability essentials
Lecture 3 (31.10.07): Consistency of plug-in classifiers
- Recap: types of convergence of random variables
- Consistency: examples and counterexamples
- Plug-in classifiers: definition, first bounds on the risk
- Theorem: Plug-in classifiers based on partitioning rules are
consistent.
Literature: Devroye/Györfi/Lugosi: Probabilistic theory of pattern
recognition (Sec. 6)
Lecture 4 (7.11.07): kNN classifier and the theorem of Stone
- Application of consistency theorem for plug-in classifiers:
cubic histogram rules
- The k-nearest neighbor classifier: definition
- Examples: The k-NN classifier for fixed k is not consistent.
- Theorem of Stone
Literature: Devroye/Györfi/Lugosi: Probabilistic theory of pattern
recognition (Sec. 6)
Lecture 5 (14.11.07): Consistency of kNN classifiers
- Theorem: The k-NN classifier is consistent if k grows
"suitably".
- Matlab demo of the kNN classifier. Demo can be downloaded here (9 MB, as
it contains the USPS data set...)
Literature: Devroye/Györfi/Lugosi: Probabilistic theory of pattern
recognition (Sec. 6)
Lecture 6 (28.11.07): Negative results: No free lunch theorems
- No free lunch theorems in a discrete setting: averaged over all
possible distributions, all classifiers perform the same.
- Arbitrary bad finite sample performance: For each classification rule there
exists a distribution with Bayes error 0, such that the risk of the
classifier on n training points
has error 1/2 - epsilon.
- Arbitrary slow rates of convergence: Even for a consistent
classifier there always exists a distribution such that the risk
converges arbitrarily slow.
Literature:
For the NFL theorems: Ho, Pepyne: Simple explanation of the No Free Lunch
Theorem and its implications. 2002.
Here is the pdf.
For the slow rates of convergence:
Devroye/Györfi/Lugosi: Probabilistic theory of pattern
recognition (Sec. 7)
Lecture 7 (5.12.07): Empirical risk minimization and concentration
inequalities
At this day, I will be on a conference in Canada. Instead,
Petra Philips will jump in and give the
lecture.
- Formalization of the ERM principle
- Decompositions of the risk (estimation/approximation,
bias/variance, ... )
- Concentration inequalities of Hoeffding and McDiarmid
- Generalization bound for a fixed function
- Generalization bound for a finite function class
Literature:
A nice overview paper is: Bousquet, Boucheron, Lugosi: Introduction to statistical learning
theory.
Here is the pdf.
Devroye/Györfi/Lugosi: Probabilistic theory of pattern
recognition (Sec. 12)
Vapnik: Nature of statistical learning theory. (Sec. 2)
Lecture 8 (9.1.2008): The standard generalization bounds using
shattering coefficients and VC dimension
- The shattering coefficient (growth function)
- Generalization bound using the shattering coefficient
- The VC dimension: definition and many examples
Literature: As last lecture.
Lecture 9 (16.1.2008): Generalization bounds using VC
dimension, sample compression bounds; Support Vector Machines
- The Sauer-Shelah lemma
- Generalization bound using the VC dimension
- [Sample compression bounds, skipped
for time reasons, for references see below
- Support vector machines: large margin classifier, hard and soft
margin SVM (primal)
Literature:
A nice proof of the Sauer-Shelah lemma can be found in
Shai Ben-David's lecture notes (Section 5).
The sample compression bound has been published in
Warmuth/Littlestone: Relating data compression and learnability , a
very short summary of the 0-error case can be found in Section 4.4
of
Cristianini/Shawe-Taylor: Support vector machines.
The PAC-compression bound is due to Ralf Herbrich,
here.
Support vector machines are treated in several books, for example:
Schölkopf and Smola: Learning with Kernels.
Lecture 10 (23.1.2008): Basics of convex optimization and the dual
SVM formulations
- Convex optimization: basic definitions, duality theory,
Lagrangian, duality gap, strong duality, constraint qualifications ...
- Derivation of the dual problems of hard and soft margin SVM
Literature:
For convex optimization: Boyd/Vandenberghe: Convex Optimziation,
mainly Chapter 5.
Book
available online as pdf.
For SVMs:
Schölkopf and Smola: Learning with Kernels, Chapter 6
(Optimization theory) and Chapter 7 (SVMs)
Lecture 11 (6.2.2008): Kernels and non-linear SVMs
- Recap: Hilbert spaces
- Kernels and the reproducing kernel Hilbert space RKHS
- The kernel trick, SVM with kernels
- Representer theorem
- Implementation, demos, and software packages
Literature:
A gentle introduction to RKHS can be found in Chapter 2 of
Schölkopf and Smola: Learning with Kernels.
An extensive overview over properties of RKHS used in machine learning
is in
M. Hein and O. Bousquet: Kernels,
Associated Structures and Generalizations. 2004
Examples for the construction of kernels on different types of
input data see Chapters 9 - 12 of
Shawe-Taylor, Cristianini: Kernel methods for pattern analysis
and Chapter 13 of
Schölkopf and Smola: Learning with Kernels.
Software:
A nice online demo of SVMs can be found
here
or here.
here.
A very general and good SVM implementation with interfaces from many
common languages (matlab, python, R, C, ...) is
libSVM by
Chih Jen Lin. This is the one I recommend to use in the beginning.
Lecture 12 (13.2.2008): Generalization bounds and consistency of SVMs
At the end of this lecture, I will also give a short overview over the
MPI in Tuebingen and what is going on in machine learning.
- Generalization bounds for SVMs: Large margin bounds, sparsity
bounds, trace bound,...
- Universal kernels and consistency of the SVM
- Regularization
Literature:
For the large margin and sparsity bound: e.g.
Bartlett, Shawe-Taylor:
Generalization performance of Support Vector Machines and other
Pattern Classifiers. 1998
For the trace bound:
Bartlett and Mendelson: Rademacher and Gaussian
Complexities: Risk Bounds and Structural Results. 2001.
For universal kernels and consistency of the SVM:
I. Steinwart, On the Influence of the Kernel on the Consistency of
Support Vector Machines. Journal of Machine Learning Research, Vol. 2,
pp. 67-93, 2001
For consistency of regularization in general:
I. Steinwart, Consistency of Support Vector Machines and other Regularized Kernel Machines. IEEE Transactions on Information Theory, Vol. 51, pp. 128-142, 2005.