Kernel Method for Two-Sample Problem

About

This package contains a Matlab implementation of a kernel-based statistical hypothesis test for the two-sample problem, as described in GreEtAl07a and GreEtAl07b.

We propose to test whether distributions P and Q are different on the basis of samples drawn from each of them, by finding a smooth function (the witness function) which is large on the points drawn from P, and small (as negative as possible) on the points from Q. We use as our test statistic the difference between the mean function values on the two samples, or maximum mean discrepancy (MMD): when this is large, the samples are likely from different distributions. Smoothness is enforced by restricting the witness function to a unit ball in a reproducing kernel Hilbert space.
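The statistic described above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the package's Matlab code in mmd.m: the Gaussian kernel, the bandwidth `sigma`, and the function names `gaussian_kernel`, `mmd2_biased` and `witness` are all assumptions made for the example. It computes a biased estimate of the squared MMD and the unnormalised empirical witness function; the difference between the witness's mean values on the two samples equals that squared estimate, and rescaling the witness to unit RKHS norm would give the MMD itself.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (
        gaussian_kernel(x, x, sigma).mean()
        + gaussian_kernel(y, y, sigma).mean()
        - 2.0 * gaussian_kernel(x, y, sigma).mean()
    )


def witness(t, x, y, sigma=1.0):
    """Unnormalised empirical witness function evaluated at the rows of t.

    Positive where the x sample has more kernel mass, negative where
    the y sample has more.
    """
    return (
        gaussian_kernel(t, x, sigma).mean(axis=1)
        - gaussian_kernel(t, y, sigma).mean(axis=1)
    )
```

With `x` and `y` given as arrays of shape (number of points, dimension), `mmd2_biased(x, y)` is close to zero when both samples come from the same distribution and grows as the distributions separate. The statistic alone does not give a test: it still has to be compared against a threshold.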

Three strategies are used to calculate the test threshold:

Moment matching using Pearson curves: fits Pearson curves to the first three moments of the test statistic (and uses a lower bound on the fourth). It is the slowest alternative, but is more accurate at small sample sizes (roughly speaking, fewer than 100 points from each of P and Q, although this depends on the distributions). Requires the Matlab statistics toolbox.
Bootstrap: uses bootstrap resampling on the aggregated data to obtain a test threshold. It is faster than moment matching, and performs equally well at large sample sizes.
Large deviation bound: uses a large-deviation bound to provide a test with non-asymptotic, distribution-free performance guarantees. In practice, however, the resulting test is too conservative, and performs worse than either of the approaches above. It is included here to permit the reproduction of our results in GreEtAl07a.
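To make the resampling strategy concrete, the following Python sketch estimates a threshold by repeatedly re-splitting the pooled data and recomputing the statistic. It is an illustration under stated assumptions, not the procedure implemented in mmd.m: it resamples without replacement (a permutation scheme), uses a Gaussian kernel with bandwidth `sigma` and the biased squared-MMD estimate, and the names `pooled_gaussian_kernel`, `mmd2_from_kernel` and `resampling_test` are hypothetical.

```python
import numpy as np


def pooled_gaussian_kernel(z, sigma=1.0):
    """Gaussian RBF kernel matrix over all rows of the pooled sample z."""
    sq = np.sum(z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))


def mmd2_from_kernel(k, idx_x, idx_y):
    """Biased squared MMD for the split (idx_x, idx_y) of a pooled kernel matrix."""
    kxx = k[np.ix_(idx_x, idx_x)].mean()
    kyy = k[np.ix_(idx_y, idx_y)].mean()
    kxy = k[np.ix_(idx_x, idx_y)].mean()
    return kxx + kyy - 2.0 * kxy


def resampling_test(x, y, sigma=1.0, alpha=0.05, n_resamples=500, seed=0):
    """Return (observed statistic, threshold, reject flag) at level alpha."""
    rng = np.random.default_rng(seed)
    m = len(x)
    z = np.vstack([x, y])                  # aggregate the two samples
    n = len(z)
    k = pooled_gaussian_kernel(z, sigma)   # computed once, reused per resample
    observed = mmd2_from_kernel(k, np.arange(m), np.arange(m, n))

    null_stats = np.empty(n_resamples)
    for b in range(n_resamples):
        perm = rng.permutation(n)          # random re-split of the pooled data
        null_stats[b] = mmd2_from_kernel(k, perm[:m], perm[m:])

    # The (1 - alpha) quantile of the resampled statistics is the threshold.
    threshold = np.quantile(null_stats, 1.0 - alpha)
    return observed, threshold, bool(observed > threshold)
```

Computing the pooled kernel matrix once and indexing into it for each re-split avoids recomputing kernel values, so the cost per resample is dominated by averaging submatrices.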

An earlier version of this test was proposed in BorEtAl06. The current test estimates the null distribution more accurately, and should be used in preference to the earlier algorithm.

Code

The code may be downloaded here.

The archive contains two files: mmd.m is the main code, and U4thmoment.c contains additional optimised C code for one of the test options. While the algorithm runs in standalone form, it can also be used with the Spider machine learning toolbox. The code was written by Malte Rasch.

References

[GreEtAl07a]	Gretton, A., K. Borgwardt, M. Rasch, B. Schoelkopf and A. Smola: A Kernel Method for the Two-Sample-Problem. NIPS 2006.
[GreEtAl07b]	Gretton, A., K. Borgwardt, M. Rasch, B. Schoelkopf and A. Smola: A Kernel Method for the Two-Sample-Problem. MPI Technical Report 157, 2007.
[BorEtAl06]	Borgwardt, K., A. Gretton, M. Rasch, H.-P. Kriegel, B. Schoelkopf and A. Smola: Integrating structured biological data by Kernel Maximum Mean Discrepancy. Bioinformatics 22(14), 1-9 (2006).

Contact

arthur@tuebingen.mpg.de

A Kernel Method for the Two Sample Problem

by Arthur Gretton, Karsten Borgwardt, Malte Rasch,
Bernhard Schoelkopf, Alex Smola
