New Frontiers in Model Order Selection

NIPS-2011 Workshop

December 16, Sierra Nevada, Spain


Video recording of the workshop is available at:


Model order selection, the trade-off between model resolution and its statistical reliability, is one of the fundamental questions in machine learning. It has been studied in detail in the context of supervised learning with i.i.d. samples, but has received relatively little attention beyond this domain. The goal of our workshop is to draw attention to the question of model order selection in other domains, to share ideas and approaches between the domains, and to identify promising directions for future research. Our interest covers ways of defining model complexity in different domains, examples of practical problems where intelligent model order selection yields an advantage over simplistic approaches, and new theoretical tools for the analysis of model order selection. The domains of interest span all problems that cannot be directly mapped to supervised learning with i.i.d. samples, including, but not limited to, reinforcement learning, active learning, learning with delayed, partial, or indirect feedback, and learning with submodular functions.

First steps toward defining the complexity of models in reinforcement learning, trading off model complexity against empirical performance, and analyzing this trade-off can be found in [1-4]. An intriguing research direction arising from these works is the simultaneous analysis of the exploration-exploitation and model order selection trade-offs. Such an analysis makes it possible to design and analyze models that adapt their complexity as they continue to explore and observe new data. Potential practical applications of such models include contextual bandits (for example, in the personalization of recommendations on the web [5]) and Markov decision processes.
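As an illustration of a model that adapts its complexity while it explores (a minimal sketch of the general idea, not an algorithm from [1-5]; the doubling refinement schedule and the piecewise-constant context model are our own assumptions), consider a contextual bandit on contexts in [0, 1] that runs UCB1 within each cell of a partition and refines the partition as more data arrives:

```python
import math
import random

class AdaptiveBinBandit:
    """Contextual bandit over contexts in [0, 1] with a piecewise-constant
    model: contexts are mapped to bins, UCB1 runs inside each bin, and the
    number of bins doubles on a schedule as more data is observed."""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.t = 0
        self.n_bins = 1          # start with the coarsest (most reliable) model
        self.next_refine = 8     # step at which the bin count doubles next
        self._reset_stats()

    def _reset_stats(self):
        self.counts = [[0] * self.n_arms for _ in range(self.n_bins)]
        self.sums = [[0.0] * self.n_arms for _ in range(self.n_bins)]

    def _bin(self, context):
        return min(int(context * self.n_bins), self.n_bins - 1)

    def select(self, context):
        self.t += 1
        if self.t == self.next_refine:   # refine the model, restart statistics
            self.n_bins *= 2
            self.next_refine *= 4
            self._reset_stats()
        b = self._bin(context)
        for a in range(self.n_arms):     # play every arm once per bin first
            if self.counts[b][a] == 0:
                return a
        # UCB1 within the current bin: empirical mean plus exploration bonus
        return max(range(self.n_arms),
                   key=lambda a: self.sums[b][a] / self.counts[b][a]
                   + math.sqrt(2.0 * math.log(self.t) / self.counts[b][a]))

    def update(self, context, arm, reward):
        b = self._bin(context)
        self.counts[b][arm] += 1
        self.sums[b][arm] += reward

# Toy environment: arm 0 pays off on contexts below 0.5, arm 1 above.
rng = random.Random(0)
bandit = AdaptiveBinBandit(n_arms=2)
total = 0.0
for _ in range(2000):
    x = rng.random()
    a = bandit.select(x)
    r = 1.0 if (a == 0) == (x < 0.5) else 0.0
    bandit.update(x, a, r)
    total += r
```

Early on, a single coarse bin keeps the per-bin reward estimates statistically reliable; later, finer bins buy resolution once enough samples have accumulated, which is exactly the complexity-versus-reliability trade-off discussed above.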

[1] N. Tishby, D. Polani, "Information Theory of Decisions and Actions", in Perception-Reason-Action Cycle: Models, Algorithms and Systems, 2010.
[2] J. Asmuth, L. Li, M. L. Littman, A. Nouri, D. Wingate, "A Bayesian Sampling Approach to Exploration in Reinforcement Learning", UAI, 2009.
[3] N. Srinivas, A. Krause, S. M. Kakade, M. Seeger, "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design", ICML, 2010.
[4] Y. Seldin, P. Auer, F. Laviolette, J. Shawe-Taylor, R. Ortner, "PAC-Bayesian Analysis of Contextual Bandits", NIPS, 2011.
[5] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R. Schapire, "Contextual Bandit Algorithms with Supervised Learning Guarantees", AISTATS, 2011.


Organizers

Yevgeny Seldin, Max Planck Institute for Intelligent Systems
Koby Crammer, Technion
Nicolò Cesa-Bianchi, University of Milan
François Laviolette, Université Laval
John Shawe-Taylor, University College London

Invited Speakers

Shie Mannor, Technion
Peter Auer, Montanuniversität Leoben
John Langford, Yahoo!
Naftali Tishby, The Hebrew University


Schedule

  7:50 - 8:00 Opening remarks
  8:00 - 8:45 Shie Mannor - "Model Selection in Markovian Processes"
  8:45 - 9:05 Amir-massoud Farahmand - "BErMin: A Model Selection Algorithm for Reinforcement Learning Problems"
  9:05 - 9:15 Break
  9:15 - 9:35 Odalric-Ambrym Maillard - "Selecting the State-Representation in Reinforcement Learning"
  9:35 - 10:20 Peter Auer - "Autonomous Exploration in Reinforcement Learning"
10:20 - 16:00 Break
16:00 - 16:45 John Langford - "Model Selection when Learning with Exploration"
16:45 - 17:45 Posters
17:45 - 18:30 Naftali Tishby - "Future Information Minimization as PAC-Bayes Regularization in Reinforcement Learning"
18:30 - 19:30 Panel Discussion


Accepted Papers

Amir-massoud Farahmand and Csaba Szepesvári - "BErMin: A Model Selection Algorithm for Reinforcement Learning Problems" - to appear in Machine Learning Journal.

Aurélie Boisbunon - "Criteria for variable selection with dependence"

Marina Sapir - "Bias Plus Variance Decomposition for Survival Analysis Problems"

Zhou Bai and Stefan C. Kremer - "Using Deep-belief Networks for Model Order Selection: A Case study on DNA Sequence Classification"

Odalric-Ambrym Maillard, Rémi Munos, and Daniil Ryabko - "Selecting the State-Representation in Reinforcement Learning" - In NIPS, 2011.

Mohammad Ghavamzadeh, Alessandro Lazaric, Rémi Munos, and Matthew Hoffman - "Finite-Sample Analysis of Lasso-TD" - In ICML, 2011.

S. Clémençon, N. Baskiotis, and N. Usunier - "Calibrating SVMs with V-fold penalization"

Morteza Haghir Chehreghani, Alberto Giovanni Busetto, and Joachim M. Buhmann - "Cluster Model Validation by Maximizing Approximation Capacity"

Yuri Grinberg, Mahdi Milani Fard, and Joelle Pineau - "LSTD on Sparse Spaces"

Alexandre Lacoste, François Laviolette, and Mario Marchand - "Comparing Learning Algorithms on Multiple Datasets"

Yevgeny Seldin, Peter Auer, François Laviolette, John Shawe-Taylor, and Ronald Ortner - "PAC-Bayesian Analysis of Contextual Bandits" - In NIPS, 2011.



Talk Abstracts

Shie Mannor (Technion) - "Model Selection in Markovian Processes"
We address the problem of how to use a sample of trajectories to choose from a candidate set of possible state spaces in different types of Markov processes. Standard approaches to solving this problem for static models use penalized maximum likelihood criteria that take the likelihood of the trajectory into account. Surprisingly, these criteria do not work even for simple fully observable finite Markov processes. We propose an alternative criterion and show that it is consistent. We then provide a guarantee on its performance with finite samples and illustrate its accuracy using simulated data and real-world data. We finally address the question of model selection in Markov decision processes, where the decision maker can actively select actions to assist in model selection.

Amir-massoud Farahmand (McGill University) - "BErMin: A Model Selection Algorithm for Reinforcement Learning Problems"
We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting when the goal is to find an action-value function with the smallest Bellman error among a countable set of candidate functions. We propose a complexity regularization-based model selection algorithm, BErMin, and prove that it enjoys an oracle-like property: the estimator's error differs from that of an oracle, who selects the candidate with the minimum Bellman error, by only a constant factor and a small remainder term that vanishes at a parametric rate as the number of samples increases.

Joint work with Csaba Szepesvári.

Odalric-Ambrym Maillard (Montanuniversität Leoben) - "Selecting the State-Representation in Reinforcement Learning"
The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Knowing neither which of the models is the correct one, nor the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order T^2/3, where T is the horizon time.

Joint work with Rémi Munos and Daniil Ryabko.

Peter Auer (Montanuniversität Leoben) - "Autonomous Exploration in Reinforcement Learning"
One of the striking differences between current reinforcement learning algorithms and early human learning is that animals and infants appear to explore their environments with autonomous purpose, in a manner appropriate to their current level of skills. For analysing such autonomous exploration theoretically, an evaluation criterion is required to compare exploration algorithms. Unfortunately, no commonly agreed evaluation criterion has been established yet. As one possible criterion, we consider in this work the navigation skill of a learning agent after a number of exploration steps. In particular, we consider how many exploration steps are required until the agent has learned reliable policies for reaching all states within a certain distance from a start state. (Related but more general objectives are also of interest.)

While this learning problem can be addressed in a straightforward manner for finite MDPs, it becomes much more interesting for potentially infinite (but discrete) MDPs. For infinite MDPs we can analyse how the learning agent increases its navigation skill for reaching more distant states, as the exploration time increases. We show that an optimistic exploration strategy learns reliable policies when the number of exploration steps is linear in the number of reachable states and in the number of actions. The number of reachable states is not known to the algorithm, but the algorithm adapts to this number.

Joint work with Shiau Hong Lim and Chris Watkins.

John Langford (Yahoo! Research) - "Model Selection when Learning with Exploration"
I will discuss model selection in 4 settings: {Selective Sampling, Partial Feedback} x {Agnostic, Realizable}. In selective sampling, you choose on which examples to acquire a label. In partial feedback, you choose on which label (or action) to discover a reward (or loss). In the agnostic setting, your goal is simply competing with a set of predictors. In the realizable setting, one of your predictors is perfect, for varying definitions of perfect.

Naftali Tishby (The Hebrew University of Jerusalem) - "Future Information Minimization as PAC-Bayes Regularization in Reinforcement Learning"
Interactions between an organism and its environment are commonly treated in the framework of Markov Decision Processes (MDPs). While a standard MDP is aimed at maximizing expected future rewards (value), the circular flow of information between the agent and its environment is generally ignored. In particular, the information gained from the environment by means of perception and the information involved in the process of action selection are not treated in the standard MDP setting. In this talk, we focus on the control information and show how it can be combined with the reward measure in a unified way. Both of these measures satisfy the familiar Bellman recursive equations, and their linear combination (the free energy) provides an interesting new optimization criterion. The trade-off between value and information, explored using our INFO-RL algorithm, provides a principled justification for stochastic (soft) policies. These optimal policies are also shown to be robust to uncertainties in the reward values by applying the PAC-Bayes generalization bound. The same PAC-Bayesian bounding term thus plays a dual role: as information gain in the Information-RL formalism and as a model-order regularization term in the learning of the process.
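Schematically, a free-energy criterion of the kind described above can be written as a Bellman-like recursion (the notation here is ours, not the speaker's: rho is an assumed fixed prior policy and beta an assumed trade-off parameter between value and information):

```latex
F^{\pi}(s) \;=\;
\underbrace{\mathrm{KL}\big(\pi(\cdot \mid s)\,\big\|\,\rho(\cdot \mid s)\big)}_{\text{control information}}
\;-\; \beta\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[r(s,a)\big]
\;+\; \mathbb{E}_{a \sim \pi,\; s' \sim P(\cdot \mid s,a)}\big[F^{\pi}(s')\big]
```

Minimizing F^pi over pi penalizes policies that deviate informationally from the prior while rewarding high value; the minimizer is a stochastic Boltzmann-type policy, which is one way to see why this trade-off yields the soft policies mentioned in the abstract.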