
New Frontiers in Model Order Selection
NIPS 2011 Workshop
December 16, 2011, Sierra Nevada, Spain

Video 

Video recording of the workshop is available at: http://videolectures.net/nipsworkshops2011_model_order_selection 
Description 
Model order selection, the tradeoff between model resolution and its statistical reliability, is one of the fundamental questions in machine learning. It has been studied in detail in the context of supervised learning with i.i.d. samples, but has received relatively little attention beyond this domain. The goal of our workshop is to draw attention to the question of model order selection in other domains, to share ideas and approaches between the domains, and to identify promising directions for future research. Our interest covers ways of defining model complexity in different domains, examples of practical problems where intelligent model order selection yields an advantage over simplistic approaches, and new theoretical tools for the analysis of model order selection. The domains of interest span all problems that cannot be directly mapped to supervised learning with i.i.d. samples, including, but not limited to, reinforcement learning, active learning, learning with delayed, partial, or indirect feedback, and learning with submodular functions.

An example of first steps in defining the complexity of models in reinforcement learning, applying a tradeoff between model complexity and empirical performance, and analyzing it can be found in [1-4]. An intriguing research direction coming out of these works is the simultaneous analysis of the exploration-exploitation and model order selection tradeoffs. Such an analysis makes it possible to design and analyze models that adapt their complexity as they continue to explore and observe new data. Potential practical applications of such models include contextual bandits (for example, in personalization of recommendations on the web [5]) and Markov decision processes.

References:

[1] N. Tishby and D. Polani. "Information Theory of Decisions and Actions". In Perception-Reason-Action Cycle: Models, Algorithms and Systems, 2010.
[2] J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate. "A Bayesian Sampling Approach to Exploration in Reinforcement Learning". In UAI, 2009.
[3] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design". In ICML, 2010.
[4] Y. Seldin, P. Auer, F. Laviolette, J. Shawe-Taylor, and R. Ortner. "PAC-Bayesian Analysis of Contextual Bandits". In NIPS, 2011.
[5] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. Schapire. "Contextual Bandit Algorithms with Supervised Learning Guarantees". In AISTATS, 2011. 
Organizers 
Yevgeny Seldin, Max Planck Institute for Intelligent Systems
Koby Crammer, Technion
Nicolò Cesa-Bianchi, University of Milano
François Laviolette, Université Laval
John Shawe-Taylor, University College London 
Invited Speakers 
Shie Mannor, Technion
Peter Auer, Montanuniversität Leoben
John Langford, Yahoo!
Naftali Tishby, The Hebrew University 
Schedule

7:50 - 8:00   Opening remarks
8:00 - 8:45   Shie Mannor - "Model Selection in Markovian Processes"
8:45 - 9:05   Amir-massoud Farahmand - "BErMin: A Model Selection Algorithm for Reinforcement Learning Problems"
9:05 - 9:15   Break
9:15 - 9:35   Odalric-Ambrym Maillard - "Selecting the State-Representation in Reinforcement Learning"
9:35 - 10:20  Peter Auer - "Autonomous Exploration in Reinforcement Learning"
10:20 - 16:00 Break
16:00 - 16:45 John Langford - "Model Selection when Learning with Exploration"
16:45 - 17:45 Posters
17:45 - 18:30 Naftali Tishby - "Future Information Minimization as PAC-Bayes Regularization in Reinforcement Learning"
18:30 - 19:30 Panel Discussion 
Posters 
Amir-massoud Farahmand and Csaba Szepesvári - "BErMin: A Model Selection Algorithm for Reinforcement Learning Problems" (to appear in Machine Learning Journal)
Aurélie Boisbunon - "Criteria for variable selection with dependence"
Marina Sapir - "Bias Plus Variance Decomposition for Survival Analysis Problems"
Zhou Bai and Stefan C. Kremer - "Using Deep-belief Networks for Model Order Selection: A Case Study on DNA Sequence Classification"
Odalric-Ambrym Maillard, Rémi Munos, and Daniil Ryabko - "Selecting the State-Representation in Reinforcement Learning" (in NIPS, 2011)
Mohammad Ghavamzadeh, Alessandro Lazaric, Rémi Munos, and Matthew Hoffman - "Finite-Sample Analysis of Lasso-TD" (in ICML, 2011)
S. Clémençon, N. Baskiotis, and N. Usunier - "Calibrating SVMs with V-fold penalization"
Morteza Haghir Chehreghani, Alberto Giovanni Busetto, and Joachim M. Buhmann - "Cluster Model Validation by Maximizing Approximation Capacity"
Yuri Grinberg, Mahdi Milani Fard, and Joelle Pineau - "LSTD on Sparse Spaces"
Alexandre Lacoste, François Laviolette, and Mario Marchand - "Comparing Learning Algorithms on Multiple Datasets"
Yevgeny Seldin, Peter Auer, François Laviolette, John Shawe-Taylor, and Ronald Ortner - "PAC-Bayesian Analysis of Contextual Bandits" (in NIPS, 2011) 
Talk Abstracts 
Shie Mannor (Technion) - "Model Selection in Markovian Processes"

We address the problem of how to use a sample of trajectories to choose from a candidate set of possible state spaces in different types of Markov processes. Standard approaches to solving this problem for static models use penalized maximum likelihood criteria that take the likelihood of the trajectory into account. Surprisingly, these criteria do not work even for simple, fully observable finite Markov processes. We propose an alternative criterion and show that it is consistent. We then provide a guarantee on its performance with finite samples and illustrate its accuracy using simulated and real-world data. We finally address the question of model selection in Markov decision processes, where the decision maker can actively select actions to assist in model selection.

Amir-massoud Farahmand (McGill University) - "BErMin: A Model Selection Algorithm for Reinforcement Learning Problems"

We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting, when the goal is to find an action-value function with the smallest Bellman error among a countable set of candidate functions. We propose a complexity-regularization-based model selection algorithm, BErMin, and prove that it enjoys an oracle-like property: the estimator's error differs from that of an oracle, who selects the candidate with the minimum Bellman error, by only a constant factor and a small remainder term that vanishes at a parametric rate as the number of samples increases. Joint work with Csaba Szepesvári.

Odalric-Ambrym Maillard (Montanuniversität Leoben) - "Selecting the State-Representation in Reinforcement Learning"

The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. 
Without knowing which of the models is the correct one, or what the probabilistic characteristics of the resulting MDP are, the goal is to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order T^(2/3), where T is the time horizon. Joint work with Rémi Munos and Daniil Ryabko.

Peter Auer (Montanuniversität Leoben) - "Autonomous Exploration in Reinforcement Learning"

One of the striking differences between current reinforcement learning algorithms and early human learning is that animals and infants appear to explore their environments with autonomous purpose, in a manner appropriate to their current level of skills. To analyse such autonomous exploration theoretically, an evaluation criterion is required to compare exploration algorithms. Unfortunately, no commonly agreed evaluation criterion has been established yet. As one possible criterion, we consider in this work the navigation skill of a learning agent after a number of exploration steps. In particular, we consider how many exploration steps are required until the agent has learned reliable policies for reaching all states within a certain distance from a start state. (Related but more general objectives are also of interest.) While this learning problem can be addressed in a straightforward manner for finite MDPs, it becomes much more interesting for potentially infinite (but discrete) MDPs. For infinite MDPs we can analyse how the learning agent increases its navigation skill for reaching more distant states as the exploration time increases. We show that an optimistic exploration strategy learns reliable policies when the number of exploration steps is linear in the number of reachable states and in the number of actions. The number of reachable states is not known to the algorithm, but the algorithm adapts to this number. 
Joint work with Shiau Hong Lim and Chris Watkins.

John Langford (Yahoo! Research) - "Model Selection when Learning with Exploration"

I will discuss model selection in four settings: {Selective Sampling, Partial Feedback} x {Agnostic, Realizable}. In selective sampling, you choose on which examples to acquire a label. In partial feedback, you choose for which label (or action) to discover a reward (or loss). In the agnostic setting, your goal is simply to compete with a set of predictors. In the realizable setting, one of your predictors is perfect, for varying definitions of perfect.

Naftali Tishby (The Hebrew University of Jerusalem) - "Future Information Minimization as PAC-Bayes Regularization in Reinforcement Learning"

Interactions between an organism and its environment are commonly treated in the framework of Markov Decision Processes (MDPs). While a standard MDP is aimed at maximizing expected future rewards (value), the circular flow of information between the agent and its environment is generally ignored. In particular, the information gained from the environment by means of perception and the information involved in the process of action selection are not treated in the standard MDP setting. In this talk, we focus on the control information and show how it can be combined with the reward measure in a unified way. Both of these measures satisfy the familiar Bellman recursive equations, and their linear combination (the free energy) provides an interesting new optimization criterion. The tradeoff between value and information, explored using our INFO-RL algorithm, provides a principled justification for stochastic (soft) policies. These optimal policies are also shown to be robust to uncertainties in the reward values, by applying the PAC-Bayes generalization bound. The same PAC-Bayesian bounding term thus plays dual roles: as information gain in the Information-RL formalism and as a model-order regularization term in the learning of the process. 
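The stochastic (soft) policies that such a value-information tradeoff justifies have a simple Boltzmann form. The sketch below is a hedged illustration, not code from the talk: for a single state, the free-energy optimal policy over a prior rho is pi(a) proportional to rho(a) * exp(beta * Q(a)), where beta sets the tradeoff. The action values, prior, and beta values are illustrative assumptions.

```python
import numpy as np

def soft_policy(q, rho, beta):
    """Free-energy optimal policy for one state: pi(a) ∝ rho(a) * exp(beta * q(a))."""
    logits = np.log(rho) + beta * q
    logits -= logits.max()   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

q = np.array([1.0, 0.0, 0.5])   # hypothetical action values Q(s, a)
rho = np.ones(3) / 3            # uniform prior over the 3 actions

# beta -> 0: control information is expensive, policy stays at the prior.
# beta large: information is cheap, policy concentrates on the greedy action.
print(soft_policy(q, rho, 0.0))
print(soft_policy(q, rho, 50.0))
```

The interpolation between the prior (low beta) and the deterministic greedy policy (high beta) is what makes soft policies fall out of the free-energy criterion rather than being imposed by hand.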