Exploration-Exploitation in Reinforcement Learning

Tutorial at ALT'19

Regret Minimization in Infinite-Horizon Finite Markov Decision Processes

Reinforcement Learning (RL) studies the problem of sequential decision-making when the environment (i.e., the dynamics and the reward) is initially unknown but can be learned through direct interaction. A crucial step in the learning problem is to properly balance the exploration of the environment, in order to gather useful information, and the exploitation of the learned policy to collect as much reward as possible.
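
The dilemma can be made concrete on a toy example. The following minimal sketch (an illustrative example, not part of the tutorial material) runs epsilon-greedy Q-learning on a hypothetical two-state MDP in which an agent that never explores keeps exploiting a small certain reward and never discovers a better state.

import numpy as np

# Toy two-state MDP (illustrative only):
# in state 0, action 0 yields a small certain reward while action 1 leads to state 1;
# in state 1, action 0 yields a larger reward. Without exploration, the agent keeps
# exploiting the small reward and never discovers the better state.
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))  # transition kernel P[s, a, s']
R = np.zeros((n_states, n_actions))            # mean reward R[s, a]
P[0, 0, 0] = 1.0; R[0, 0] = 0.1
P[0, 1, 1] = 1.0; R[0, 1] = 0.0
P[1, 0, 1] = 1.0; R[1, 0] = 1.0
P[1, 1, 0] = 1.0; R[1, 1] = 0.0

def run_epsilon_greedy(epsilon, T=5000, gamma=0.9, lr=0.1, seed=0):
    """Q-learning with epsilon-greedy exploration; returns the total reward collected."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s, total = 0, 0.0
    for _ in range(T):
        # explore with probability epsilon, otherwise exploit the current estimate
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = int(rng.choice(n_states, p=P[s, a]))
        r = R[s, a]
        Q[s, a] += lr * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        total += r
        s = s_next
    return total

for eps in (0.0, 0.1):
    print(f"epsilon = {eps}: total reward = {run_epsilon_greedy(eps):.1f}")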

Recent theoretical results proved that approaches based on optimism or posterior sampling (e.g., UCRL, PSRL, etc.) successfully solve the exploration-exploitation dilemma and may require exponentially fewer samples than simpler (but very popular) techniques such as epsilon-greedy to converge to near-optimal policies. While the optimism and posterior sampling principles are directly inspired by the multi-armed bandit literature, RL poses specific challenges (e.g., how "local" uncertainty propagates through the Markov dynamics) that require a more sophisticated theoretical analysis.
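
To give a flavor of the posterior-sampling principle mentioned above, the following schematic sketch samples an MDP from a posterior over transitions and rewards and plans in it by backward induction. The function psrl_episode_policy, the Dirichlet/Gaussian priors, and all variable names are illustrative assumptions, not the exact algorithm or analysis covered in the tutorial.

import numpy as np

def psrl_episode_policy(trans_counts, reward_sums, reward_counts, H, rng):
    """Sample one MDP from the posterior and compute its optimal finite-horizon policy.

    trans_counts: (S, A, S) visit counts N(s, a, s'); reward_sums, reward_counts: (S, A).
    """
    S, A, _ = trans_counts.shape
    # Sample transition probabilities from a Dirichlet posterior (uniform prior via +1)
    P = np.array([[rng.dirichlet(trans_counts[s, a] + 1.0) for a in range(A)]
                  for s in range(S)])
    # Sample mean rewards around their empirical means, with noise shrinking in the counts
    mean_r = reward_sums / np.maximum(reward_counts, 1)
    R = rng.normal(mean_r, 1.0 / np.sqrt(np.maximum(reward_counts, 1)))
    # Backward induction (dynamic programming) on the sampled MDP
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                    # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V(s')
        policy[h] = np.argmax(Q, axis=1)
        V = np.max(Q, axis=1)
    return policy

# Tiny usage example with empty statistics (i.e., sampling from the prior)
rng = np.random.default_rng(0)
S, A, H = 3, 2, 5
policy = psrl_episode_policy(np.zeros((S, A, S)), np.zeros((S, A)), np.zeros((S, A)), H, rng)
print(policy)  # one action per stage h and state s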

The focus of the tutorial is to provide a formal definition of the exploration-exploitation dilemma, to discuss its challenges, and to review the main algorithmic principles and their theoretical guarantees for different optimality criteria (notably finite-horizon and average-reward problems). Throughout the tutorial we will discuss open problems and possible future research directions.
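
For reference, the regret notions commonly used for these two optimality criteria (stated here in their standard form) are
\[
\text{finite-horizon (}K\text{ episodes of length }H\text{):}\qquad
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big(V_1^{\star}(s_1^k) - V_1^{\pi_k}(s_1^k)\Big),
\]
\[
\text{average reward (}T\text{ steps):}\qquad
\Delta(T) \;=\; T\,\rho^{\star} \;-\; \sum_{t=1}^{T} r_t,
\]
where \(V_1^{\star}\) is the optimal value function, \(\pi_k\) is the policy played in episode \(k\) started in state \(s_1^k\), \(\rho^{\star}\) is the optimal long-term average reward (gain), and \(r_t\) is the reward collected at step \(t\).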

Alessandro Lazaric

A. Lazaric has been a research scientist at the Facebook AI Research (FAIR) lab since 2017; previously, he was a researcher at Inria in the SequeL team. His main research topic is reinforcement learning, with extensive contributions on both the theoretical and algorithmic aspects of RL. In the last ten years, he has studied the exploration-exploitation dilemma in both the multi-armed bandit and reinforcement learning frameworks, notably on the problems of regret minimization, best-arm identification, pure exploration, and hierarchical RL.

Matteo Pirotta

M. Pirotta is a research scientist at the Facebook AI Research (FAIR) lab in Paris. Previously, he was a postdoc at Inria in the SequeL team. He received his PhD in computer science from the Politecnico di Milano (Italy) in 2016. For his doctoral thesis in reinforcement learning, he received the Dimitris N. Chorafas Foundation Award and an honorable mention for the EurAI Distinguished Dissertation Award. His main research interest is reinforcement learning. In recent years, he has mainly focused on the exploration-exploitation dilemma in RL.

Ronan Fruit

R. Fruit is a third-year PhD student in the SequeL team at Inria under the supervision of Alessandro Lazaric and Daniil Ryabko. He is currently a research intern at Facebook AI Research (FAIR) in Montreal. His research focuses on the theoretical understanding of the exploration-exploitation dilemma in Reinforcement Learning and the design of algorithms with provably good regret guarantees.

Additional Resources

UCRL2B Regret

Proof of \(\sqrt{D}\) regret for UCRL2B

PSRL Lemma

Notes on the \(\sqrt{S}\) dependence of PSRL

Online Reinforcement Learning Library

Implementation of several approaches for regret minimization in reinforcement learning

Python library with a Cython implementation of planning algorithms (see the plain-Python sketch of such a routine below)

The library is hosted on GitHub
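
As a plain-Python illustration of the kind of planning routine such a library implements (this sketch is only an assumption-laden example and does not reflect the library's actual API), value iteration for a discounted tabular MDP can be written as follows.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8, max_iter=10_000):
    """P: (S, A, S) transition kernel, R: (S, A) mean rewards. Returns (V, greedy policy)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)          # Bellman optimality backup, Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)

# Tiny usage example on a random 4-state, 2-action MDP
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))   # random transition probabilities
R = rng.random((4, 2))                       # random mean rewards in [0, 1]
V, pi = value_iteration(P, R)
print("V* =", np.round(V, 3), " greedy policy =", pi)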