Exploration-Exploitation
in Reinforcement Learning

-- rlgammazero --

EWRL 2022

Invited Talks at EWRL 2022

Exploration in Goal-Oriented Reinforcement Learning

Pirotta
Slides

Understanding Unsupervised Exploration for Goal-Based RL

Lazaric
Slides

Matteo Pirotta

M. Pirotta is a research scientist at Meta in Paris. Previously, he was a postdoc at Inria in the SequeL team. He received his PhD in computer science from the Politecnico di Milano (Italy) in 2016. For his doctoral thesis in reinforcement learning, he received the Dimitris N. Chorafas Foundation Award and an honorable mention for the EurAI Distinguished Dissertation Award. His main research interest is reinforcement learning. In recent years, he has mainly focused on the exploration-exploitation dilemma in RL.

Alessandro Lazaric

A. Lazaric has been a research scientist at the Meta (FAIR) lab since 2017, and he was previously a researcher at INRIA in the SequeL team. His main research topic is reinforcement learning, with extensive contributions on both the theoretical and algorithmic aspects of RL. In the last ten years, he has studied the exploration-exploitation dilemma in both the multi-armed bandit and reinforcement learning frameworks, notably on the problems of regret minimization, best-arm identification, pure exploration, and hierarchical RL. He has published over 40 papers in top machine learning conferences and journals.

AAAI'20

Tutorial at AAAI'20

Exploration in Reinforcement Learning

Ghavamzadeh, Lazaric, Pirotta

Reinforcement Learning (RL) studies the problem of sequential decision-making when the environment (i.e., the dynamics and the reward) is initially unknown but can be learned through direct interaction. RL algorithms have recently achieved impressive results in a variety of problems, including games and robotics.

Nonetheless, most recent RL algorithms require a huge amount of data to learn a satisfactory policy and cannot be used in domains where samples are expensive and/or long simulations are not possible (e.g., human-computer interaction). A fundamental step towards more sample-efficient algorithms is to devise methods that properly balance the exploration of the environment, in order to gather useful information, and the exploitation of the learned policy, to collect as much reward as possible.

The objective of the tutorial is to raise awareness of the importance of the exploration-exploitation dilemma in improving the sample efficiency of modern RL algorithms. The tutorial will provide the audience with a review of the major algorithmic principles (notably, optimism in the face of uncertainty and posterior sampling), their theoretical guarantees in the exact case (i.e., tabular RL), and their application to more complex environments, including parameterized MDPs, linear-quadratic control, and their integration with deep learning architectures. The tutorial should provide enough theoretical and algorithmic background to enable researchers in AI and RL to integrate exploration principles into existing RL algorithms and to devise novel sample-efficient RL methods able to deal with complex applications such as human-computer interaction (e.g., conversational agents), medical applications (e.g., drug optimization), and advertising (e.g., lifetime value optimization in marketing). Throughout the whole tutorial, we will discuss open problems and possible future research directions.
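
As a minimal illustration of the optimism-in-the-face-of-uncertainty principle covered in the tutorial, the sketch below adds a count-based exploration bonus to tabular Q-learning, so rarely tried state-action pairs look attractive until their uncertainty shrinks. The environment interface (env.reset(), env.step(a)) and all constants are illustrative assumptions, not material from the tutorial itself.

          import numpy as np

          # Sketch of optimism in the face of uncertainty in a tabular setting:
          # an exploration bonus proportional to 1/sqrt(N(s, a)) is added to the
          # observed reward, and the agent acts greedily w.r.t. the optimistic Q.
          # env.reset() -> s and env.step(a) -> (s', r, done) are assumed.
          def optimistic_q_learning(env, n_states, n_actions,
                                    episodes=500, horizon=100,
                                    gamma=0.99, lr=0.1, bonus_scale=1.0):
              Q = np.zeros((n_states, n_actions))
              N = np.ones((n_states, n_actions))        # visit counts (start at 1)
              for _ in range(episodes):
                  s = env.reset()
                  for _ in range(horizon):
                      a = int(np.argmax(Q[s]))          # greedy w.r.t. optimistic Q
                      s_next, r, done = env.step(a)
                      N[s, a] += 1
                      bonus = bonus_scale / np.sqrt(N[s, a])   # count-based optimism
                      target = r + bonus + gamma * np.max(Q[s_next])
                      Q[s, a] += lr * (target - Q[s, a])
                      s = s_next
                      if done:
                          break
              return Q

The tutorial covers the principled versions of this idea (e.g., UCB-style bonuses with theoretical guarantees); the snippet only conveys the mechanism.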

If you want to cite this tutorial, please use:

          @misc{glp2020aaaitutorial,
            author       = {Mohammad Ghavamzadeh and
                            Alessandro Lazaric and
                            Matteo Pirotta},
            title        = {Exploration in Reinforcement Learning},
            howpublished = {Tutorial at AAAI'20},
            year         = {2020},
            url          = {https://rlgammazero.github.io/},
          }

Mohammad Ghavamzadeh

M. Ghavamzadeh received a PhD from UMass Amherst in 2005. He was a postdoctoral fellow at UAlberta from 2005 to 2008. He has been a permanent researcher at INRIA since 2008. He was the recipient of the "INRIA award for scientific excellence" in 2011, and obtained his Habilitation in 2014. Since 2013, he has been a senior researcher, first at Adobe Research, then at DeepMind, and now at Facebook AI Research (FAIR). He has published over 70 refereed papers in major machine learning, AI, and control journals and conferences.

Alessandro Lazaric

A. Lazaric has been a research scientist at the Facebook AI Research (FAIR) lab since 2017, and he was previously a researcher at INRIA in the SequeL team. His main research topic is reinforcement learning, with extensive contributions on both the theoretical and algorithmic aspects of RL. In the last ten years, he has studied the exploration-exploitation dilemma in both the multi-armed bandit and reinforcement learning frameworks, notably on the problems of regret minimization, best-arm identification, pure exploration, and hierarchical RL. He has published over 40 papers in top machine learning conferences and journals.

Matteo Pirotta

M. Pirotta is a research scientist at the Facebook AI Research (FAIR) lab in Paris. Previously, he was a postdoc at INRIA in the SequeL team. He received his PhD in computer science from the Politecnico di Milano (Italy) in 2016. For his doctoral thesis in reinforcement learning, he received the Dimitris N. Chorafas Foundation Award and an honorable mention for the EurAI Distinguished Dissertation Award. His main research interest is reinforcement learning. In recent years, he has mainly focused on the exploration-exploitation dilemma in RL.

ALT'19

Tutorial at ALT'19

Regret Minimization in Infinite-Horizon Finite Markov Decision Processes

Fruit, Lazaric, Pirotta

Reinforcement Learning (RL) studies the problem of sequential decision-making when the environment (i.e., the dynamics and the reward) is initially unknown but can be learned through direct interaction. A crucial step in the learning problem is to properly balance the exploration of the environment, in order to gather useful information, and the exploitation of the learned policy to collect as much reward as possible.

Recent theoretical results proved that approaches based on optimism or posterior sampling (e.g., UCRL, PSRL, etc.) successfully solve the exploration-exploitation dilemma and may require exponentially fewer samples than simpler (but very popular) techniques such as epsilon-greedy to converge to near-optimal policies. While the optimism and posterior sampling principles are directly inspired by the multi-armed bandit literature, RL poses specific challenges (e.g., how "local" uncertainty propagates through the Markov dynamics) that require a more sophisticated theoretical analysis.

The focus of the tutorial is to provide a formal definition of the exploration-exploitation dilemma, to discuss its challenges, and to review the main algorithmic principles and their theoretical guarantees for different optimality criteria (notably finite-horizon and average-reward problems). Throughout the whole tutorial, we will discuss open problems and possible future research directions.
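
As a rough sketch of the posterior-sampling principle mentioned above (in the spirit of PSRL), the code below samples a tabular MDP from the current posterior at the start of each episode, plans in it by backward induction, and acts greedily for the whole episode. The Dirichlet prior on transitions, the Bernoulli-reward/Beta-prior assumption, and the env.reset()/env.step() interface are hypothetical simplifications, not the exact algorithm analyzed in the tutorial.

          import numpy as np

          # PSRL-style posterior sampling for a tabular, finite-horizon MDP.
          # Assumes Bernoulli rewards (r in {0, 1}) so Beta updates are valid.
          def psrl(env, n_states, n_actions, horizon, episodes=200):
              trans_counts = np.ones((n_states, n_actions, n_states))  # Dirichlet(1) prior
              rew_alpha = np.ones((n_states, n_actions))                # Beta(1, 1) prior
              rew_beta = np.ones((n_states, n_actions))
              for _ in range(episodes):
                  # 1. Sample one MDP (transitions and rewards) from the posterior.
                  P = np.array([[np.random.dirichlet(trans_counts[s, a])
                                 for a in range(n_actions)] for s in range(n_states)])
                  R = np.random.beta(rew_alpha, rew_beta)
                  # 2. Solve the sampled MDP by backward induction.
                  Q = np.zeros((horizon + 1, n_states, n_actions))
                  for h in range(horizon - 1, -1, -1):
                      V_next = Q[h + 1].max(axis=1)
                      Q[h] = R + P @ V_next
                  # 3. Act greedily in the sampled MDP and update the posterior.
                  s = env.reset()
                  for h in range(horizon):
                      a = int(np.argmax(Q[h, s]))
                      s_next, r, done = env.step(a)
                      trans_counts[s, a, s_next] += 1
                      rew_alpha[s, a] += r
                      rew_beta[s, a] += 1 - r
                      s = s_next
                      if done:
                          break
              return trans_counts, rew_alpha, rew_beta

Resampling only at episode boundaries (rather than every step) is what makes the sampled policy consistent enough to drive deep exploration; the tutorial discusses the corresponding regret guarantees.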

If you want to cite this tutorial, please use:

          @misc{flp2019alttutorial,
            author       = {Ronan Fruit and
                            Alessandro Lazaric and
                            Matteo Pirotta},
            title        = {Regret Minimization in Infinite-Horizon Finite Markov Decision Processes},
            howpublished = {Tutorial at ALT'19},
            year         = {2019},
            url          = {https://rlgammazero.github.io/},
          }

Ronan Fruit

R. Fruit is a third-year PhD student in the SequeL team at Inria under the supervision of Alessandro Lazaric and Daniil Ryabko. He is currently a research intern at Facebook AI Research (FAIR) in Montreal. His research focuses on the theoretical understanding of the exploration-exploitation dilemma in reinforcement learning and the design of algorithms with provably good regret guarantees.

Alessandro Lazaric

A. Lazaric has been a research scientist at the Facebook AI Research (FAIR) lab since 2017, and he was previously a researcher at Inria in the SequeL team. His main research topic is reinforcement learning, with extensive contributions on both the theoretical and algorithmic aspects of RL. In the last ten years, he has studied the exploration-exploitation dilemma in both the multi-armed bandit and reinforcement learning frameworks, notably on the problems of regret minimization, best-arm identification, pure exploration, and hierarchical RL.

Matteo Pirotta

M. Pirotta is a research scientist at the Facebook AI Research (FAIR) lab in Paris. Previously, he was a postdoc at Inria in the SequeL team. He received his PhD in computer science from the Politecnico di Milano (Italy) in 2016. For his doctoral thesis in reinforcement learning, he received the Dimitris N. Chorafas Foundation Award and an honorable mention for the EurAI Distinguished Dissertation Award. His main research interest is reinforcement learning. In recent years, he has mainly focused on the exploration-exploitation dilemma in RL.
