Bootstrapped Policy Learning: Goal Shaping for Efficient Task-oriented Dialogue Policy Learning

Yangyang Zhao, Mehdi Dastani, Shihan Wang

Research output: Contribution to journal, Conference article, Academic, peer-review

Abstract

Reinforcement Learning (RL) shows promise in optimizing task-oriented dialogue policies, but reward sparsity remains a challenge. Curriculum learning offers an effective solution by training dialogue policies on goals ordered from simple to complex, facilitating a smooth knowledge transition across varied goal complexities. However, these methods typically assume that goal difficulty increases gradually, so that the policy can adapt to difficult goals over time. In complex environments lacking intermediate goals, attaining smooth knowledge transitions becomes difficult. This paper proposes a novel Bootstrapped Policy Learning (BPL) framework that adaptively tailors a curriculum of progressively challenging subgoals for each complex goal through goal shaping. Goal shaping comprises goal decomposition and goal evolution: decomposition breaks complex goals into solvable subgoals, and evolution progressively increases subgoal difficulty as the policy improves. BPL combines these two aspects, enabling smooth knowledge transitions from simple to complex goals and thereby improving the efficiency of task-oriented dialogue policy learning. Our experiments demonstrate the effectiveness of BPL in two complex dialogue environments.
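
The abstract does not give implementation details, so the following is only a minimal illustrative sketch of the general idea of goal decomposition and evolution, not the authors' BPL algorithm. It assumes a goal is a dictionary of slot constraints, that subgoals are nested prefixes of those constraints, and that difficulty is raised once a recent success rate reaches a threshold. The names `decompose`, `GoalCurriculum`, `threshold`, and `window`, as well as the example slots, are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


def decompose(goal: Dict[str, str]) -> List[Dict[str, str]]:
    """Split a goal into nested subgoals of growing size.

    Assumption: a subgoal is simply the first k slot constraints of the goal.
    """
    slots = list(goal.items())
    return [dict(slots[:k]) for k in range(1, len(slots) + 1)]


class GoalCurriculum:
    """Serve the subgoals of one complex goal, easiest first (hypothetical)."""

    def __init__(self, goal: Dict[str, str], threshold: float = 0.8, window: int = 5):
        self.subgoals = decompose(goal)
        self.level = 0                      # index of the current subgoal
        self.threshold = threshold          # success rate needed to advance
        self.recent: Deque[bool] = deque(maxlen=window)

    def current(self) -> Dict[str, str]:
        return self.subgoals[self.level]

    def report(self, success: bool) -> None:
        """Record one dialogue outcome and evolve the subgoal if warranted."""
        self.recent.append(success)
        window_full = len(self.recent) == self.recent.maxlen
        rate = sum(self.recent) / len(self.recent)
        is_last = self.level == len(self.subgoals) - 1
        if window_full and rate >= self.threshold and not is_last:
            self.level += 1                 # move on to a harder subgoal
            self.recent.clear()             # measure the new level afresh


# Usage with a made-up restaurant-booking goal.
goal = {"cuisine": "thai", "area": "north", "price": "cheap"}
curriculum = GoalCurriculum(goal)
for _ in range(5):
    curriculum.report(True)                 # pretend the policy succeeded
print(curriculum.current())
```

In a real training loop the outcome passed to `report` would come from running the dialogue policy against a user simulator on `current()`; that part is omitted here because the abstract does not describe it.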

Original language: English
Pages (from-to): 2615-2617
Number of pages: 3
Journal: Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS
Volume: 2024
Issue number: May
Publication status: Published - 6 May 2024
Event: 23rd International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2024 - Auckland, New Zealand
Duration: 6 May 2024 to 10 May 2024

Keywords

  • Curriculum Learning
  • Dialogue Policy
  • Goal Shaping
  • Reinforcement Learning
