Abstract
We present Pure-Past Action Masking (PPAM), a lightweight approach to action masking for safe reinforcement learning. In PPAM, actions are disallowed (“masked”) according to specifications expressed in Pure-Past Linear Temporal Logic (PPLTL). PPAM can enforce non-Markovian constraints, i.e., constraints based on the history of the system rather than just the current state of the (possibly hidden) MDP. The features used in the safety constraint need not be the same as those used by the learning agent, allowing a clear separation of concerns between the safety constraints and the reward specifications of the (learning) agent. We prove formally that an agent trained with PPAM can learn any optimal policy that satisfies the safety constraints, and that PPAM is as expressive as shields, another approach to enforcing non-Markovian constraints in RL. Finally, we provide empirical results showing how PPAM can guarantee constraint satisfaction in practice.
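As a rough illustration of the idea of masking actions with a pure-past specification, the sketch below evaluates a formula incrementally over the history and filters out actions that would violate it. It is not the authors' implementation: the tuple encoding of formulas, the `PastMonitor` class, the `allowed_actions` helper, the labelling of each action by a proposition of the same name, and the door/key constraint are all assumptions made for this example.

```python
# Illustrative sketch only: every name and the formula encoding are assumptions.
ATOM, NOT, AND, OR = "atom", "not", "and", "or"
Y, O, H, S = "Y", "O", "H", "S"  # yesterday, once, historically, since


class PastMonitor:
    """Incrementally evaluates a pure-past formula given as nested tuples."""

    def __init__(self, formula):
        self.formula = formula
        self.prev = {}  # truth value of each subformula at the previous step

    def _eval(self, f, props, cur):
        op = f[0]
        if op == ATOM:
            val = f[1] in props
        elif op == NOT:
            val = not self._eval(f[1], props, cur)
        elif op in (AND, OR):
            left = self._eval(f[1], props, cur)
            right = self._eval(f[2], props, cur)
            val = (left and right) if op == AND else (left or right)
        elif op == Y:
            self._eval(f[1], props, cur)  # recorded for use at the next step
            val = self.prev.get(f[1], False)
        elif op == O:
            val = self._eval(f[1], props, cur) or self.prev.get(f, False)
        elif op == H:
            val = self._eval(f[1], props, cur) and self.prev.get(f, True)
        elif op == S:
            left = self._eval(f[1], props, cur)
            right = self._eval(f[2], props, cur)
            val = right or (left and self.prev.get(f, False))
        else:
            raise ValueError(f"unknown operator: {op!r}")
        cur[f] = val
        return val

    def peek(self, props):
        """Truth value if `props` held now, without changing the history."""
        return self._eval(self.formula, props, {})

    def step(self, props):
        """Commit `props` as the newest step of the history."""
        cur = {}
        val = self._eval(self.formula, props, cur)
        self.prev = cur
        return val


def allowed_actions(monitor, actions, label=lambda a: {a}):
    """Keep the actions whose (assumed) labels keep the constraint true."""
    return [a for a in actions if monitor.peek(label(a))]


# Hypothetical constraint: "open" is allowed only if "pickup" happened earlier,
# i.e. (not open) or Y(O(pickup)).
CONSTRAINT = (OR, (NOT, (ATOM, "open")), (Y, (O, (ATOM, "pickup"))))
ACTIONS = ["wait", "pickup", "open"]

monitor = PastMonitor(CONSTRAINT)
before = allowed_actions(monitor, ACTIONS)  # "open" is masked
monitor.step({"pickup"})
after = allowed_actions(monitor, ACTIONS)   # "open" is now permitted
```

Because the monitor stores only the previous truth value of each subformula, the mask depends on the whole history while the per-step cost stays proportional to the size of the formula. The propositions it reads come from the labelling function, so they can differ from the features the learning agent observes.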
Original language | English |
---|---|
Pages (from-to) | 21646-21655 |
Number of pages | 10 |
Journal | Proceedings of the AAAI Conference on Artificial Intelligence |
Volume | 38 |
Issue number | 19 |
DOIs | |
Publication status | Published - 24 Mar 2024 |
Bibliographical note
Publisher Copyright: Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Funding
This work was supported by PNRR MUR project PE0000013-FAIR, partially supported by ERC Advanced Grant WhiteMech (No. 834228), EU ICT-48 2020 project TAILOR (No. 952215), the ONRG project N62909-22-1-2005, the InDAM-GNCS project “Strategic Reasoning in Mechanism Design”, and the project OCENW.M.21.377 funded by the Dutch Research Council (NWO). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising from this submission.
Funders | Funder number |
---|---|
Nederlandse Organisatie voor Wetenschappelijk Onderzoek | OCENW.M.21.377 |
PNRR | PE0000013-FAIR |
European Research Council | 834228 |
EU ICT-48 2020 | 952215 |
ONRG | N62909-22-1-2005 |