Abstract
In online reinforcement learning, strategies that leverage intrinsic rewards for exploration tend to perform well in environments characterized by deceptive or sparse rewards. Counting state visitations is an efficient count-based way to derive a suitable intrinsic reward. However, this exploration method considers only the novelty of the states the agent encounters, so the agent may over-explore certain state-action pairs and fall into a locally optimal solution. In this paper, a count-based method called the visitation count of state-action pairs (VCSAP) is proposed, which builds on the strong error-correction ability of online reinforcement learning. VCSAP counts the visitation of both individual states and state-action pairs, which not only drives the agent to visit novel states but also motivates it to select novel actions. MuJoCo is an advanced multi-joint dynamics simulator, and MuJoCo environments with sparse rewards are more challenging and closer to real-world settings. VCSAP is applied to Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), and comparative experiments against exploration baselines are conducted on multiple tasks from the MuJoCo and sparse MuJoCo benchmarks. The experimental results show that, compared with the Random Network Distillation method, PPO-VCSAP and TRPO-VCSAP improve performance by 18% and 8%, respectively, across 8 environments.
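The core idea of counting both state and state-action visitations can be sketched as below. This is a minimal illustration, not the paper's implementation: the exact bonus form, coefficients (`beta_state`, `beta_action`), and the `1/sqrt(N)` shape are assumptions drawn from the common count-based exploration convention, and the class name `CountBasedBonus` is hypothetical.

```python
from collections import defaultdict
import math

class CountBasedBonus:
    """Hypothetical sketch of an intrinsic-reward bonus that tracks both
    state visits and state-action-pair visits, in the spirit of VCSAP.
    The 1/sqrt(N) bonus and the weights are assumed conventions."""

    def __init__(self, beta_state=0.1, beta_action=0.1):
        self.state_counts = defaultdict(int)   # N(s): visits per state
        self.sa_counts = defaultdict(int)      # N(s, a): visits per pair
        self.beta_state = beta_state           # weight for state novelty
        self.beta_action = beta_action         # weight for pair novelty

    def intrinsic_reward(self, state, action):
        # Update visitation counts for the state and the state-action pair.
        self.state_counts[state] += 1
        self.sa_counts[(state, action)] += 1
        # Rarely visited states and pairs earn larger bonuses, so the
        # agent is rewarded for novel states AND for novel actions in
        # already-familiar states.
        r_state = self.beta_state / math.sqrt(self.state_counts[state])
        r_pair = self.beta_action / math.sqrt(self.sa_counts[(state, action)])
        return r_state + r_pair
```

In practice such a bonus would be added to the environment reward before the PPO or TRPO update; counting only `N(s)` would let the agent repeat the same action in a novel state, whereas the `N(s, a)` term pushes it toward untried actions as well.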
| Original language | English |
|---|---|
| Article number | 107052 |
| Number of pages | 15 |
| Journal | Neural Networks |
| Volume | 184 |
| Early online date | 1 Jan 2025 |
| DOIs | |
| Publication status | Published - Apr 2025 |
Bibliographical note
Publisher Copyright: © 2025 Elsevier Ltd
Funding
We sincerely thank the anonymous reviewers for their careful work and thoughtful suggestions, which have greatly improved this paper. This work was supported by the Natural Science Research Foundation of Jilin Province of China under Grant No. 20220101106JC, the National Natural Science Foundation of China under Grant No. 61300049, the Fundamental Research Funds for the Central Universities (Jilin University) under Grant No. 93K172022K10, the Fundamental Research Funds for the Central Universities (Northeast Normal University) under Grant Nos. 2412022QD040 and 2412022ZD018, and the National Key R&D Program of China under Grant No. 2017YFB1003103.
| Funders | Funder number |
|---|---|
| Fundamental Research Funds for the Central Universities | |
| Northeast Normal University | 2412022QD040, 2412022ZD018 |
| National Key Research and Development Program of China | 2017YFB1003103 |
| Natural Science Foundation of Jilin Province | 20220101106JC |
| Jilin University | 93K172022K10 |
| National Natural Science Foundation of China | 61300049 |
Keywords
- Count-based exploration method
- Intrinsic reward
- Online reinforcement learning
- Visitation count