VCSAP: Online reinforcement learning exploration method based on visitation count of state-action pairs

Ruikai Zhou, Wenbo Zhu, Shuai Han, Meng Kang, Shuai Lü*

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

In online reinforcement learning, strategies that leverage intrinsic rewards for exploration tend to achieve strong results in settings with deceptive or sparse rewards. Counting state visitations is an efficient count-based exploration method for deriving a suitable intrinsic reward. However, this approach considers only the novelty of the states the agent encounters, which can lead to over-exploration of certain state-action pairs and convergence to locally optimal solutions. In this paper, a count-based method called the visitation count of state-action pairs (VCSAP) is proposed, which builds on the strong error-correction ability of online reinforcement learning. VCSAP counts the visitations of both individual states and state-action pairs, which not only drives the agent to visit novel states but also motivates it to select novel actions. MuJoCo is an advanced multi-joint dynamics simulator, and MuJoCo environments with sparse rewards are more challenging and closer to real-world settings. VCSAP is applied to Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), and comparative experiments against exploration baselines are conducted on multiple tasks from the MuJoCo and sparse MuJoCo benchmarks. The experimental results show that, compared to the Random Network Distillation method, PPO-VCSAP and TRPO-VCSAP improve performance by 18% and 8%, respectively, across 8 environments.
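The core idea of counting both states and state-action pairs can be sketched as follows. Since the abstract does not give VCSAP's exact bonus formula, this is a minimal illustrative sketch assuming the common 1/sqrt(count) decay used in count-based exploration; the coefficients `beta_s` and `beta_sa` and the class name are hypothetical, not the authors' implementation.

```python
from collections import defaultdict


class CountBasedBonus:
    """Illustrative count-based intrinsic reward that rewards both
    novel states and novel (state, action) pairs, with each bonus
    decaying as 1/sqrt(visitation count)."""

    def __init__(self, beta_s=0.1, beta_sa=0.1):
        # Hypothetical weighting coefficients for the two bonus terms.
        self.beta_s = beta_s
        self.beta_sa = beta_sa
        self.state_counts = defaultdict(int)
        self.pair_counts = defaultdict(int)

    def intrinsic_reward(self, state, action):
        # Assumes a hashable (e.g. discretized) state; continuous
        # observations would first need a hashing scheme before counting.
        self.state_counts[state] += 1
        self.pair_counts[(state, action)] += 1
        bonus_s = self.beta_s / self.state_counts[state] ** 0.5
        bonus_sa = self.beta_sa / self.pair_counts[(state, action)] ** 0.5
        return bonus_s + bonus_sa
```

Under this scheme, repeating the same action in the same state drives both terms down, while a novel action in an already-visited state still earns the full state-action bonus, nudging the agent toward unexplored actions rather than only unexplored states.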

Original language: English
Article number: 107052
Number of pages: 15
Journal: Neural Networks
Volume: 184
Early online date: 1 Jan 2025
DOIs
Publication status: Published - Apr 2025

Bibliographical note

Publisher Copyright:
© 2025 Elsevier Ltd

Funding

We sincerely thank the anonymous reviewers for their careful work and thoughtful suggestions, which have greatly improved this paper. This work was supported by the Natural Science Research Foundation of Jilin Province of China under Grant No. 20220101106JC, the National Natural Science Foundation of China under Grant No. 61300049, the Fundamental Research Funds for the Central Universities (Jilin University) under Grant No. 93K172022K10, the Fundamental Research Funds for the Central Universities (Northeast Normal University) under Grant Nos. 2412022QD040 and 2412022ZD018, and the National Key R&D Program of China under Grant No. 2017YFB1003103.

Funders / Funder numbers:
Fundamental Research Funds for the Central Universities
Northeast Normal University: 2412022QD040, 2412022ZD018
National Key Research and Development Program of China: 2017YFB1003103
Natural Science Foundation of Jilin Province: 20220101106JC
Jilin University: 93K172022K10
National Natural Science Foundation of China: 61300049

Keywords

• Count-based exploration method
• Intrinsic reward
• Online reinforcement learning
• Visitation count
