Double Q-learning

H.P. van Hasselt

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

    Abstract

    In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.
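    As a rough illustration of the update described in the abstract, below is a minimal sketch of tabular Double Q-learning in Python. The table names (q_a, q_b), the step size alpha, and the discount gamma are illustrative assumptions, not the paper's notation; the sketch only shows how one estimator selects the maximizing action while the other evaluates it.

    import random
    from collections import defaultdict

    def double_q_update(q_a, q_b, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        # Randomly pick which estimator to update; the other one supplies the value
        # of the selected action, which avoids the positive bias of taking the max
        # over a single set of estimates.
        if random.random() < 0.5:
            best = max(actions, key=lambda x: q_a[(s_next, x)])   # argmax under Q^A
            target = r + gamma * q_b[(s_next, best)]              # evaluated with Q^B
            q_a[(s, a)] += alpha * (target - q_a[(s, a)])
        else:
            best = max(actions, key=lambda x: q_b[(s_next, x)])   # argmax under Q^B
            target = r + gamma * q_a[(s_next, best)]              # evaluated with Q^A
            q_b[(s, a)] += alpha * (target - q_b[(s, a)])

    # Example usage with default-initialized tables; behavior would typically be
    # selected epsilon-greedily with respect to the combined estimates.
    q_a, q_b = defaultdict(float), defaultdict(float)
    double_q_update(q_a, q_b, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])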
    Original language: English
    Title of host publication: Advances in Neural Information Processing Systems 23
    Editors: J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, A. Culotta
    Pages: 2613-2621
    Number of pages: 9
    Publication status: Published - 6 Dec 2010

    Bibliographical note

    Neural Information Processing Systems (NIPS 2010)
