Gain-based Exploration: From Multi-armed Bandits to Partially Observable Environments

B. Si, J. M. Herrmann, K. Pawelzik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We introduce gain-based policies for exploration in active learning problems. For exploration in multi-armed bandits with known reward variances, an ideal gain-maximization exploration policy is described within a unified framework that also includes error-based and counter-based exploration. For realistic situations without prior knowledge of the reward variances, we establish an upper bound on the gain function, resulting in a practical gain-maximization exploration policy that achieves optimal exploration asymptotically. Finally, we extend the gain-maximization exploration scheme to partially observable environments. Approximating the environment by a set of local bandits, the agent actively selects its actions by maximizing the discounted gain in learning the local bandits. The resulting gain-based exploration not only outperforms random exploration but also produces curiosity-driven behavior of the kind observed in natural agents.
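The abstract does not spell out the gain function, so the following is a minimal Python sketch under stated assumptions: Gaussian arms with known variances, and gain defined as the expected one-step reduction in the variance of an arm's mean estimate, sigma_i^2/n_i - sigma_i^2/(n_i + 1). The function name `gain_based_bandit` and all parameters are illustrative, not taken from the paper.

```python
import numpy as np

def gain_based_bandit(true_means, true_stds, n_steps=1000, seed=0):
    """Gain-maximization exploration on a Gaussian multi-armed bandit.

    Hypothetical gain function (an assumption, not the paper's exact
    formulation): the expected reduction in the variance of each arm's
    mean estimate after one more pull,
        gain_i = sigma_i^2 / n_i - sigma_i^2 / (n_i + 1),
    which the agent greedily maximizes at every step.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.ones(k)  # one initial pull per arm
    means = np.array([rng.normal(m, s) for m, s in zip(true_means, true_stds)])
    var = np.asarray(true_stds, dtype=float) ** 2  # known reward variances

    for _ in range(n_steps):
        # Expected reduction in estimation variance from one more pull.
        gain = var / counts - var / (counts + 1)
        arm = int(np.argmax(gain))
        reward = rng.normal(true_means[arm], true_stds[arm])
        # Incremental update of the sample mean for the pulled arm.
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts

if __name__ == "__main__":
    est_means, pulls = gain_based_bandit([0.2, 0.5, 0.9], [1.0, 0.3, 0.6])
    print("estimated means:", est_means.round(3))
    print("pulls per arm:  ", pulls)  # noisier arms attract more pulls
```

Under this gain, high-variance arms are sampled more often, which mirrors the abstract's premise that exploration effort should track estimation error rather than being allocated uniformly at random.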
Original language: English
Title of host publication: Natural Computation, 2007. ICNC 2007. Third International Conference on
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 177-182
Number of pages: 6
Volume: 1
ISBN (Print): 978-0-7695-2875-5
Publication status: Published - 1 Aug 2007

Keywords

  • decision making
  • knowledge acquisition
  • learning (artificial intelligence)
  • counter-based exploration
  • error-based exploration
  • gain-maximization exploration policy
  • multi-armed bandits
  • optimal exploration asymptotically
  • partially observable environments
  • Entropy
  • Estimation error
  • Gain measurement
  • Learning
  • Redundancy
  • Robots
  • Testing
  • Upper bound
