Abstract
A problem faced by autonomous robots is achieving quick, efficient
operation in unseen variations of their tasks after experiencing only a
subset of these variations, sampled offline at training time.
time. We model the task variability in terms of a family of MDPs differing
in transition dynamics and reward processes. When it is not possible
to experiment in the new world, e.g., in real-time situations, a policy
for novel instances may be defined by averaging over the policies of the
offline instances. This is suboptimal in the general case, so we propose
an alternative model that draws on the methodology of hierarchical
reinforcement learning, wherein we learn partial policies for partial
goals (subtasks) in the offline MDPs, in the form of options, and treat
solving a novel MDP as a sequential composition of these partial policies.
Our procedure utilises a modified version of option interruption for
control switching, in which the interruption signal is acquired from
offline experience.
We also show that performance advantages can be attained in situations
where the task can be decomposed into concurrent subtasks, allowing us
to devise an alternative control structure that emphasises flexible
switching and concurrent use of policy fragments.
We demonstrate the utility of these ideas using example gridworld
domains with task variability.
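
The abstract only sketches the control structure. As an illustrative sketch, not code from the paper, the following Python fragment shows one way sequential composition of options with value-based interruption might look. The `Gridworld`, `Option`, `select_option`, and `run_with_interruption` names and the single-option example are hypothetical, and the offline option-value estimates are assumed to be given.

```python
import random

class Gridworld:
    """A tiny 1-D corridor: states 0..n-1, actions -1/+1, reward 1 on reaching the goal."""
    def __init__(self, n=10, goal=9):
        self.n, self.goal = n, goal
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state = min(max(self.state + action, 0), self.n - 1)
        done = (self.state == self.goal)
        return self.state, (1.0 if done else 0.0), done

class Option:
    """A partial policy with an initiation set, a deterministic policy, and termination probabilities."""
    def __init__(self, name, initiation_set, policy, termination):
        self.name = name
        self.initiation_set = initiation_set  # states where the option may be invoked
        self.policy = policy                  # dict: state -> action
        self.termination = termination        # dict: state -> probability of terminating

def select_option(state, options, option_values):
    """Greedy choice among the options admissible at this state, using offline value estimates."""
    candidates = [o for o in options if state in o.initiation_set]
    return max(candidates, key=lambda o: option_values.get((state, o.name), 0.0))

def run_with_interruption(env, options, option_values, max_steps=100):
    """Compose options sequentially, interrupting the running option whenever the
    offline value estimates indicate a strictly better admissible option at the current state."""
    state = env.reset()
    current = select_option(state, options, option_values)
    for _ in range(max_steps):
        state, reward, done = env.step(current.policy[state])
        if done:
            break
        if random.random() < current.termination.get(state, 0.0):
            current = select_option(state, options, option_values)  # natural termination
            continue
        best = select_option(state, options, option_values)
        if option_values.get((state, best.name), 0.0) > option_values.get((state, current.name), 0.0):
            current = best  # interruption: switch control to the better option
    return state

# Hypothetical usage: a single "go right" option solving the corridor.
env = Gridworld()
go_right = Option("go_right", set(range(10)), {s: +1 for s in range(10)}, {9: 1.0})
values = {(s, "go_right"): 1.0 for s in range(10)}
print(run_with_interruption(env, [go_right], values))  # reaches state 9
```

The interruption test in the sketch mirrors the switching rule described above: control passes to another option only when the value estimates acquired offline favour it at the current state.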
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings of the 5th International Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems (ERLARS) |
| Number of pages | 8 |
| Publication status | Published - 2012 |