Learning to plan: planning as an action in simple reinforcement learning agents
G. E. Wimmer 1* and M. van der Meer 2
1 Columbia University, United States
2 University of Minnesota, Department of Neuroscience, United States
Current neuroscientific theories of decision making emphasize that behavior can be controlled by different brain systems with different properties. A common distinction is that between a model-free, stimulus-response, "habit" system on the one hand, and a model-based, flexible "planning" system on the other. Planning tends to be prominent early during learning before transitioning to more habitual control, and is often specific to important choice points (e.g. Tolman, 1938), implying that planning processes can be selectively engaged as circumstances demand. Current models of model-based decision making lack a mechanism to account for selective planning; for instance, the influential Daw et al. (2005) model plans at every action, using an external mechanism to arbitrate between planned and model-free control. Thus, there is currently no model of planning that defines its relationship to model-free control while respecting how humans and animals actually behave. To address this, we explored a "T-maze grid world" reinforcement learning model in which the agent can choose to plan. The value of planning is learned along with that of other actions (turn left, etc.) and is updated after an N-step fixed policy (the "plan") is executed, offset by a fixed planning cost. The contents of the plan consist of either a random sequence of moves (random plan control) or the sequence of moves that leads to the highest-valued state on the agent’s internal value function (true plan). Consistent with previous results (Sutton, 1990), we find that planning speeds learning. Furthermore, while agents plan frequently during initial learning, with experience the non-planning actions gradually increase in value and win out. Interestingly, even in this simple environment, the agent shows a selective increase in planning actions specifically at the choice point under appropriate conditions. We explore a number of variations of the model, including a hierarchical version in which action values are learned separately for the model-free and model-based controllers. Thus, a simple Q-learning model that includes an added planning action, the value of which is learned alongside that of other actions, can reproduce two salient aspects of planning data: planning is prominent early but gives way to habitual control with experience, and planning occurs specifically at points appropriate to the structure of the task. The fact that these phenomena can be learned in a simple reinforcement learning architecture suggests this as an alternative to models that use a supplemental arbitration mechanism between planning and habitual control. By treating planning as a choice, the model can generate specific predictions about when, where in the environment, and how far ahead or for how long agents may choose to plan. More generally, the current approach is an example of blurring the boundary between agent and environment, such that actions (like planning) can be internal to the agent rather than having to affect the environment, and a demonstration that the state space can include internal variables (such as the contents of a plan), similar to the role of working memory (O’Reilly and Frank, 2006; Zilli and Hasselmo, 2007).
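The mechanism described in the abstract can be sketched in a few dozen lines of tabular Q-learning. The sketch below is illustrative only and is not the authors' implementation: the maze layout, learning parameters, fixed planning cost, and the greedy N-step rollout used to build the "true plan" are all assumptions chosen to match the verbal description above (the random-plan control is omitted for brevity).

```python
"""Minimal sketch (not the authors' code) of tabular Q-learning with a
learned "plan" action in a small T-maze grid world.  Maze layout,
parameters, and the greedy N-step rollout are illustrative assumptions."""

import random

# Walkable cells: a vertical stem ending in a horizontal arm; reward sits
# at the end of the left arm, so the junction (0, 3) is the choice point.
STEM = [(0, y) for y in range(4)]
ARMS = [(x, 3) for x in (-2, -1, 1, 2)]
CELLS = set(STEM + ARMS)
START, GOAL = (0, 0), (-2, 3)

MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}
ACTIONS = list(MOVES) + ["plan"]            # planning is just another action

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1      # assumed learning parameters
PLAN_COST, PLAN_STEPS = 0.05, 3             # fixed planning cost, N-step plan

Q = {(s, a): 0.0 for s in CELLS for a in ACTIONS}

def step(state, move):
    """One primitive move; bumping into a wall leaves the agent in place."""
    dx, dy = MOVES[move]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt not in CELLS:
        nxt = state
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def state_value(s):
    """State value under the primitive (non-planning) actions."""
    return max(Q[(s, m)] for m in MOVES)

def make_plan(state):
    """Greedy N-step rollout toward higher-valued states (the 'true plan')."""
    plan, s = [], state
    for _ in range(PLAN_STEPS):
        move = max(MOVES, key=lambda m: state_value(step(s, m)[0]))
        plan.append(move)
        s = step(s, move)[0]
    return plan

def choose(state):
    """Epsilon-greedy over all actions, including 'plan'; random tie-breaks."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: (Q[(state, a)], random.random()))

for episode in range(500):
    s, done = START, False
    for _ in range(200):                    # cap episode length
        a = choose(s)
        if a == "plan":
            # Execute the N-step fixed policy, accumulate discounted reward,
            # and charge the fixed planning cost against the plan's value.
            ret, s2, disc = -PLAN_COST, s, 1.0
            for move in make_plan(s):
                s2, r, done = step(s2, move)
                ret += disc * r
                disc *= GAMMA
                if done:
                    break
            target = ret + disc * state_value(s2) * (not done)
        else:
            s2, r, done = step(s, a)
            target = r + GAMMA * state_value(s2) * (not done)
        # The same Q-learning update applies to planning and primitive actions.
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2
        if done:
            break
```

Under these assumptions, tallying how often "plan" is selected in each cell across episodes (for example at the junction (0, 3)) should show the qualitative pattern the abstract describes: planning is chosen frequently early on, then loses out to primitive actions as their values converge, with any residual planning concentrated at the choice point.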
Conference: Computational and Systems Neuroscience 2010, Salt Lake City, UT, United States, 25 Feb - 2 Mar, 2010.
Presentation Type: Poster Presentation
Topic: Poster session II
Citation: Wimmer GE and van der Meer M (2010). Learning to plan: planning as an action in simple reinforcement learning agents. Front. Neurosci. Conference Abstract: Computational and Systems Neuroscience 2010. doi: 10.3389/conf.fnins.2010.03.00246
Copyright:
The abstracts in this collection have not been subject to any Frontiers peer review or checks, and are not endorsed by Frontiers.
They are made available through the Frontiers publishing platform as a service to conference organizers and presenters.
The copyright in the individual abstracts is owned by the author of each abstract or his/her employer unless otherwise stated.
Each abstract, as well as the collection of abstracts, are published under a Creative Commons CC-BY 4.0 (attribution) licence (https://creativecommons.org/licenses/by/4.0/) and may thus be reproduced, translated, adapted and be the subject of derivative works provided the authors and Frontiers are attributed.
For Frontiers’ terms and conditions please see https://www.frontiersin.org/legal/terms-and-conditions.
Received: 05 Mar 2010; Published Online: 05 Mar 2010.
* Correspondence: G E Wimmer, Columbia University, New York, United States, e.wimmer@ucl.ac.uk