Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Citation:

Bennett, D., Niv, Y., & Langdon, A. J. (2021). Value-free reinforcement learning: policy optimization as a minimal model of operant behavior. Current Opinion in Behavioral Sciences, 41, 114–121.

Abstract:

Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value-learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of ‘value’ in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
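To make the distinction concrete, the sketch below (not from the paper; arm probabilities, learning rate, and variable names are illustrative assumptions) contrasts a value-based learner, which updates action values and chooses by comparing them, with a value-free policy-gradient (REINFORCE-style) learner, which adjusts a behavioral policy directly from reward, on a simple two-armed bandit.

```python
# Minimal sketch (illustrative, not the paper's model): value-based vs.
# policy-gradient learning on a two-armed bandit.
import numpy as np

rng = np.random.default_rng(0)
reward_probs = [0.3, 0.7]   # assumed reward probability for each arm
n_trials = 1000
alpha = 0.1                 # learning rate for both learners

q = np.zeros(2)             # value-based learner: learned action values
theta = np.zeros(2)         # policy-gradient learner: action preferences (no values)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for t in range(n_trials):
    # Value-based: choose by comparing learned values, then update the
    # chosen action's value with a prediction-error (delta) rule.
    a = rng.choice(2, p=softmax(q))
    r = float(rng.random() < reward_probs[a])
    q[a] += alpha * (r - q[a])

    # Policy-gradient (REINFORCE): adjust policy parameters directly in the
    # direction of reward-weighted log-likelihood, with no value estimates.
    pi = softmax(theta)
    a_pg = rng.choice(2, p=pi)
    r_pg = float(rng.random() < reward_probs[a_pg])
    grad_log_pi = -pi
    grad_log_pi[a_pg] += 1.0            # d log pi(a_pg) / d theta
    theta += alpha * r_pg * grad_log_pi

print("learned action values:", np.round(q, 2))
print("learned policy:       ", np.round(softmax(theta), 2))
```

Both learners come to favor the richer arm, but only the first does so by representing and comparing action values; the second arrives at the same behavior while storing nothing but the policy itself.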

