Publications

2010
Gershman, S. J., Cohen, J. D., & Niv, Y. (2010). Learning to selectively attend. In 32nd Annual Conference of the Cognitive Science Society.
How is reinforcement learning possible in a high-dimensional world? Without making any assumptions about the structure of the state space, the amount of data required to effectively learn a value function grows exponentially with the state space’s dimensionality. However, humans learn to solve high-dimensional problems much more rapidly than would be expected under this scenario. This suggests that humans employ inductive biases to guide (and accelerate) their learning. Here we propose one particular bias—sparsity—that ameliorates the computational challenges posed by high-dimensional state spaces, and present experimental evidence that humans can exploit sparsity information when it is available.
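The sparsity bias described above lends itself to a compact illustration. The sketch below is not the authors' model and all task parameters are hypothetical; it shows a feature-based learner that splits credit for a reward prediction error across stimulus dimensions in proportion to how predictive each dimension has been, so learning concentrates on the single relevant dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_features, n_trials = 3, 3, 500
relevant_dim, rewarded_feature = 0, 2    # hypothetical ground truth
alpha = 0.1                              # learning rate

# One weight per (dimension, feature); stimulus value is the sum of weights.
w = np.zeros((n_dims, n_features))

for t in range(n_trials):
    stim = rng.integers(n_features, size=n_dims)   # one feature per dimension
    value = w[np.arange(n_dims), stim].sum()       # additive value estimate
    reward = float(stim[relevant_dim] == rewarded_feature)
    delta = reward - value                         # prediction error
    # Sparsity-inspired credit assignment: dimensions whose weights are more
    # differentiated (larger spread) receive a larger share of the update.
    spread = w.std(axis=1) + 1e-3
    attention = spread / spread.sum()
    w[np.arange(n_dims), stim] += alpha * attention * delta

print("learned weights per dimension:\n", np.round(w, 2))
```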
Niv, Y., & Gershman, S. J. (2010). Representation Learning and Reinforcement Learning: An fMRI study of learning to selectively attend. In Society for Neuroscience Abstracts.
2009
Niv, Y., & Montague, R. P. (2009). Theoretical and Empirical Studies of Learning. Neuroeconomics, 331–351. Elsevier.
This chapter introduces the reinforcement learning framework and gives a brief background to the origins and history of reinforcement learning models of decision-making. Reinforcement learning provides a normative framework within which conditioning can be analyzed. That is, it suggests a means by which optimal prediction and action selection can be achieved, and makes explicit the computations that must be realized in the service of these. In contrast to descriptive models that describe behavior as it is, normative models study behavior from the point of view of its hypothesized function: that is, they study behavior as it should be if it were to accomplish specific goals in an optimal way. The appeal of normative models derives from several sources. Historically, the core ideas in reinforcement learning arose from two separate and parallel lines of research. One axis is mainly associated with Richard Sutton, formerly an undergraduate psychology major, and his PhD advisor, Andrew Barto, a computer scientist. Interested in artificial intelligence and agent-based learning, Sutton and Barto developed algorithms for reinforcement learning that were inspired by the psychological literature on Pavlovian and instrumental conditioning.
Botvinick, M. M., Niv, Y., & Barto, A. G. (2009). Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3), 262–280.
Research on human and animal behavior has long emphasized its hierarchical structure: the divisibility of ongoing behavior into discrete tasks, which are composed of subtask sequences, which in turn are built of simple actions. The hierarchical structure of behavior has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. In this paper, we reexamine behavioral hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning. Specifically, we consider a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills. A close look at the components of hierarchical reinforcement learning suggests how they might map onto neural structures, in particular regions within the dorsolateral and orbital prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement learning might provide a complement to existing psychological models of hierarchically structured behavior. A particularly important question that hierarchical reinforcement learning brings to the fore is that of how learning identifies new action routines that are likely to provide useful building blocks in solving a wide range of future problems. Here and at many other points, hierarchical reinforcement learning offers an appealing framework for investigating the computational and neural underpinnings of hierarchically structured behavior.
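For readers unfamiliar with hierarchical reinforcement learning, the "reusable subroutines or skills" mentioned above are commonly formalized as options: a set of states where the skill can start, a policy that runs while it is active, and a termination condition. A minimal sketch follows; the corridor task and all names are hypothetical, not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

State = int
Action = int

@dataclass
class Option:
    """A temporally extended action: where it can start, how it acts, when it ends."""
    initiation_set: Set[State]
    policy: Callable[[State], Action]
    termination_prob: Callable[[State], float]

# Hypothetical example: an option that walks right along a corridor until state 5.
go_right = Option(
    initiation_set={0, 1, 2, 3, 4},
    policy=lambda s: +1,
    termination_prob=lambda s: 1.0 if s == 5 else 0.0,
)

def run_option(opt: Option, s: State, step: Callable[[State, Action], State]) -> State:
    """Execute an option until its termination condition fires."""
    assert s in opt.initiation_set
    while random.random() >= opt.termination_prob(s):
        s = step(s, opt.policy(s))
    return s

print(run_option(go_right, 0, step=lambda s, a: s + a))  # -> 5
```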
Todd, M. T., Niv, Y., & Cohen, J. D. (2009). Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 1689–1696).
Working memory is a central topic of cognitive neuroscience because it is critical for solving real-world problems in which information from multiple temporally distant sources must be combined to generate appropriate behavior. However, an often neglected fact is that learning to use working memory effectively is itself a difficult problem. The Gating framework is a collection of psychological models that show how dopamine can train the basal ganglia and prefrontal cortex to form useful working memory representations in certain types of problems. We unite Gating with machine learning theory concerning the general problem of memory-based optimal control. We present a normative model that learns, by online temporal difference methods, to use working memory to maximize discounted future reward in partially observable settings. The model successfully solves a benchmark working memory problem, and exhibits limitations similar to those observed in humans. Our purpose is to introduce a concise, normative definition of high level cognitive concepts such as working memory and cognitive control in terms of maximizing discounted future rewards.
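A toy version of the idea: if gating decisions are treated as actions and the agent's state is augmented with the contents of memory, ordinary temporal difference learning can discover when storing a cue pays off. The sketch below is a bare-bones stand-in for the Gating framework, not the Todd et al. model, and uses a hypothetical one-cue delayed-response task.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, gamma, eps, n_trials = 0.1, 1.0, 0.1, 5000

# Q-values indexed by (memory, observation, action).
# memory: 0 = empty, 1 = "cue A stored", 2 = "cue B stored"
# observation: 0/1 = cue A/B on screen, 2 = probe
# actions at the cue: 0 = ignore, 1 = gate the cue into memory
# actions at the probe: 0 = respond "A", 1 = respond "B"
Q = np.zeros((3, 3, 2))

def choose(mem, obs):
    if rng.random() < eps:
        return int(rng.integers(2))
    return int(np.argmax(Q[mem, obs]))

for _ in range(n_trials):
    cue = int(rng.integers(2))
    mem = 0
    # --- cue step: the only decision is whether to gate ---
    gate = choose(mem, cue)
    next_mem = cue + 1 if gate == 1 else 0
    # --- probe step: respond based on (possibly empty) memory ---
    resp = choose(next_mem, 2)
    reward = float(resp == cue)
    # TD backups: probe step first (terminal), then cue step bootstraps from it
    Q[next_mem, 2, resp] += alpha * (reward - Q[next_mem, 2, resp])
    Q[mem, cue, gate] += alpha * (gamma * np.max(Q[next_mem, 2]) - Q[mem, cue, gate])

# Gating should acquire the higher value at both cues.
print("Q[ignore, gate] at cue A:", np.round(Q[0, 0], 2),
      " at cue B:", np.round(Q[0, 1], 2))
```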
Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3), 139–154.
A wealth of research focuses on the decision-making processes that animals and humans employ when selecting actions in the face of reward and punishment. Initially such work stemmed from psychological investigations of conditioned behavior, and explanations of these in terms of computational models. Increasingly, analysis at the computational level has drawn on ideas from reinforcement learning, which provide a normative framework within which decision-making can be analyzed. More recently, the fruits of these extensive lines of research have made contact with investigations into the neural basis of decision making. Converging evidence now links reinforcement learning to specific neural substrates, assigning them precise computational roles. Specifically, electrophysiological recordings in behaving animals and functional imaging of human decision-making have revealed in the brain the existence of a key reinforcement learning signal, the temporal difference reward prediction error. Here, we first introduce the formal reinforcement learning framework. We then review the multiple lines of evidence linking reinforcement learning to the function of dopaminergic neurons in the mammalian midbrain and to more recent data from human imaging experiments. We further extend the discussion to aspects of learning not associated with phasic dopamine signals, such as learning of goal-directed responding that may not be dopamine-dependent, and learning about the vigor (or rate) with which actions should be performed that has been linked to tonic aspects of dopaminergic signaling. We end with a brief discussion of some of the limitations of the reinforcement learning framework, highlighting questions for future research.
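The "temporal difference reward prediction error" at the heart of this review is the quantity δ = r + γV(s') − V(s). A minimal simulation of a cue-then-reward trial (assumptions: a five-step state chain, γ = 1, tabular values, not any specific model from the review) shows the error appearing at reward time early in learning and vanishing once the reward is predicted.

```python
import numpy as np

# States 0..4 represent time steps within a trial; the cue appears at state 0
# and reward r = 1 arrives on the transition out of state 4.
n_states, alpha, gamma, n_trials = 5, 0.1, 1.0, 200
V = np.zeros(n_states + 1)            # extra terminal state with V = 0

for trial in range(n_trials):
    deltas = []
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0
        delta = r + gamma * V[s + 1] - V[s]     # TD reward prediction error
        V[s] += alpha * delta
        deltas.append(round(delta, 2))
    if trial in (0, n_trials - 1):
        print(f"trial {trial:3d}  prediction errors: {deltas}")
```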
2008
Niv, Y., & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12(7), 265–272.
The recognition that computational ideas from reinforcement learning are relevant to the study of neural circuits has taken the cognitive neuroscience community by storm. A central tenet of these models is that discrepancies between actual and expected outcomes can be used for learning. Neural correlates of such prediction-error signals have been observed now in midbrain dopaminergic neurons, striatum, amygdala and even prefrontal cortex, and models incorporating prediction errors have been invoked to explain complex phenomena such as the transition from goal-directed to habitual behavior. Yet, like any revolution, the fast-paced progress has left an uneven understanding in its wake. Here, we provide answers to ten simple questions about prediction errors, with the aim of exposing both the strengths and the limitations of this active area of neuroscience research.
Schiller, D., Levy, I., Niv, Y., LeDoux, J. E., & Phelps, E. A. (2008). From Fear to Safety and Back: Reversal of Fear in the Human Brain. Journal of Neuroscience, 28(45), 11517–11525.
Fear learning is a rapid and persistent process that promotes defense against threats and reduces the need to relearn about danger. However, it is also important to flexibly readjust fear behavior when circumstances change. Indeed, a failure to adjust to changing conditions may contribute to anxiety disorders. A central, yet neglected aspect of fear modulation is the ability to flexibly shift fear responses from one stimulus to another if a once-threatening stimulus becomes safe or a once-safe stimulus becomes threatening. In these situations, the inhibition of fear and the development of fear reactions co-occur but are directed at different targets, requiring accurate responding under continuous stress. To date, research on fear modulation has focused mainly on the shift from fear to safety by using paradigms such as extinction, resulting in a reduction of fear. The aim of the present study was to track the dynamic shifts from fear to safety and from safety to fear when these transitions occur simultaneously. We used functional neuroimaging in conjunction with a fear-conditioning reversal paradigm. Our results reveal a unique dissociation within the ventromedial prefrontal cortex between a safe stimulus that previously predicted danger and a "naive" safe stimulus. We show that amygdala and striatal responses tracked the fear-predictive stimuli, flexibly flipping their responses from one predictive stimulus to another. Moreover, prediction errors associated with reversal learning correlated with striatal activation. These results elucidate how fear is readjusted to appropriately track environmental changes, and the brain mechanisms underlying the flexible control of fear.
Dayan, P. D., & Niv, Y. (2008). Reinforcement learning: The Good, The Bad and The Ugly. Current Opinion in Neurobiology, 18(2), 185–196.
Reinforcement learning provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making in the face of rewards and punishments. Here we review the latest dispatches from the forefront of this field, and map out some of the territories where lie monsters.
Takahashi, Y. K. (2008). Silencing the critics: understanding the effects of cocaine sensitization on dorsolateral and ventral striatum in the context of an actor/critic model. Frontiers in Neuroscience, 2(1), 86–99.
A critical problem in daily decision making is how to choose actions now in order to bring about rewards later. Indeed, many of our actions have long-term consequences, and it is important not to be myopic in balancing the pros and cons of different options, but rather to take into account both immediate and delayed consequences of actions. Failures to do so may be manifest as persistent, maladaptive decision-making, one example of which is addiction, where behavior seems to be driven by the immediate positive experiences with drugs, despite the delayed adverse consequences. A recent study by Takahashi et al. (2007) investigated the effects of cocaine sensitization on decision making in rats and showed that drug use resulted in altered representations in the ventral striatum and the dorsolateral striatum, areas that have been implicated in the neural instantiation of a computational solution to optimal long-term action selection called the Actor/Critic framework. In this Focus article we discuss their results and offer a computational interpretation in terms of drug-induced impairments in the Critic. We first survey the different lines of evidence linking the subparts of the striatum to the Actor/Critic framework, and then suggest two possible scenarios of breakdown that are suggested by Takahashi et al.'s (2007) data. As both are compatible with the current data, we discuss their different predictions and how these could be empirically tested in order to further elucidate (and hopefully inch towards curing) the neural basis of drug addiction.
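As context for the Actor/Critic discussion above, here is a minimal actor/critic loop in which a single prediction error, computed by the critic, trains both the critic's state values and the actor's action preferences. The three-state environment is hypothetical and the sketch is not the model analyzed in the article.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 3, 2
alpha_v, alpha_p, gamma, n_trials = 0.1, 0.1, 0.9, 2000

V = np.zeros(n_states)                    # critic: state values
prefs = np.zeros((n_states, n_actions))   # actor: action preferences

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(n_trials):
    s = 0
    while s < n_states:
        p = softmax(prefs[s])
        a = rng.choice(n_actions, p=p)
        # hypothetical environment: action 1 advances; reward at the last state
        s_next = s + 1 if a == 1 else s
        r = 1.0 if (a == 1 and s == n_states - 1) else 0.0
        v_next = V[s_next] if s_next < n_states else 0.0
        delta = r + gamma * v_next - V[s]   # the critic's prediction error
        V[s] += alpha_v * delta             # critic learns values...
        prefs[s, a] += alpha_p * delta      # ...and the same error teaches the actor
        s = s_next

print("state values:", np.round(V, 2))
print("P(advance) per state:", np.round([softmax(row)[1] for row in prefs], 2))
```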
2007
Niv, Y. (2007). Cost, benefit, tonic, phasic: What do response rates tell us about dopamine and motivation? Annals of the New York Academy of Sciences, 1104, 357–376.
The role of dopamine in decision making has received much attention from both the experimental and computational communities. However, because reinforcement learning models concentrate on discrete action selection and on phasic dopamine signals, they are silent as to how animals decide upon the rate of their actions, and they fail to account for the prominent effects of dopamine on response rates. We suggest an extension to reinforcement learning models in which response rates are optimally determined by balancing the tradeoff between the cost of fast responding and the benefit of rapid reward acquisition. The resulting behavior conforms well with numerous characteristics of free-operant responding. More importantly, this framework highlights a role for a tonic signal corresponding to the net rate of rewards in determining the optimal rate of responding. We hypothesize that this critical quantity is conveyed by tonic levels of dopamine, explaining why dopaminergic manipulations exert a global effect on response rates. We further suggest that the effects of motivation on instrumental rates of responding are mediated through its influence on the net reward rate, implying a tight coupling between motivational states and tonic dopamine. The relationships between phasic and tonic dopamine signaling, and between directing and energizing effects of motivation, as well as the implications for motivational control of habitual and goal-directed instrumental action selection, are discussed.
Niv, Y. (2007). The effects of motivation on habitual instrumental behavior (Doctoral dissertation). The Hebrew University of Jerusalem.
This thesis provides a normative computational analysis of how motivation affects decision making. More specifically, we provide a reinforcement learning model of optimal self-paced (free-operant) learning and behavior, and use it to address three broad classes of questions: (1) Why do animals work harder in some instrumental tasks than in others? (2) How do motivational states affect responding in such tasks, particularly in those cases in which behavior is habitual, that is, when responding is insensitive to changes in the specific worth of its goals, such as a higher value of food when hungry rather than sated? and (3) Why do dopaminergic manipulations cause global changes in the vigor of responding, and how is this related to prominent accounts of the role of dopamine in providing basal ganglia and frontal cortical areas with a reward prediction error signal that can be used for learning to choose between actions? A fundamental question in behavioral neuroscience concerns the decision-making processes by which animals and humans select actions in the face of reward and punishment. In Chapter 1 we provide a brief overview of the current status of this research, focused on three themes: behavior, computation and neural substrates. In behavioral psychology, this question has been investigated through the paradigms of Pavlovian (classical) and instrumental (operant) conditioning, and much evidence has accumulated regarding the associations that control different aspects of learned behavior. The computational field of reinforcement learning has provided a normative framework within which conditioned behavior can be understood. In this, optimal action selection is based on predictions of long-run future consequences, such that decision making is aimed at maximizing rewards and minimizing punishment. Neuroscientific evidence from lesion studies, pharmacological manipulations and electrophysiological recordings in behaving animals has further provided tentative links to neural structures underlying key computational constructs in these models. Most notably, much evidence suggests that the neuromodulator dopamine provides basal ganglia target structures with a reward prediction error that can influence learning and action selection, particularly in stimulus-driven habitual instrumental behavior. However, although reinforcement learning models have long promised to unify computational, psychological and neural accounts of appetitively conditioned behavior, we claim here that they suffer from a large theoretical oversight. While a bulk of data on animal conditioning comes from free-operant experiments measuring how fast animals will work for reinforcement, existing reinforcement learning models lack any notion of vigor or response rate, focusing instead only on competition between different responses, and so they are silent about these tasks. In Chapter 2 we first review the basic characteristics of free-operant behavior, illustrating the effects of reinforcement schedules on rates of responding. We then develop a reinforcement learning model in which vigor selection is optimized together with response selection. The model suggests that subjects choose how vigorously to perform selected actions by optimally balancing the costs and benefits of different speeds of responding.
Importantly, we show that this model accounts normatively for effects of reinforcement schedules on response rates, such as the fact that responding on ratio schedules is faster than responding on interval schedules that yield the same rate of reinforcement. Finally, the model highlights the importance of the net rate of rewards in quantifying the opportunity cost of time, and thus in determining response vigor. In Chapter 3 we flesh out the implications of this model for the motivational control of habitual behavior. In general, understanding the effects of motivation on instrumental action selection is fundamental to the study of decision making. Recent work has shown that motivational control can be used to divide instrumental behavior into two classes: ‘goal-directed' behavior is immediately sensitive to motivation-induced changes in the values of its specific consequences, while ‘habitual' behavior is not. Because habitual behavior constitutes a large proportion of our daily activities, it is thus important to ask how motivation affects habitual behavior. That is, how can habitual behavior be performed so as to achieve motivationally relevant goals? We start by defining motivation as a mapping from outcomes to utilities. Incorporating this into the computational framework of optimal response rates, we show that in general, the optimal effects of motivation on behavior should be two-fold: On the one hand, motivation should affect the choice between actions such that actions leading to those outcomes that are more highly valued are more probable. This corresponds to the traditional directing effect of motivation. On the other hand, by influencing the opportunity cost of time, motivation should affect the response rates of all chosen actions, irrespective of their specific outcomes, as in the decades-old (but controversial) idea that motivation energizes behavior. This global effect of motivation explains not only why hungry rats work harder for food, but also sheds light on the counterintuitive observation that they will sometimes work harder for water. Based on the computational view of habitual behavior as arising from cached values summarizing long-run reward predictions, we suggest that habitual action selection can direct responding properly only in those motivational states which pertained during behavioral training. However, this does not imply insensitivity to novel motivational states. In these, we propose that the outcome-independent, global effects of motivation can ‘energize' habitual actions, as a well-founded approximation to the optimal solution in a trained situation.
That is, habitual response rates can be adjusted to the current motivational state, in a way that is optimal given the computational limitations of the habitual system. Our computational framework suggests that the energizing function of motivation is mediated by the expected net rate of rewards. In Chapter 4, we put forth the hypothesis that this important quantity is reported by tonic levels of dopamine. Dopamine neurotransmission has long been known to exert a powerful influence over the vigor, strength or rate of responding. However, there exists no clear understanding of the computational foundation for this effect. Previous reinforcement learning models of habitual behavior have indeed suggested an interpretation of the function of dopaminergic signals in the brain. However, these have concentrated only on the role of precisely timed phasic dopaminergic signals in learning the predictive value of different actions, and have ignored both tonic dopamine transmission and response vigor. Our tonic dopamine hypothesis focuses on the involvement of dopamine in the control of vigor, explaining why higher levels of dopamine are associated with globally higher response rates, i.e., why, like motivation, dopamine ‘energizes' behavior. In this way, through the computational framework of optimal choice of response rates, we suggest an explanation of the motivational control of habitual behavior, on both the behavioral and the neural levels. Reinforcement learning models of animal learning are appealing not only because they provide a normative basis for decision-making, but also because they show that optimal action selection can be learned through online incremental experience with the environment, using only locally available information. To complete the picture of how dopamine influences free-operant learning and behavior, in Chapter 5 we describe an online algorithm of the type usually associated with instrumental learning and decision-making, which is suitable for learning to select actions and latencies according to our new framework. There are two major differences between learning in our model and previous online reinforcement learning algorithms: First, most prior applications have dealt with discounted reinforcement learning while we use average reward reinforcement learning. Second, unlike previous models that have focused on discrete action selection, the action space in our model is inherently continuous, as it includes a choice of response latency. We thus propose a new online learning algorithm that is specifically suitable for our needs. In this, building on the experimental characteristics of response latencies, we suggest a functional parameterization of the action space that drastically reduces the complexity of learning. Moreover, we suggest a formulation of online action selection in which response rates are directly affected by the net reward rate. We show that our algorithm learns to respond appropriately, and with nearly optimal latencies, and discuss its implications for the differences between the learning of interval and ratio schedules. In Chapter 6, the last of the main chapters, we deviate from the theoretical analysis of behavior, to describe two instrumental conditioning
Niv, Y., & Rivlin-Etzion, M. (2007). Parkinson's Disease: Fighting the Will? Journal of Neuroscience, 27(44), 11777–11779.
A phenomenon familiar to clinicians treating patients with Parkinson's disease (PD) is kinesia paradoxica: astonishing displays of sudden mobility and agility by otherwise akinetic PD patients in instances of emergency (for instance,
Niv, Y., Daw, N. D., Joel, D., & Dayan, P. D. (2007). Tonic dopamine: Opportunity costs and the control of response vigor. Psychopharmacology, 191(3), 507–520.
RATIONALE: Dopamine neurotransmission has long been known to exert a powerful influence over the vigor, strength, or rate of responding. However, there exists no clear understanding of the computational foundation for this effect; predominant accounts of dopamine's computational function focus on a role for phasic dopamine in controlling the discrete selection between different actions and have nothing to say about response vigor or indeed the free-operant tasks in which it is typically measured. OBJECTIVES: We seek to accommodate free-operant behavioral tasks within the realm of models of optimal control and thereby capture how dopaminergic and motivational manipulations affect response vigor. METHODS: We construct an average reward reinforcement learning model in which subjects choose both which action to perform and also the latency with which to perform it. Optimal control balances the costs of acting quickly against the benefits of getting reward earlier and thereby chooses a best response latency. RESULTS: In this framework, the long-run average rate of reward plays a key role as an opportunity cost and mediates motivational influences on rates and vigor of responding. We review evidence suggesting that the average reward rate is reported by tonic levels of dopamine putatively in the nucleus accumbens. CONCLUSIONS: Our extension of reinforcement learning models to free-operant tasks unites psychologically and computationally inspired ideas about the role of tonic dopamine in striatum, explaining from a normative point of view why higher levels of dopamine might be associated with more vigorous responding.
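A stripped-down version of the tradeoff in this framework: if responding at latency tau incurs a vigor cost proportional to 1/tau, while every second of delay forgoes the average reward rate R, then the total cost c/tau + R*tau is minimized at tau* = sqrt(c/R). The snippet below uses this simplified cost structure and hypothetical numbers, not the full model of the paper, to show how a higher net reward rate (putatively signaled by tonic dopamine) shortens the optimal latency.

```python
import numpy as np

def optimal_latency(vigor_cost, avg_reward_rate):
    """Latency minimizing vigor_cost / tau + avg_reward_rate * tau.

    Setting the derivative -vigor_cost / tau**2 + avg_reward_rate to zero
    gives tau* = sqrt(vigor_cost / avg_reward_rate): the opportunity cost of
    time (the net reward rate) is what makes fast responding worthwhile.
    """
    return np.sqrt(vigor_cost / avg_reward_rate)

for rbar in (0.5, 1.0, 2.0, 4.0):   # hypothetical average reward rates
    print(f"reward rate = {rbar:.1f}  ->  optimal latency = {optimal_latency(1.0, rbar):.2f}")
```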
2006
Daw, N. D., Niv, Y., & Dayan, P. D. (2006). Actions, Policies, Values, and the Basal Ganglia. In E. Bezard (Ed.), Recent Breakthroughs in Basal Ganglia Research (pp. 111–130). Nova Science Publishers Inc.
Niv, Y., Daw, N. D., & Dayan, P. D. (2006). Choice values. Nature Neuroscience, 9(8), 987–988.
Dopaminergic neurons are thought to inform decisions by reporting errors in reward prediction. A new study reports dopaminergic responses as monkeys make choices, supporting one computational theory of appetitive learning.
Dayan, P. D., Niv, Y., Seymour, B., & Daw, N. D. (2006). The misbehavior of value and the discipline of the will. Neural Networks, 19(8), 1153–1160.
Most reinforcement learning models of animal conditioning operate under the convenient, though fictive, assumption that Pavlovian conditioning concerns prediction learning whereas instrumental conditioning concerns action learning. However, it is only through Pavlovian responses that Pavlovian prediction learning is evident, and these responses can act against the instrumental interests of the subjects. This can be seen in both experimental and natural circumstances. In this paper we study the consequences of importing this competition into a reinforcement learning context, and demonstrate the resulting effects in an omission schedule and a maze navigation task. The misbehavior created by Pavlovian values can be quite debilitating; we discuss how it may be disciplined.
Niv, Y., Edlund, J. A., Dayan, P. D., & O'Doherty, J. P. (2006). Neural correlates of risk-sensitivity: An fMRI study of instrumental choice behavior. In Society for Neuroscience Abstracts.
Niv, Y., Joel, D., & Dayan, P. D. (2006). A normative perspective on motivation. Trends in Cognitive Sciences, 10(8), 375–381.
Understanding the effects of motivation on instrumental action selection, and specifically on its two main forms, goal-directed and habitual control, is fundamental to the study of decision making. Motivational states have been shown to 'direct' goal-directed behavior rather straightforwardly towards more valuable outcomes. However, how motivational states can influence outcome-insensitive habitual behavior is more mysterious. We adopt a normative perspective, assuming that animals seek to maximize the utilities they achieve, and viewing motivation as a mapping from outcomes to utilities. We suggest that habitual action selection can direct responding properly only in motivational states which pertained during behavioral training. However, in novel states, we propose that outcome-independent, global effects of the utilities can 'energize' habitual actions.
2005
Niv, Y., Duff, M. O., & Dayan, P. D. (2005). Dopamine, uncertainty and TD learning. Behavioral and Brain Functions, 1, 6.
Substantial evidence suggests that the phasic activities of dopaminergic neurons in the primate midbrain represent a temporal difference (TD) error in predictions of future reward, with increases above and decreases below baseline consequent on positive and negative prediction errors, respectively. However, dopamine cells have very low baseline activity, which implies that the representation of these two sorts of error is asymmetric. We explore the implications of this seemingly innocuous asymmetry for the interpretation of dopaminergic firing patterns in experiments with probabilistic rewards, which bring about persistent prediction errors. In particular, we show that when averaging the non-stationary prediction errors across trials, a ramping in the activity of the dopamine neurons should be apparent, whose magnitude is dependent on the learning rate. This exact phenomenon was observed in a recent experiment, although it was interpreted there in antipodal terms as a within-trial encoding of uncertainty.
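The averaging argument can be reproduced in a few lines. In the simulation below (a hypothetical ten-state trial with reward delivered with probability 0.5, tabular TD(0), and negative errors compressed to mimic the limited range below the low dopamine baseline), the trial-averaged signal ramps toward the time of reward, and the ramp grows with the learning rate. This is a sketch of the mechanism, not the paper's exact model or parameters.

```python
import numpy as np

def mean_da_signal(alpha, p_reward=0.5, n_states=10, n_trials=20000,
                   neg_scaling=6.0, seed=3):
    """Average asymmetrically represented TD errors in a probabilistic-reward task.

    Negative prediction errors are divided by `neg_scaling` before averaging,
    mimicking the small dynamic range below the low dopamine baseline rate.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 1)
    da = np.zeros(n_states)
    count = 0
    for t in range(n_trials):
        rewarded = rng.random() < p_reward
        for s in range(n_states):
            r = 1.0 if (rewarded and s == n_states - 1) else 0.0
            delta = r + V[s + 1] - V[s]
            V[s] += alpha * delta
            if t > n_trials // 2:          # average after learning has stabilized
                da[s] += delta if delta > 0 else delta / neg_scaling
        if t > n_trials // 2:
            count += 1
    return da / count

for alpha in (0.05, 0.2):
    print(f"alpha={alpha:.2f}  mean signal per within-trial state:",
          np.round(mean_da_signal(alpha), 3))
```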
Niv, Y., Daw, N. D., & Dayan, P. D. (2005). How fast to work: response vigor, motivation and tonic dopamine. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Neural Information Processing Systems (Vol. 18, pp. 1019–1026). MIT Press.
Reinforcement learning models have long promised to unify computational, psychological and neural accounts of appetitively conditioned behavior. However, the bulk of data on animal conditioning comes from free-operant experiments measuring how fast animals will work for reinforcement. Existing reinforcement learning (RL) models are silent about these tasks, because they lack any notion of vigor. They thus fail to address the simple observation that hungrier animals will work harder for food, as well as stranger facts such as their sometimes greater productivity even when working for irrelevant outcomes such as water. Here, we develop an RL framework for free-operant behavior, suggesting that subjects choose how vigorously to perform selected actions by optimally balancing the costs and benefits of quick responding. Motivational states such as hunger shift these factors, skewing the tradeoff. This accounts normatively for the effects of motivation on response rates, as well as many other classic findings. Finally, we suggest that tonic levels of dopamine may be involved in the computation linking motivational state to optimal responding, thereby explaining the complex vigor-related effects of pharmacological manipulation of dopamine.
Niv, Y., Daw, N. D., Joel, D., & Dayan, P. D. (2005). Motivational effects on behavior: Towards a reinforcement learning model of rates of responding. In CoSyNe. Salt Lake City, Utah.
Daw, N. D., Niv, Y., & Dayan, P. D. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12), 1704–1711.
A broad range of neural and behavioral data suggests that the brain contains multiple systems for behavioral choice, including one associated with prefrontal cortex and another with dorsolateral striatum. However, such a surfeit of control raises an additional choice problem: how to arbitrate between the systems when they disagree. Here, we consider dual-action choice systems from a normative perspective, using the computational theory of reinforcement learning. We identify a key trade-off pitting computational simplicity against the flexible and statistically efficient use of experience. The trade-off is realized in a competition between the dorsolateral striatal and prefrontal systems. We suggest a Bayesian principle of arbitration between them according to uncertainty, so each controller is deployed when it should be most accurate. This provides a unifying account of a wealth of experimental evidence about the factors favoring dominance by either system.
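The arbitration principle can be caricatured in a few lines: each controller reports a value estimate together with its uncertainty, and control goes to whichever system is currently more certain. The sketch below is only an illustration of that principle with made-up numbers, not the Bayesian model developed in the paper.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    value: float      # predicted action value
    variance: float   # the controller's uncertainty about that prediction

def arbitrate(model_based: Estimate, model_free: Estimate) -> tuple[str, float]:
    """Hand control to whichever system is currently the more certain one."""
    if model_based.variance <= model_free.variance:
        return "prefrontal / model-based", model_based.value
    return "dorsolateral striatal / model-free", model_free.value

# Early in training the cached (model-free) values are still noisy, so the
# tree-search estimate wins; after extensive training the cache is precise.
early = arbitrate(Estimate(0.8, 0.05), Estimate(0.6, 0.40))
late  = arbitrate(Estimate(0.8, 0.05), Estimate(0.75, 0.01))
print(early)
print(late)
```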
2002
Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: new anatomical and computational perspectives. Neural Networks, 15(4–6), 535–547.
A large number of computational models of information processing in the basal ganglia have been developed in recent years. Prominent in these are actor-critic models of basal ganglia functioning, which build on the strong resemblance between dopamine neuron activity and the temporal difference prediction error signal in the critic, and between dopamine-dependent long-term synaptic plasticity in the striatum and learning guided by a prediction error signal in the actor. We selectively review several actor-critic models of the basal ganglia with an emphasis on two important aspects: the way in which models of the critic reproduce the temporal dynamics of dopamine firing, and the extent to which models of the actor take into account known basal ganglia anatomy and physiology. To complement the efforts to relate basal ganglia mechanisms to reinforcement learning (RL), we introduce an alternative approach to modeling a critic network, which uses Evolutionary Computation techniques to 'evolve' an optimal RL mechanism, and relate the evolved mechanism to the basic model of the critic. We conclude our discussion of models of the critic with a critical discussion of the anatomical plausibility of implementations of a critic in basal ganglia circuitry, and conclude that such implementations build on assumptions that are inconsistent with the known anatomy of the basal ganglia. We return to the actor component of the actor-critic model, which is usually modeled at the striatal level with very little detail. We describe an alternative model of the basal ganglia which takes into account several important, and previously neglected, anatomical and physiological characteristics of basal ganglia-thalamocortical connectivity and suggests that the basal ganglia performs reinforcement-biased dimensionality reduction of cortical inputs. We further suggest that since such selective encoding may bias the representation at the level of the frontal cortex towards the selection of rewarded plans and actions, the reinforcement-driven dimensionality reduction framework may serve as a basis for basal ganglia actor models. We conclude with a short discussion of the dual role of the dopamine signal in RL and in behavioral switching.
Niv, Y., Joel, D., Meilijson, I., & Ruppin, E. (2002). Evolution of reinforcement learning in foraging bees: A simple explanation for risk averse behavior. Neurocomputing, 44–46, 951–956.
Reinforcement learning is a fundamental process by which organisms learn to achieve goals from their interactions with the environment. We use evolutionary computation techniques to derive (near-)optimal neuronal learning rules in a simple neural network model of decision-making in simulated bumblebees foraging for nectar. The resulting bees exhibit efficient reinforcement learning. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels and to the well-documented foraging strategy of risk aversion. This behavior is shown to emerge directly from optimal reinforcement learning, providing a biologically founded, parsimonious and novel explanation of risk-averse behavior.
Niv, Y., Joel, D., Meilijson, I., & Ruppin, E. (2002). Evolution of Reinforcement Learning in Uncertain Environments: A Simple Explanation for Complex Foraging Behaviors. Adaptive Behavior, 10(1), 5–24.
Reinforcement learning is a fundamental process by which organisms learn to achieve goals from their interactions with the environment. Using evolutionary computation techniques we evolve (near-)optimal neuronal learning rules in a simple neural network model of reinforcement learning in bumblebees foraging for nectar. The resulting neural networks exhibit efficient reinforcement learning, allowing the bees to respond rapidly to changes in reward contingencies. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels and to the well-documented choice strategies of risk aversion and probability matching. Additionally, risk aversion is shown to emerge even when bees are evolved in a completely risk-less environment. In contrast to existing theories in economics and game theory, risk-averse behavior is shown to be a direct consequence of (near-)optimal reinforcement learning, without requiring additional assumptions such as the existence of a nonlinear subjective utility function for rewards. Our results are corroborated by a rigorous mathematical analysis, and their robustness in real-world situations is supported by experiments in a mobile robot. Thus we provide a biologically founded, parsimonious, and novel explanation for risk aversion and probability matching.
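One way to see how risk aversion can fall out of reinforcement learning itself, without a nonlinear utility function: a bee that tracks each flower's recent payoffs and chooses by softmax ends up visiting an equal-mean but variable flower less than half the time, and more so the faster it learns. The simulation below illustrates this mechanism in stripped-down form; it is not the evolved neural network of the paper, and the parameters are hypothetical.

```python
import numpy as np

def bee_choice_sim(alpha, beta=5.0, n_visits=100000, seed=4):
    """Softmax RL bee choosing between a constant and an equal-mean variable flower."""
    rng = np.random.default_rng(seed)
    v_const, v_var = 0.5, 0.5            # learned nectar estimates
    n_var = 0
    for _ in range(n_visits):
        p_var = 1.0 / (1.0 + np.exp(-beta * (v_var - v_const)))
        if rng.random() < p_var:          # visit the variable flower
            n_var += 1
            nectar = 1.0 if rng.random() < 0.5 else 0.0   # mean 0.5, high variance
            v_var += alpha * (nectar - v_var)
        else:                             # visit the constant flower (always 0.5)
            v_const += alpha * (0.5 - v_const)
    return n_var / n_visits

for alpha in (0.1, 0.5, 0.9):
    print(f"alpha={alpha:.1f}  P(choose variable flower) = {bee_choice_sim(alpha):.3f}")
```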
2001
Niv, Y., Joel, D., Meilijson, I., & Ruppin, E. (2001). Evolution of Reinforcement Learning in Uncertain Environments: Emergence of Risk-Aversion and Matching. Tel-Aviv University.
Reinforcement learning (RL) is a fundamental process by which organisms learn to achieve a goal from interactions with the environment. Using Artificial Life techniques we derive (near-)optimal neuronal learning rules in a simple neural network model of decision-making in simulated bumblebees foraging for nectar. The resulting networks exhibit efficient RL, allowing the bees to respond rapidly to changes in reward contingencies. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels from which emerge the well-documented foraging strategies of risk aversion and probability matching. These are shown to be a direct result of optimal RL, providing a biologically founded, parsimonious and novel explanation for these behaviors. Our results are corroborated by a rigorous mathematical analysis and by experiments in mobile robots.
