Publications

Starting in 2022, the lab posts here only the archival, open-access version of our publications. This is part of the movement to emphasize quality and content over the impact factor or prestige of the journal a paper is published in. Full citations (for referencing papers in your own work) can be found on PubMed and/or within the archival version, which will be updated once a paper is accepted for publication after peer review.

Asterisk (*) denotes equal contribution

2011

Eldar, E., Morris, G., & Niv, Y. (2011). The effects of motivation on response rate: A hidden semi-Markov model analysis of behavioral dynamics. Journal of Neuroscience Methods, 201(1), 251–261. https://doi.org/10.1016/j.jneumeth.2011.06.028
A central goal of neuroscience is to understand how neural dynamics bring about the dynamics of behavior. However, neural and behavioral measures are noisy, requiring averaging over trials and subjects. Unfortunately, averaging can obscure the very dynamics that we are interested in, masking abrupt changes and artificially creating gradual processes. We develop a hidden semi-Markov model for precisely characterizing dynamic processes and their alteration due to experimental manipulations. This method takes advantage of multiple trials and subjects without compromising the information available in individual events within a trial. We apply our model to studying the effects of motivation on response rates, analyzing data from hungry and sated rats trained to press a lever to obtain food rewards on a free-operant schedule. Our method can accurately account for punctate changes in the rate of responding and for sequential dependencies between responses. It is ideal for inferring the statistics of underlying response rates and the probability of switching from one response rate to another. Using the model, we show that hungry rats have more distinct behavioral states that are characterized by high rates of responding and they spend more time in these high-press-rate states. Moreover, hungry rats spend less time in, and have fewer distinct states that are characterized by a lack of responding (Waiting/Eating states). These results demonstrate the utility of our analysis method, and provide a precise quantification of the effects of motivation on response rates. ©2011 Elsevier B.V.
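To make the modeling approach above concrete, here is a minimal generative sketch of a hidden semi-Markov process over lever-pressing states, with explicitly modeled (non-geometric) dwell-time distributions. All states, rates, and distributions below are hypothetical illustrations, not the fitted model or analysis code from the paper.

```python
# Illustrative generative sketch of a hidden semi-Markov process for lever pressing
# (hypothetical parameters; not the authors' actual model or fitting code).
import numpy as np

rng = np.random.default_rng(0)

# Three latent behavioral states: high-press-rate, low-press-rate, waiting/eating.
press_rates = np.array([2.0, 0.5, 0.0])        # presses per second in each state
dwell_means = np.array([8.0, 5.0, 12.0])       # mean dwell time (s); durations are
                                               # modeled explicitly, unlike in an HMM
transitions = np.array([[0.0, 0.6, 0.4],       # P(next state | current state);
                        [0.7, 0.0, 0.3],       # no self-transitions, so leaving a
                        [0.5, 0.5, 0.0]])      # state always means switching

def simulate(total_time=300.0):
    """Sample states, dwell times, and press times for one simulated session."""
    t, state = 0.0, rng.integers(3)
    presses = []
    while t < total_time:
        dwell = rng.gamma(shape=2.0, scale=dwell_means[state] / 2.0)  # non-geometric duration
        n = rng.poisson(press_rates[state] * dwell)                   # presses emitted in this state
        presses.extend(np.sort(t + rng.uniform(0.0, dwell, n)))
        t += dwell
        state = rng.choice(3, p=transitions[state])
    return np.array(presses)

print(f"simulated {len(simulate())} presses")
```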
McDannald, M., Lucantonio, F., Burke, K., Niv, Y., & Schoenbaum, G. (2011). Ventral Striatum and Orbitofrontal Cortex Are Both Required for Model-Based, But Not Model-Free, Reinforcement Learning. Journal of Neuroscience, 31(7), 2700–2705. https://doi.org/10.1523/JNEUROSCI.5499-10.2011 (Original work published 2011)
In many cases, learning is thought to be driven by differences between the value of rewards we expect and rewards we actually receive. Yet learning can also occur when the identity of the reward we receive is not as expected, even if its value remains unchanged. Learning from changes in reward identity implies access to an internal model of the environment, from which information about the identity of the expected reward can be derived. As a result, such learning is not easily accounted for by model-free reinforcement learning theories such as temporal difference reinforcement learning (TDRL), which predicate learning on changes in reward value, but not identity. Here, we used unblocking procedures to assess learning driven by value- versus identity-based prediction errors. Rats were trained to associate distinct visual cues with different food quantities and identities. These cues were subsequently presented in compound with novel auditory cues and the reward quantity or identity was selectively changed. Unblocking was assessed by presenting the auditory cues alone in a probe test. Consistent with neural implementations of TDRL models, we found that the ventral striatum was necessary for learning in response to changes in reward value. However, this area, along with orbitofrontal cortex, was also required for learning driven by changes in reward identity. This observation requires that existing models of TDRL in the ventral striatum be modified to include information about the specific features of expected outcomes derived from model-based representations, and that the role of orbitofrontal cortex in these models be clearly delineated.
Ribas-Fernandes, J., Solway, A., Diuk, C., McGuire, J., Barto, A., Niv, Y., & Botvinick, M. (2011). A neural signature of hierarchical reinforcement learning. Neuron, 71(2), 370–379. https://doi.org/10.1016/j.neuron.2011.05.042 (Original work published 2011)
Human behavior displays hierarchical structure: simple actions cohere into subtask sequences, which work together to accomplish overall task goals. Although the neural substrates of such hierarchy have been the target of increasing research, they remain poorly understood. We propose that the computations supporting hierarchical behavior may relate to those in hierarchical reinforcement learning (HRL), a machine-learning framework that extends reinforcement-learning mechanisms into hierarchical domains. To test this, we leveraged a distinctive prediction arising from HRL. In ordinary reinforcement learning, reward prediction errors are computed when there is an unanticipated change in the prospects for accomplishing overall task goals. HRL entails that prediction errors should also occur in relation to task subgoals. In three neuroimaging studies we observed neural responses consistent with such subgoal-related reward prediction errors, within structures previously implicated in reinforcement learning. The results reported support the relevance of HRL to the neural processes underlying hierarchical behavior.

2010

Dayan, P., Daw, N., & Niv, Y. (2010). Learning, Action, Inference and Neuromodulation. In Encyclopedia of Neuroscience (pp. 455–462). https://doi.org/10.1016/B978-008045046-9.01415-7
Neuromodulators such as dopamine, serotonin, norepinephrine, and acetylcholine play a critical role in controlling the learning about the association between stimuli and actions and rewards and punishments, and also in regulating the way that networks process information and effect decisions. Precise characterization of at least one part of their function in these respects comes from theories originating in statistics, operations research, and engineering, and computational theory ties together neural, psychological, ethological, and even economic notions. ©2009 Elsevier Ltd All rights reserved.
Diuk, C., Botvinick, M., Barto, A., & Niv, Y. (2010). Hierarchical reinforcement learning: An fMRI study of learning in a two-level gambling task. Neuroscience Meeting Planner.
Todd, M., Cohen, J., & Niv, Y. (2010). Identifying internal representations of context in fMRI. Society for Neuroscience Abstracts.
Niv, Y., & Gershman, S. (2010). Representation Learning and Reinforcement Learning: An fMRI study of learning to selectively attend. Society for Neuroscience Abstracts.
Gershman, S., Cohen, J., & Niv, Y. (2010). Learning to selectively attend. 32nd Annual Conference of the Cognitive Science Society. PDF: Learning to selectively attend
How is reinforcement learning possible in a high-dimensional world? Without making any assumptions about the structure of the state space, the amount of data required to effectively learn a value function grows exponentially with the state space’s dimensionality. However, humans learn to solve high-dimensional problems much more rapidly than would be expected under this scenario. This suggests that humans employ inductive biases to guide (and accelerate) their learning. Here we propose one particular bias—sparsity—that ameliorates the computational challenges posed by high-dimensional state spaces, and present experimental evidence that humans can exploit sparsity information when it is available.
Gershman, S., & Niv, Y. (2010). Learning latent structure: carving nature at its joints. Curr Opin Neurobiol, 20(2), 251–256. https://doi.org/10.1016/j.conb.2010.02.008 (Original work published 2010)
Reinforcement learning (RL) algorithms provide powerful explanations for simple learning and decision-making behaviors and the functions of their underlying neural substrates. Unfortunately, in real-world situations that involve many stimuli and actions, these algorithms learn pitifully slowly, exposing their inferiority in comparison to animal and human learning. Here we suggest that one reason for this discrepancy is that humans and animals take advantage of structure that is inherent in real-world tasks to simplify the learning problem. We survey an emerging literature on 'structure learning'–using experience to infer the structure of a task–and how this can be of service to RL, with an emphasis on structure in perception and action.
Both the orbitofrontal cortex (OFC) and ventral striatum (VS) have been implicated in signalling reward expectancies. However, exactly what roles these two disparate structures play, and how they are different, is very much an open question. Recent results from the Schoenbaum lab (Takahashi et al., this meeting) describing the detailed effect of OFC lesions on putative reward prediction error signalling by midbrain dopaminergic neurons of rats point to one possible delineation. Here we describe a reinforcement learning (RL) model of the Takahashi et al. results that suggests related, but slightly different, roles for the OFC and VS in signalling reward expectancies. We present an actor/critic model with one actor (putatively the dorsal striatum) and two critics (OFC and VS). We hypothesise that the VS critic learns state values relatively slowly and in a model-free way, while OFC learns state values faster and in a model-based way, using one-step look-ahead. Both areas contribute to a single prediction error signal, computed in the ventral tegmental area (VTA), that is used to teach both critics and the actor. As they receive the same teaching signal, the critics, OFC and VS, essentially compete for the value of each state. Our model makes a number of predictions regarding the effects of OFC and VS lesions on the response properties of dopaminergic (putatively prediction error encoding) neurons in VTA. The model predicts that lesions to either VS or OFC result in persistent prediction errors to predictable rewards and diminished prediction errors on the omission of predictable rewards. At the time of a reward-predicting cue, the model predicts that these lesions cause both positive and negative prediction errors to be diminished. When the animal is free to choose between a high and low valued option, we predict a difference between the effects of OFC and VS lesions. In the “unlesioned” model, because of the proposed look-ahead abilities of OFC, the model predicts differential signals at the time at which the decision is made, corresponding to whether the high or low valued option has been chosen. When the model-OFC is “lesioned”, however, these differential signals disappear, as the model is no longer aware of the decision that will be made. This is not the case when model-VS is “lesioned”, in which case the difference between high and low valued options persists. These predictions regarding OFC lesions are borne out in Takahashi et al.'s experiments on rats.
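A schematic reduction of the two-critic idea sketched in the abstract above, in which a ventral-striatal and an OFC critic are taught by, and jointly determine, a single prediction error. The learning rates, the use of a faster rate as a stand-in for OFC's model-based look-ahead, and all other details are our own simplifying assumptions, not the model as presented.

```python
# Minimal sketch of two critics sharing one prediction error (illustrative only;
# parameters and structure are assumptions, not the published model).
import numpy as np

n_states = 5
V_vs  = np.zeros(n_states)   # ventral-striatum critic: slow, model-free
V_ofc = np.zeros(n_states)   # OFC critic: a faster learning rate stands in here
                             # for its hypothesized model-based advantage
alpha_vs, alpha_ofc, gamma = 0.05, 0.3, 0.95

def td_update(s, r, s_next, terminal=False):
    """One shared TD error teaches both critics, so they compete for state value."""
    v      = V_vs[s] + V_ofc[s]
    v_next = 0.0 if terminal else V_vs[s_next] + V_ofc[s_next]
    delta  = r + gamma * v_next - v          # putative VTA prediction error
    V_vs[s]  += alpha_vs  * delta
    V_ofc[s] += alpha_ofc * delta
    return delta

# "Lesioning" one critic (e.g., V_ofc[:] = 0.0 and alpha_ofc = 0.0) leaves residual
# prediction errors to predictable rewards, in the spirit of the lesion predictions above.
```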
Gershman, S., Blei, D., & Niv, Y. (2010). Context, Learning, and Extinction. Psychological Review, 117(1), 197–209. https://doi.org/10.1037/a0017808 (Original work published 2010)
A. Redish et al. (2007) proposed a reinforcement learning model of context-dependent learning and extinction in conditioning experiments, using the idea of "state classification" to categorize new observations into states. In the current article, the authors propose an interpretation of this idea in terms of normative statistical inference. They focus on renewal and latent inhibition, 2 conditioning paradigms in which contextual manipulations have been studied extensively, and show that online Bayesian inference within a model that assumes an unbounded number of latent causes can characterize a diverse set of behavioral results from such manipulations, some of which pose problems for the model of Redish et al. Moreover, in both paradigms, context dependence is absent in younger animals, or if hippocampal lesions are made prior to training. The authors suggest an explanation in terms of a restricted capacity to infer new causes.

2009

Niv, Y., & Montague, R. (2009). Theoretical and Empirical Studies of Learning. Neuroeconomics, 331–351. https://doi.org/10.1016/B978-0-12-374176-9.00022-1
This chapter introduces the reinforcement learning framework and gives a brief background to the origins and history of reinforcement learning models of decision-making. Reinforcement learning provides a normative framework within which conditioning can be analyzed. That is, it suggests a means by which optimal prediction and action selection can be achieved, and exposes explicitly the computations that must be realized in the service of these. In contrast to descriptive models that describe behavior as it is, normative models study behavior from the point of view of its hypothesized function; that is, they study behavior as it should be if it were to accomplish specific goals in an optimal way. The appeal of normative models derives from several sources. Historically, the core ideas in reinforcement learning arose from two separate and parallel lines of research. One axis is mainly associated with Richard Sutton, formerly an undergraduate psychology major, and his PhD advisor, Andrew Barto, a computer scientist. Interested in artificial intelligence and agent-based learning, Sutton and Barto developed algorithms for reinforcement learning that were inspired by the psychological literature on Pavlovian and instrumental conditioning. ©2009 Elsevier Inc. All rights reserved.
Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3), 139–154. PDF: Reinforcement learning in the brain
A wealth of research focuses on the decision-making processes that animals and humans employ when selecting actions in the face of reward and punishment. Initially such work stemmed from psychological investigations of conditioned behavior, and explanations of these in terms of computational models. Increasingly, analysis at the computational level has drawn on ideas from reinforcement learning, which provide a normative framework within which decision-making can be analyzed. More recently, the fruits of these extensive lines of research have made contact with investigations into the neural basis of decision making. Converging evidence now links reinforcement learning to specific neural substrates, assigning them precise computational roles. Specifically, electrophysiological recordings in behaving animals and functional imaging of human decision-making have revealed in the brain the existence of a key reinforcement learning signal, the temporal difference reward prediction error. Here, we first introduce the formal reinforcement learning framework. We then review the multiple lines of evidence linking reinforcement learning to the function of dopaminergic neurons in the mammalian midbrain and to more recent data from human imaging experiments. We further extend the discussion to aspects of learning not associated with phasic dopamine signals, such as learning of goal-directed responding that may not be dopamine-dependent, and learning about the vigor (or rate) with which actions should be performed that has been linked to tonic aspects of dopaminergic signaling. We end with a brief discussion of some of the limitations of the reinforcement learning framework, highlighting questions for future research.
Todd, M., Niv, Y., & Cohen, J. (2009). Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement. Advances in Neural Information Processing Systems 21, 1689–1696.
Working memory is a central topic of cognitive neuroscience because it is critical for solving real-world problems in which information from multiple temporally distant sources must be combined to generate appropriate behavior. However, an often neglected fact is that learning to use working memory effectively is itself a difficult problem. The Gating framework is a collection of psychological models that show how dopamine can train the basal ganglia and prefrontal cortex to form useful working memory representations in certain types of problems. We unite Gating with machine learning theory concerning the general problem of memory-based optimal control. We present a normative model that learns, by online temporal difference methods, to use working memory to maximize discounted future reward in partially observable settings. The model successfully solves a benchmark working memory problem, and exhibits limitations similar to those observed in humans. Our purpose is to introduce a concise, normative definition of high level cognitive concepts such as working memory and cognitive control in terms of maximizing discounted future rewards.
Botvinick, M., Niv, Y., & Barto, A. (2009). Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3), 262–280. https://doi.org/10.1016/j.cognition.2008.08.011 (Original work published 2009)
Research on human and animal behavior has long emphasized its hierarchical structure: the divisibility of ongoing behavior into discrete tasks, which are comprised of subtask sequences, which in turn are built of simple actions. The hierarchical structure of behavior has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. In this paper, we reexamine behavioral hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning. Specifically, we consider a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills. A close look at the components of hierarchical reinforcement learning suggests how they might map onto neural structures, in particular regions within the dorsolateral and orbital prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement learning might provide a complement to existing psychological models of hierarchically structured behavior. A particularly important question that hierarchical reinforcement learning brings to the fore is that of how learning identifies new action routines that are likely to provide useful building blocks in solving a wide range of future problems. Here and at many other points, hierarchical reinforcement learning offers an appealing framework for investigating the computational and neural underpinnings of hierarchically structured behavior.

2008

Dayan, P., & Niv, Y. (2008). Reinforcement learning: The Good, The Bad and The Ugly. Current Opinion in Neurobiology, 18(2), 185–196. https://doi.org/10.1016/j.conb.2008.08.003
Reinforcement learning provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making in the face of rewards and punishments. Here we review the latest dispatches from the forefront of this field, and map out some of the territories where lie monsters. ©2008 Elsevier Ltd. All rights reserved.
Niv, Y., & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12(7), 265–272. https://doi.org/10.1016/j.tics.2008.03.006
The recognition that computational ideas from reinforcement learning are relevant to the study of neural circuits has taken the cognitive neuroscience community by storm. A central tenet of these models is that discrepancies between actual and expected outcomes can be used for learning. Neural correlates of such prediction-error signals have been observed now in midbrain dopaminergic neurons, striatum, amygdala and even prefrontal cortex, and models incorporating prediction errors have been invoked to explain complex phenomena such as the transition from goal-directed to habitual behavior. Yet, like any revolution, the fast-paced progress has left an uneven understanding in its wake. Here, we provide answers to ten simple questions about prediction errors, with the aim of exposing both the strengths and the limitations of this active area of neuroscience research. ©2008 Elsevier Ltd. All rights reserved.
Schiller, D., Levy, I., Niv, Y., LeDoux, J., & Phelps, E. (2008). From Fear to Safety and Back: Reversal of Fear in the Human Brain. Journal of Neuroscience, 28(45), 11517–11525. https://doi.org/10.1523/JNEUROSCI.2265-08.2008
Fear learning is a rapid and persistent process that promotes defense against threats and reduces the need to relearn about danger. However, it is also important to flexibly readjust fear behavior when circumstances change. Indeed, a failure to adjust to changing conditions may contribute to anxiety disorders. A central, yet neglected aspect of fear modulation is the ability to flexibly shift fear responses from one stimulus to another if a once-threatening stimulus becomes safe or a once-safe stimulus becomes threatening. In these situations, the inhibition of fear and the development of fear reactions co-occur but are directed at different targets, requiring accurate responding under continuous stress. To date, research on fear modulation has focused mainly on the shift from fear to safety by using paradigms such as extinction, resulting in a reduction of fear. The aim of the present study was to track the dynamic shifts from fear to safety and from safety to fear when these transitions occur simultaneously. We used functional neuroimaging in conjunction with a fear-conditioning reversal paradigm. Our results reveal a unique dissociation within the ventromedial prefrontal cortex between a safe stimulus that previously predicted danger and a "naive" safe stimulus. We show that amygdala and striatal responses tracked the fear-predictive stimuli, flexibly flipping their responses from one predictive stimulus to another. Moreover, prediction errors associated with reversal learning correlated with striatal activation. These results elucidate how fear is readjusted to appropriately track environmental changes, and the brain mechanisms underlying the flexible control of fear.
Takahashi, Y., Schoenbaum, G., & Niv, Y. (2008). Silencing the critics: understanding the effects of cocaine sensitization on dorsolateral and ventral striatum in the context of an actor/critic model. Frontiers in Neuroscience, 2(1), 86–99. https://doi.org/10.3389/neuro.01.014.2008 (Original work published 2008)
A critical problem in daily decision making is how to choose actions now in order to bring about rewards later. Indeed, many of our actions have long-term consequences, and it is important to not be myopic in balancing the pros and cons of different options, but rather to take into account both immediate and delayed consequences of actions. Failures to do so may be manifest as persistent, maladaptive decision-making, one example of which is addiction where behavior seems to be driven by the immediate positive experiences with drugs, despite the delayed adverse consequences. A recent study by Takahashi et al. (2007) investigated the effects of cocaine sensitization on decision making in rats and showed that drug use resulted in altered representations in the ventral striatum and the dorsolateral striatum, areas that have been implicated in the neural instantiation of a computational solution to optimal long-term actions selection called the Actor/Critic framework. In this Focus article we discuss their results and offer a computational interpretation in terms of drug-induced impairments in the Critic. We first survey the different lines of evidence linking the subparts of the striatum to the Actor/Critic framework, and then suggest two possible scenarios of breakdown that are suggested by Takahashi et al.'s (2007) data. As both are compatible with the current data, we discuss their different predictions and how these could be empirically tested in order to further elucidate (and hopefully inch towards curing) the neural basis of drug addiction.

2007

Niv, Y., Daw, N., Joel, D., & Dayan, P. (2007). Tonic dopamine: Opportunity costs and the control of response vigor. Psychopharmacology, 191(3), 507–520. https://doi.org/10.1007/s00213-006-0502-4
RATIONALE: Dopamine neurotransmission has long been known to exert a powerful influence over the vigor, strength, or rate of responding. However, there exists no clear understanding of the computational foundation for this effect; predominant accounts of dopamine's computational function focus on a role for phasic dopamine in controlling the discrete selection between different actions and have nothing to say about response vigor or indeed the free-operant tasks in which it is typically measured. OBJECTIVES: We seek to accommodate free-operant behavioral tasks within the realm of models of optimal control and thereby capture how dopaminergic and motivational manipulations affect response vigor. METHODS: We construct an average reward reinforcement learning model in which subjects choose both which action to perform and also the latency with which to perform it. Optimal control balances the costs of acting quickly against the benefits of getting reward earlier and thereby chooses a best response latency. RESULTS: In this framework, the long-run average rate of reward plays a key role as an opportunity cost and mediates motivational influences on rates and vigor of responding. We review evidence suggesting that the average reward rate is reported by tonic levels of dopamine putatively in the nucleus accumbens. CONCLUSIONS: Our extension of reinforcement learning models to free-operant tasks unites psychologically and computationally inspired ideas about the role of tonic dopamine in striatum, explaining from a normative point of view why higher levels of dopamine might be associated with more vigorous responding.
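In simplified form (our notation, a stripped-down version of the kind of trade-off described above rather than the paper's full average-reward formulation), the choice of response latency can be worked out as follows:

```latex
% A stripped-down version of the vigor trade-off (notation and simplifications ours):
% choose the latency \tau of a single response to maximize
\[
  G(\tau) \;=\; E[r] \;-\; C_u \;-\; \frac{C_v}{\tau} \;-\; \bar{R}\,\tau ,
\]
% where C_u is a fixed cost per response, C_v/\tau a vigor cost that grows the faster
% the response is emitted, and \bar{R} the long-run average reward rate, i.e., the
% opportunity cost of each second spent on this response. Setting dG/d\tau = 0:
\[
  \frac{C_v}{\tau^{2}} \;-\; \bar{R} \;=\; 0
  \qquad\Longrightarrow\qquad
  \tau^{*} \;=\; \sqrt{\,C_v / \bar{R}\,},
\]
% so a higher average reward rate (hypothesized to be reported by tonic dopamine)
% shortens optimal latencies, i.e., energizes responding across the board.
```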
Niv, Y., & Rivlin-Etzion, M. (2007). Parkinson’s Disease: Fighting the Will? Journal of Neuroscience, 27(44), 11777–11779. https://doi.org/10.1523/JNEUROSCI.4010-07.2007 (Original work published 2007)
A phenomenon familiar to clinicians treating patients with Parkinson's disease (PD) is kinesia paradoxica: astonishing displays of sudden mobility and agility by otherwise akinetic PD patients in instances of emergency (for instance,
Niv, Y. (2007). Cost, benefit, tonic, phasic: What do response rates tell us about dopamine and motivation? Annals of the New York Academy of Sciences, 1104, 357–376. https://doi.org/10.1196/annals.1390.018
The role of dopamine in decision making has received much attention from both the experimental and computational communities. However, because reinforcement learning models concentrate on discrete action selection and on phasic dopamine signals, they are silent as to how animals decide upon the rate of their actions, and they fail to account for the prominent effects of dopamine on response rates. We suggest an extension to reinforcement learning models in which response rates are optimally determined by balancing the tradeoff between the cost of fast responding and the benefit of rapid reward acquisition. The resulting behavior conforms well with numerous characteristics of free-operant responding. More importantly, this framework highlights a role for a tonic signal corresponding to the net rate of rewards, in determining the optimal rate of responding. We hypothesize that this critical quantity is conveyed by tonic levels of dopamine, explaining why dopaminergic manipulations exert a global effect on response rates. We further suggest that the effects of motivation on instrumental rates of responding are mediated through its influence on the net reward rate, implying a tight coupling between motivational states and tonic dopamine. The relationships between phasic and tonic dopamine signaling, and between directing and energizing effects of motivation, as well as the implications for motivational control of habitual and goal-directed instrumental action selection, are discussed. ©2007 New York Academy of Sciences.
Niv, Y. (2007). The effects of motivation on habitual instrumental behavior (pp. 478–525) [The Hebrew University of Jerusalem]. https://doi.org/10.1037/10790-013
This thesis provides a normative computational analysis of how motivation affects decision making. More specifically, we provide a reinforcement learning model of optimal self-paced (free-operant) learning and behavior, and use it to address three broad classes of questions: (1) Why do animals work harder in some instrumental tasks than in others? (2) How do motivational states affect responding in such tasks, particularly in those cases in which behavior is habitual, that is, when responding is insensitive to changes in the specific worth of its goals, such as a higher value of food when hungry rather than sated? and (3) Why do dopaminergic manipulations cause global changes in the vigor of responding, and how is this related to prominent accounts of the role of dopamine in providing basal ganglia and frontal cortical areas with a reward prediction error signal that can be used for learning to choose between actions? A fundamental question in behavioral neuroscience concerns the decision-making processes by which animals and humans select actions in the face of reward and punishment. In Chapter 1 we provide a brief overview of the current status of this research, focused on three themes: behavior, computation and neural substrates. In behavioral psychology, this question has been investigated through the paradigms of Pavlovian (classical) and instrumental (operant) conditioning, and much evidence has accumulated regarding the associations that control different aspects of learned behavior. The computational field of reinforcement learning has provided a normative framework within which conditioned behavior can be understood. In this, optimal action selection is based on predictions of long-run future consequences, such that decision making is aimed at maximizing rewards and minimizing punishment. Neuroscientific evidence from lesion studies, pharmacological manipulations and electrophysiological recordings in behaving animals have further provided tentative links to neural structures underlying key computational constructs in these models. Most notably, much evidence suggests that the neuromodulator dopamine provides basal ganglia target structures with a reward prediction error that can influence learning and action selection, particularly in stimulus-driven habitual instrumental behavior. However, although reinforcement learning models have long promised to unify computational, psychological and neural accounts of appetitively conditioned behavior, we claim here that they suffer from a large theoretical oversight. While a bulk of data on animal conditioning comes from free-operant experiments measuring how fast animals will work for reinforcement, existing reinforcement learning models lack any notion of vigor or response rate, focusing instead only on competition between different responses, and so they are silent about these tasks. In Chapter 2 we first review the basic characteristics of free-operant behavior, illustrating the effects of reinforcement schedules on rates of responding. We then develop a reinforcement learning model in which vigor selection is optimized together with response selection. The model suggests that subjects choose how vigorously to perform selected actions by optimally balancing the costs and benefits of different speeds of responding.
Importantly, we show that this model accounts normatively for effects of reinforcement schedules on response rates, such as the fact that responding on ratio schedules is faster than responding on interval schedules that yield the same rate of reinforcement. Finally, the model highlights the importance of the net rate of rewards in quantifying the opportunity cost of time, and thus in determining response vigor. In Chapter 3 we flesh out the implications of this model for the motivational control of habitual behavior. In general, understanding the effects of motivation on instrumental action selection is fundamental to the study of decision making. Recent work has shown that motivational control can be used to divide instrumental behavior into two classes: ‘goal-directed' behavior is immediately sensitive to motivation-induced changes in the values of its specific consequences, while ‘habitual' behavior is not. Because habitual behavior constitutes a large proportion of our daily activities, it is thus important to ask how motivation affects habitual behavior. That is, how can habitual behavior be performed such as to achieve motivationally relevant goals? We start by defining motivation as a mapping from outcomes to utilities. Incorporating this into the computational framework of optimal response rates, we show that in general, the optimal effects of motivation on behavior should be two-fold: On the one hand, motivation should affect the choice between actions such that actions leading to those outcomes that are more highly valued are more probable. This corresponds to the traditional directing effect of motivation. On the other hand, by influencing the opportunity cost of time, motivation should affect the response rates of all chosen actions, irrespective of their specific outcomes, as in the decades-old (but controversial) idea that motivation energizes behavior. This global effect of motivation explains not only why hungry rats work harder for food, but also sheds light on the counterintuitive observation that they will sometimes work harder for water. Based on the computational view of habitual behavior as arising from cached values summarizing long-run reward predictions, we suggest that habitual action selection can direct responding properly only in those motivational states which pertained during behavioral training. However, this does not imply insensitivity to novel motivational states. In these, we propose that the outcome-independent, global effects of motivation can ‘energize' habitual actions, as a well-founded approximation to the optimal solution in a trained situation.
That is, habitual response rates can be adjusted to the current motivational state, in a way that is optimal given the computational limitations of the habitual system. Our computational framework suggests that the energizing function of motivation is mediated by the expected net rate of rewards. In Chapter 4, we put forth the hypothesis that this important quantity is reported by tonic levels of dopamine. Dopamine neurotransmission has long been known to exert a powerful influence over the vigor, strength or rate of responding. However, there exists no clear understanding of the computational foundation for this effect. Previous reinforcement learning models of habitual behavior have indeed suggested an interpretation of the function of dopaminergic signals in the brain. However, these have concentrated only on the role of precisely timed phasic dopaminergic signals in learning the predictive value of different actions, and have ignored both tonic dopamine transmission and response vigor. Our tonic dopamine hypothesis focuses on the involvement of dopamine in the control of vigor, explaining why higher levels of dopamine are associated with globally higher response rates, i.e., why, like motivation, dopamine ‘energizes' behavior. In this way, through the computational framework of optimal choice of response rates, we suggest an explanation of the motivational control of habitual behavior, on both the behavioral and the neural levels. Reinforcement learning models of animal learning are appealing not only because they provide a normative basis for decision-making, but also because they show that optimal action selection can be learned through online incremental experience with the environment, using only locally available information. To complete the picture of how dopamine influences free-operant learning and behavior, in Chapter 5 we describe an online algorithm of the type usually associated with instrumental learning and decision-making, which is suitable for learning to select actions and latencies according to our new framework. There are two major differences between learning in our model and previous online reinforcement learning algorithms: First, most prior applications have dealt with discounted reinforcement learning while we use average reward reinforcement learning. Second, unlike previous models that have focused on discrete action selection, the action space in our model is inherently continuous, as it includes a choice of response latency. We thus propose a new online learning algorithm that is specifically suitable for our needs. In this, building on the experimental characteristics of response latencies, we suggest a functional parameterization of the action space that drastically reduces the complexity of learning. Moreover, we suggest a formulation of online action selection in which response rates are directly affected by the net reward rate. We show that our algorithm learns to respond appropriately, and with nearly optimal latencies, and discuss its implications for the differences between the learning of interval and ratio schedules. In Chapter 6, the last of the main chapters, we deviate from the theoretical analysis of behavior, to describe two instrumental condi

2006

Niv, Y., Daw, N., & Dayan, P. (2006). Choice values. Nature Neuroscience, 9(8), 987–988. https://doi.org/10.1038/nn0806-987
Dopaminergic neurons are thought to inform decisions by reporting errors in reward prediction. A new study reports dopaminergic responses as monkeys make choices, supporting one computational theory of appetitive learning.
Niv, Y., Edlund, J., Dayan, P., & O’Doherty, J. (2006). Neural correlates of risk-sensitivity: An fMRI study of instrumental choice behavior. Society for Neuroscience Abstracts.
Dayan, P., Niv, Y., Seymour, B., & Daw, N. (2006). The misbehavior of value and the discipline of the will. Neural Networks, 19(8), 1153–1160. https://doi.org/10.1016/j.neunet.2006.03.002
Most reinforcement learning models of animal conditioning operate under the convenient, though fictive, assumption that Pavlovian conditioning concerns prediction learning whereas instrumental conditioning concerns action learning. However, it is only through Pavlovian responses that Pavlovian prediction learning is evident, and these responses can act against the instrumental interests of the subjects. This can be seen in both experimental and natural circumstances. In this paper we study the consequences of importing this competition into a reinforcement learning context, and demonstrate the resulting effects in an omission schedule and a maze navigation task. The misbehavior created by Pavlovian values can be quite debilitating; we discuss how it may be disciplined. ©2006 Elsevier Ltd. All rights reserved.
Daw, N., Niv, Y., & Dayan, P. (2006). Actions, Policies, Values, and the Basal Ganglia. In Recent Breakthroughs in Basal Ganglia Research (pp. 111–130). Nova Science Publishers Inc. PDF: Actions, Policies, Values, and the Basal Ganglia
Niv, Y., Joel, D., & Dayan, P. (2006). A normative perspective on motivation. Trends in Cognitive Science, 10(8), 375–381. https://doi.org/10.1016/j.tics.2006.06.010
Understanding the effects of motivation on instrumental action selection, and specifically on its two main forms, goal-directed and habitual control, is fundamental to the study of decision making. Motivational states have been shown to 'direct' goal-directed behavior rather straightforwardly towards more valuable outcomes. However, how motivational states can influence outcome-insensitive habitual behavior is more mysterious. We adopt a normative perspective, assuming that animals seek to maximize the utilities they achieve, and viewing motivation as a mapping from outcomes to utilities. We suggest that habitual action selection can direct responding properly only in motivational states which pertained during behavioral training. However, in novel states, we propose that outcome-independent, global effects of the utilities can 'energize' habitual actions.

2005

Niv, Y., Daw, N., & Dayan, P. (2005). How fast to work: response vigor, motivation and tonic dopamine. Neural Information Processing Systems, 18, 1019–1026.
Reinforcement learning models have long promised to unify computational, psychological and neural accounts of appetitively conditioned behavior. However, the bulk of data on animal conditioning comes from free-operant experiments measuring how fast animals will work for reinforcement. Existing reinforcement learning (RL) models are silent about these tasks, because they lack any notion of vigor. They thus fail to address the simple observation that hungrier animals will work harder for food, as well as stranger facts such as their sometimes greater productivity even when working for irrelevant outcomes such as water. Here, we develop an RL framework for free-operant behavior, suggesting that subjects choose how vigorously to perform selected actions by optimally balancing the costs and benefits of quick responding. Motivational states such as hunger shift these factors, skewing the tradeoff. This accounts normatively for the effects of motivation on response rates, as well as many other classic findings. Finally, we suggest that tonic levels of dopamine may be involved in the computation linking motivational state to optimal responding, thereby explaining the complex vigor-related effects of pharmacological manipulation of dopamine.
Niv, Y., Duff, M., & Dayan, P. (2005). Dopamine, uncertainty and TD learning. Behavioral and Brain Functions, 1, 6. https://doi.org/10.1186/1744-9081-1-6
Substantial evidence suggests that the phasic activities of dopaminergic neurons in the primate midbrain represent a temporal difference (TD) error in predictions of future reward, with increases above and decreases below baseline consequent on positive and negative prediction errors, respectively. However, dopamine cells have very low baseline activity, which implies that the representation of these two sorts of error is asymmetric. We explore the implications of this seemingly innocuous asymmetry for the interpretation of dopaminergic firing patterns in experiments with probabilistic rewards which bring about persistent prediction errors. In particular, we show that when averaging the non-stationary prediction errors across trials, a ramping in the activity of the dopamine neurons should be apparent, whose magnitude is dependent on the learning rate. This exact phenomenon was observed in a recent experiment, though being interpreted there in antipodal terms as a within-trial encoding of uncertainty.
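A toy simulation of the averaging effect described above: TD(0) learning with a 50% probabilistic reward, followed by trial-averaging of prediction errors whose negative part is compressed (to mimic the low baseline firing rate). The parameters and the specific asymmetry factor below are illustrative assumptions, not the paper's simulation.

```python
# Toy simulation of the asymmetric-coding account (illustrative parameters only).
import numpy as np

rng = np.random.default_rng(1)
T, alpha, d = 10, 0.1, 1 / 6     # trial length, learning rate, compression of negative errors
V = np.zeros(T + 1)              # value of each within-trial time step (V[T] stays 0)
deltas = []

for _ in range(5000):
    r = np.zeros(T + 1)
    r[T] = rng.random() < 0.5                         # reward delivered on half the trials
    delta_t = np.zeros(T + 1)
    for t in range(T):
        delta_t[t + 1] = r[t + 1] + V[t + 1] - V[t]   # TD error on arriving at step t+1
        V[t] += alpha * delta_t[t + 1]
    deltas.append(delta_t)

deltas = np.array(deltas[1000:])                        # discard early learning
represented = np.where(deltas > 0, deltas, d * deltas)  # low baseline compresses dips
print(np.round(represented.mean(axis=0), 3))            # trial-averaged signal ramps toward reward time
```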
Daw, N., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12), 1704–1711. https://doi.org/10.1038/nn1560
A broad range of neural and behavioral data suggests that the brain contains multiple systems for behavioral choice, including one associated with prefrontal cortex and another with dorsolateral striatum. However, such a surfeit of control raises an additional choice problem: how to arbitrate between the systems when they disagree. Here, we consider dual-action choice systems from a normative perspective, using the computational theory of reinforcement learning. We identify a key trade-off pitting computational simplicity against the flexible and statistically efficient use of experience. The trade-off is realized in a competition between the dorsolateral striatal and prefrontal systems. We suggest a Bayesian principle of arbitration between them according to uncertainty, so each controller is deployed when it should be most accurate. This provides a unifying account of a wealth of experimental evidence about the factors favoring dominance by either system.
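One way to caricature the arbitration principle described above, with hand-set uncertainty schedules standing in for the paper's Bayesian uncertainty estimates; a deliberately reduced sketch, not the published model.

```python
# Caricature of uncertainty-based arbitration between two controllers
# (hand-set uncertainty schedules; not the Bayesian model from the paper).
import numpy as np

rng = np.random.default_rng(2)

controllers = {
    # "tree": flexible prefrontal/model-based system; useful early, but its
    # estimates remain somewhat noisy (constant uncertainty).
    "tree":  {"value": 0.0, "alpha": 0.5,  "uncertainty": lambda n: 0.2},
    # "cache": dorsolateral-striatal habit system; slow to learn, but its cached
    # estimate becomes increasingly reliable with experience.
    "cache": {"value": 0.0, "alpha": 0.05, "uncertainty": lambda n: 1.0 / (1 + 0.1 * n)},
}

def run(true_value=1.0, n_trials=200):
    for n in range(n_trials):
        # Arbitration: the currently more certain controller dictates behavior.
        arbiter = min(controllers, key=lambda name: controllers[name]["uncertainty"](n))
        reward = true_value + rng.normal(0.0, 0.5)
        for c in controllers.values():               # both systems learn from the outcome
            c["value"] += c["alpha"] * (reward - c["value"])
        if n in (0, 50, n_trials - 1):
            print(n, arbiter, round(controllers[arbiter]["value"], 2))

run()  # control shifts from "tree" early on to "cache" after extended training
```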

2002

Reinforcement learning is a fundamental process by which organisms learn to achieve goals from their interactions with the environment. Using evolutionary computation techniques we evolve (near-)optimal neuronal learning rules in a simple neural network model of reinforcement learning in bumblebees foraging for nectar. The resulting neural networks exhibit efficient reinforcement learning, allowing the bees to respond rapidly to changes in reward contingencies. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels and to the well-documented choice strategies of risk aversion and probability matching. Additionally, risk aversion is shown to emerge even when bees are evolved in a completely risk-less environment. In contrast to existing theories in economics and game theory, risk-averse behavior is shown to be a direct consequence of (near-)optimal reinforcement learning, without requiring additional assumptions such as the existence of a nonlinear subjective utility function for rewards. Our results are corroborated by a rigorous mathematical analysis, and their robustness in real-world situations is supported by experiments in a mobile robot. Thus we provide a biologically founded, parsimonious, and novel explanation for risk aversion and probability matching.
Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: new anatomical and computational perspectives. Neural Networks, 15(4-6), 535–547. https://doi.org/10.1016/S0893-6080(02)00047-3
A large number of computational models of information processing in the basal ganglia have been developed in recent years. Prominent in these are actor-critic models of basal ganglia functioning, which build on the strong resemblance between dopamine neuron activity and the temporal difference prediction error signal in the critic, and between dopamine-dependent long-term synaptic plasticity in the striatum and learning guided by a prediction error signal in the actor. We selectively review several actor-critic models of the basal ganglia with an emphasis on two important aspects: the way in which models of the critic reproduce the temporal dynamics of dopamine firing, and the extent to which models of the actor take into account known basal ganglia anatomy and physiology. To complement the efforts to relate basal ganglia mechanisms to reinforcement learning (RL), we introduce an alternative approach to modeling a critic network, which uses Evolutionary Computation techniques to 'evolve' an optimal RL mechanism, and relate the evolved mechanism to the basic model of the critic. We conclude our discussion of models of the critic by a critical discussion of the anatomical plausibility of implementations of a critic in basal ganglia circuitry, and conclude that such implementations build on assumptions that are inconsistent with the known anatomy of the basal ganglia. We return to the actor component of the actor-critic model, which is usually modeled at the striatal level with very little detail. We describe an alternative model of the basal ganglia which takes into account several important, and previously neglected, anatomical and physiological characteristics of basal ganglia-thalamocortical connectivity and suggests that the basal ganglia performs reinforcement-biased dimensionality reduction of cortical inputs. We further suggest that since such selective encoding may bias the representation at the level of the frontal cortex towards the selection of rewarded plans and actions, the reinforcement-driven dimensionality reduction framework may serve as a basis for basal ganglia actor models. We conclude with a short discussion of the dual role of the dopamine signal in RL and in behavioral switching. Copyright ©2002 Elsevier Science Ltd.
Niv, Y., Joel, D., Meilijson, I., & Ruppin, E. (2002). Evolution of reinforcement learning in foraging bees: A simple explanation for risk averse behavior. Neurocomputing, 44-46, 951–956. https://doi.org/10.1016/S0925-2312(02)00496-4
Reinforcement learning is a fundamental process by which organisms learn to achieve goals from their interactions with the environment. We use evolutionary computation techniques to derive (near-)optimal neuronal learning rules in a simple neural network model of decision-making in simulated bumblebees foraging for nectar. The resulting bees exhibit efficient reinforcement learning. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels and to the well-documented foraging strategy of risk aversion. This behavior is shown to emerge directly from optimal reinforcement learning, providing a biologically founded, parsimonious and novel explanation of risk-averse behavior. ©2002 Published by Elsevier Science B.V.

2001

Reinforcement learning (RL) is a fundamental process by which organisms learn to achieve a goal from interactions with the environment. Using Artificial Life techniques we derive (near-)optimal neuronal learning rules in a simple neural network model of decision-making in simulated bumblebees foraging for nectar. The resulting networks exhibit efficient RL, allowing the bees to respond rapidly to changes in reward contingencies. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels from which emerge the well-documented foraging strategies of risk aversion and probability matching. These are shown to be a direct result of optimal RL, providing a biologically founded, parsimonious and novel explanation for these behaviors. Our results are corroborated by a rigorous mathematical analysis and by experiments in mobile robots. ©2001 Springer-Verlag Berlin Heidelberg.

2023

Zorowitz, S., & Niv, Y. (2023). Improving the reliability of cognitive task measures: A narrative review. https://doi.org/10.1016/j.bpsc.2023.02.004 (Original work published 2023)

Cognitive tasks are capable of providing researchers with crucial insights into the relationship between cognitive processing and psychiatric phenomena. However, many recent studies have found that task measures exhibit poor reliability, which hampers their usefulness for individual-differences research. Here we provide a narrative review of approaches to improve the reliability of cognitive task measures. Specifically, we introduce a taxonomy of experiment design and analysis strategies for improving task reliability. Where appropriate, we highlight studies that are exemplary for improving the reliability of specific task measures. We hope that this article can serve as a helpful guide for experimenters who wish to design a new task, or improve an existing one, to achieve sufficient reliability for use in individual-differences research.
