Publications

Submitted
Bennett, D., Davidson, G., & Niv, Y. (Submitted). A model of mood as integrated advantage.
Mood is an integrative and diffuse affective state that is thought to exert a pervasive effect on cognition and behavior. At the same time, mood itself is thought to fluctuate slowly as a product of feedback from interactions with the environment. Here we present a new computational theory of the valence of mood—the Integrated Advantage model—that seeks to account for this bidirectional interaction. Adopting theoretical formalisms from reinforcement learning, we propose to conceptualize the valence of mood as a leaky integral of an agent’s appraisals of the Advantage of its actions. This model generalizes and extends previous models of mood wherein affective valence was conceptualized as a moving average of reward prediction errors. We give a full theoretical derivation of the Integrated Advantage model and provide a functional explanation of how an integrated-Advantage variable could be deployed adaptively by a biological agent to accelerate learning in complex and/or stochastic environments. Specifically, drawing on stochastic optimization theory, we propose that an agent can utilize our hypothesized form of mood to approximate a momentum-based update to its behavioral policy, thereby facilitating rapid learning of optimal actions. We then show how this model of mood provides a principled and parsimonious explanation for a number of contextual effects on mood from the affective science literature, including expectation- and surprise-related effects, counterfactual effects from information about foregone alternatives, action-typicality effects, and action/inaction asymmetry.
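Since the Integrated Advantage model is described only verbally here, a minimal sketch may help make the core update concrete. The Python snippet below assumes a single-state bandit, a hypothetical leak parameter, and a standard delta-rule value estimate; it is an illustration of a leaky integral of Advantage appraisals, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch only: mood as a leaky integral of Advantage appraisals.
# The one-state bandit environment and all parameter values are assumptions.
rng = np.random.default_rng(0)

n_trials = 200
decay = 0.9      # leak of the mood integrator (hypothetical value)
alpha_v = 0.1    # learning rate for the state-value estimate

V = 0.0          # estimated value of the (single) state
mood = 0.0       # integrated-Advantage mood variable

for t in range(n_trials):
    r = rng.normal(loc=1.0, scale=1.0)               # outcome of the chosen action
    advantage = r - V                                 # appraisal of the action's Advantage
    mood = decay * mood + (1 - decay) * advantage     # leaky integration of Advantage
    V += alpha_v * (r - V)                            # standard value update

print(f"final value estimate: {V:.2f}, final mood: {mood:.2f}")
```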
2020
Drummond, N., & Niv, Y. (2020). Model-based decision making and model-free learning. Current Biology, 30(15), 860–865.
Free will is anything but free. With it comes the onus of choice: not only what to do, but which inner voice to listen to — our ‘automatic’ response system, which some consider ‘impulsive’ or ‘irrational’, or our supposedly more rational deliberative one. Rather than a devil and angel sitting on our shoulders, research suggests that we have two decision-making systems residing in the brain, in our basal ganglia. Neither system is the devil and neither is irrational. They both have our best interests at heart and aim to suggest the best course of action calculated through rational algorithms. However, the algorithms they use are qualitatively different and do not always agree on which action is optimal. The rivalry between habitual, fast action and deliberative, purposeful action is an ongoing one.
Rouhani, N., Norman, K. A., Niv, Y., & Bornstein, A. M. (2020). Reward prediction errors create event boundaries in memory. Cognition.
We remember when things change. Particularly salient are experiences where there is a change in rewards, eliciting reward prediction errors (RPEs). How do RPEs influence our memory of those experiences? One idea is that this signal directly enhances the encoding of memory. Another, not mutually exclusive, idea is that the RPE signals a deeper change in the environment, leading to the mnemonic separation of subsequent experiences from what came before, thereby creating a new latent context and a more separate memory trace. We tested this in four experiments where participants learned to predict rewards associated with a series of trial-unique images. High-magnitude RPEs indicated a change in the underlying distribution of rewards. To test whether these large RPEs created a new latent context, we first assessed recognition priming for sequential pairs that included a high-RPE event or not (Exp. 1: n = 27 & Exp. 2: n = 83). We found evidence of recognition priming for the high-RPE event, indicating that the high-RPE event is bound to its predecessor in memory. Given that high-RPE events are themselves preferentially remembered (Rouhani, Norman, & Niv, 2018), we next tested whether there was an event boundary across a high-RPE event (i.e., excluding the high-RPE event itself; Exp. 3: n = 85). Here, sequential pairs across a high RPE no longer showed recognition priming whereas pairs within the same latent reward state did, providing initial evidence for an RPE-modulated event boundary. We then investigated whether RPE event boundaries disrupt temporal memory by asking participants to order and estimate the distance between two events that had either included a high-RPE event between them or not (Exp. 4). We found (n = 49) and replicated (n = 77) worse sequence memory for events across a high RPE. In line with our recognition priming results, we did not find sequence memory to be impaired between the high-RPE event and its predecessor, but instead found worse sequence memory for pairs across a high-RPE event. Moreover, greater distance between events at encoding led to better sequence memory for events across a low-RPE event, but not a high-RPE event, suggesting separate mechanisms for the temporal ordering of events within versus across a latent reward context. Altogether, these findings demonstrate that high-RPE events are more strongly encoded, show intact links with their predecessors, and act as event boundaries that interrupt the sequential integration of events. We captured these effects in a variant of the Context Maintenance and Retrieval model (CMR; Polyn, Norman, & Kahana, 2009), modified to incorporate RPEs into the encoding process.
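As a rough illustration of the event-boundary idea (not the modified CMR model itself), the sketch below lets a context vector drift slowly within a latent reward state and jump to a new random context whenever the absolute RPE exceeds a threshold; the dimensionality, drift rate, and threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

dim = 20                 # dimensionality of the context vector (assumed)
drift = 0.05             # slow drift rate within a latent reward state (assumed)
rpe_threshold = 2.0      # |RPE| above this triggers a context shift (assumed)

context = rng.normal(size=dim)
contexts = [context.copy()]

rpes = rng.normal(0.0, 1.0, size=50)
rpes[[15, 35]] = 4.0     # two surprising, high-RPE events

for rpe in rpes:
    if abs(rpe) > rpe_threshold:
        context = rng.normal(size=dim)            # event boundary: new latent context
    else:
        context = (1 - drift) * context + drift * rng.normal(size=dim)  # slow drift
    contexts.append(context.copy())

# similarity between successive contexts drops sharply at the high-RPE boundaries
sims = [np.corrcoef(a, b)[0, 1] for a, b in zip(contexts[:-1], contexts[1:])]
print(np.round(sims[14:17], 2))
```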
Cai, M. B., Shvartsman, M., Wu, A., Zhang, H., & Ju, X. (2020). Incorporating structured assumptions with probabilistic graphical models in fMRI data analysis. Neuropsychologia.
With the wide adoption of functional magnetic resonance imaging (fMRI) by cognitive neuroscience researchers, large volumes of brain imaging data have been accumulated in recent years. Aggregating these data to derive scientific insights often faces the challenge that fMRI data are high-dimensional, heterogeneous across people, and noisy. These challenges demand the development of computational tools that are tailored both for the neuroscience questions and for the properties of the data. We review a few recently developed algorithms in various domains of fMRI research: fMRI in naturalistic tasks, analyzing full-brain functional connectivity, pattern classification, inferring representational similarity and modeling structured residuals. These algorithms all tackle the challenges in fMRI similarly: they start by making clear statements of assumptions about neural data and existing domain knowledge, incorporate those assumptions and domain knowledge into probabilistic graphical models, and use those models to estimate properties of interest or latent structures in the data. Such approaches can avoid erroneous findings, reduce the impact of noise, better utilize known properties of the data, and better aggregate data across groups of subjects. With these successful cases, we advocate wider adoption of explicit model construction in cognitive neuroscience. Although we focus on fMRI, the principle illustrated here is generally applicable to brain data of other modalities.
Langdon, A. J., & Daw, N. (2020). Beyond the Average View of Dopamine. Trends in Cognitive Sciences.
Dopamine (DA) responses are synonymous with the ‘reward prediction error’ of reinforcement learning (RL), and are thought to update neural estimates of expected value. A recent study by Dabney et al. enriches this picture, demonstrating that DA neurons track variability in rewards, providing a readout of risk in the brain.
Sharpe, M. J., Batchelor, H. M., Mueller, L. E., Chang, C. Y., Maes, E. J. P., Niv, Y., & Schoenbaum, G. (2020). Dopamine transients do not act as model-free prediction errors during associative learning. Nature Communications, 11(1), 106.
2019
Bravo-Hermsdorff, G., Felso, V., Ray, E., Gunderson, L. M., Helander, M. E., Maria, J., & Niv, Y. (2019). Gender and collaboration patterns in a temporal scientific authorship network. Applied Network Science, 4(1), 112.
One can point to a variety of historical milestones for gender equality in STEM (science, technology, engineering, and mathematics); however, practical effects are incremental and ongoing. It is important to quantify gender differences in subdomains of scientific work in order to detect potential biases and monitor progress. In this work, we study the relevance of gender in scientific collaboration patterns in the Institute for Operations Research and the Management Sciences (INFORMS), a professional society with sixteen peer-reviewed journals. Using their publication data from 1952 to 2016, we constructed a large temporal bipartite network between authors and publications, and augmented the author nodes with gender labels. We characterized differences in several basic statistics of this network over time, highlighting how they have changed with respect to relevant historical events. We find a steady increase in participation by women (e.g., fraction of authorships by women and of new women authors) starting around 1980. However, women still comprise less than 25% of the INFORMS society and an even smaller fraction of authors with many publications. Moreover, we describe a methodology for quantifying the structural role of an authorship with respect to the overall connectivity of the network, using it to measure subtle differences between authorships by women and by men. Specifically, as measures of structural importance of an authorship, we use effective resistance and contraction importance, two measures related to diffusion throughout a network. As a null model, we propose a degree-preserving temporal and geometric network model with emergent communities. Our results suggest the presence of systematic differences between the collaboration patterns of men and women that cannot be explained by only local statistics.
Niv, Y. (2019). Learning task-state representations. Nature Neuroscience, 22(10), 1544–1553.
Arguably, the most difficult part of learning is deciding what to learn about. Should I associate the positive outcome of safely completing a street-crossing with the situation ‘the car approaching the crosswalk was red' or with ‘the approaching car was slowing down'? In this Perspective, we summarize our recent research into the computational and neural underpinnings of ‘representation learning'—how humans (and other animals) construct task representations that allow efficient learning and decision-making. We first discuss the problem of learning what to ignore when confronted with too much information, so that experience can properly generalize across situations. We then turn to the problem of augmenting perceptual information with inferred latent causes that embody unobservable task-relevant information, such as contextual knowledge. Finally, we discuss recent findings regarding the neural substrates of task representations that suggest the orbitofrontal cortex represents ‘task states', deploying them for decision-making and learning elsewhere in the brain.
Schuck, N. W., & Niv, Y. (2019). Sequential replay of nonspatial task states in the human hippocampus. Science.
Sequential neural activity patterns related to spatial experiences are “replayed” in the hippocampus of rodents during rest. We investigated whether replay of nonspatial sequences can be detected noninvasively in the human hippocampus. Participants underwent functional magnetic resonance imaging (fMRI) while resting after performing a decision-making task with sequential structure. Hippocampal fMRI patterns recorded at rest reflected sequentiality of previously experienced task states, with consecutive patterns corresponding to nearby states. Hippocampal sequentiality correlated with the fidelity of task representations recorded in the orbitofrontal cortex during decision-making, which were themselves related to better task performance. Our findings suggest that hippocampal replay may be important for building representations of complex, abstract tasks elsewhere in the brain and establish feasibility of investigating fast replay signals with fMRI.
Rouhani, N., & Niv, Y. (2019). Depressive symptoms bias the prediction-error enhancement of memory towards negative events in reinforcement learning. Psychopharmacology, 236(8), 2425–2435.
Rationale. Depression is a disorder characterized by sustained negative affect and blunted positive affect, suggesting potential abnormalities in reward learning and its interaction with episodic memory. Objectives. This study investigated how reward prediction errors experienced during learning modulate memory for rewarding events in individuals with depressive and non-depressive symptoms. Methods. Across three experiments, participants learned the average values of two scene categories in two learning contexts. Each learning context had either high or low outcome variance, allowing us to test the effects of small and large prediction errors on learning and memory. Participants were later tested for their memory of trial-unique scenes that appeared alongside outcomes. We compared learning and memory performance of individuals with self-reported depressive symptoms (N = 101) to those without (N = 184). Results. Although there were no overall differences in reward learning between the depressive and non-depressive group, depression severity within the depressive group predicted greater error in estimating the values of the scene categories. Similarly, there were no overall differences in memory performance. However, in depressive participants, negative prediction errors enhanced episodic memory more so than did positive prediction errors, and vice versa for non-depressive participants who showed a larger effect of positive prediction errors on memory. These results reflected differences in memory both within group and across groups. Conclusions. Individuals with self-reported depressive symptoms showed relatively intact reinforcement learning, but demonstrated a bias for encoding events that accompanied surprising negative outcomes versus surprising positive ones. We discuss a potential neural mechanism supporting these effects, which may underlie or contribute to the excessive negative affect observed in depression.
Sharpe, M. J., Batchelor, H. M., Mueller, L. E., Chang, C. Y., Maes, E. J. P., Niv, Y., & Schoenbaum, G. (2019). Dopamine transients delivered in learning contexts do not act as model-free prediction errors. bioRxiv.
Dopamine neurons fire transiently in response to unexpected rewards. These neural correlates are proposed to signal the reward prediction error described in model-free reinforcement learning algorithms. This error term represents the unpredicted or excess value of the rewarding event. In model-free reinforcement learning, this value is then stored as part of the learned value of any antecedent cues, contexts or events, making them intrinsically valuable, independent of the specific rewarding event that caused the prediction error. In support of equivalence between dopamine transients and this model-free error term, proponents cite causal optogenetic studies showing that artificially induced dopamine transients cause lasting changes in behavior. Yet none of these studies directly demonstrate the presence of cached value under conditions appropriate for associative learning. To address this gap in our knowledge, we conducted three studies where we optogenetically activated dopamine neurons while rats were learning associative relationships, both with and without reward. In each experiment, the antecedent cues failed to acquire value and instead entered into value-independent associative relationships with the other cues or rewards. These results show that dopamine transients, constrained within appropriate learning situations, support valueless associative learning.
Radulescu, A., Niv, Y., & Ballard, I. (2019). Holistic Reinforcement Learning: The Role of Structure and Attention. Trends in Cognitive Sciences.
Compact representations of the environment allow humans to behave efficiently in a complex world. Reinforcement learning models capture many behavioral and neural effects but do not explain recent findings showing that structure in the environment influences learning. In parallel, Bayesian cognitive models predict how humans learn structured knowledge but do not have a clear neurobiological implementation. We propose an integration of these two model classes in which structured knowledge learned via approximate Bayesian inference acts as a source of selective attention. In turn, selective attention biases reinforcement learning towards relevant dimensions of the environment. An understanding of structure learning will help to resolve the fundamental challenge in decision science: explaining why people make the decisions they do.
McDougle, S. D., Butcher, P. A., Parvin, D. E., Mushtaq, F., Niv, Y., Ivry, R. B., & Taylor, J. A. (2019). Neural Signatures of Prediction Errors in a Decision-Making Task Are Modulated by Action Execution Failures. Current Biology.
Decisions must be implemented through actions, and actions are prone to error. As such, when an expected outcome is not obtained, an individual should be sensitive to not only whether the choice itself was suboptimal but also whether the action required to indicate that choice was executed successfully. The intelligent assignment of credit to action execution versus action selection has clear ecological utility for the learner. To explore this, we used a modified version of a classic reinforcement learning task in which feedback indicated whether negative prediction errors were, or were not, associated with execution errors. Using fMRI, we asked if prediction error computations in the human striatum, a key substrate in reinforcement learning and decision making, are modulated when a failure in action execution results in the negative outcome. Participants were more tolerant of non-rewarded outcomes when these resulted from execution errors versus when execution was successful, but reward was withheld. Consistent with this behavior, a model-driven analysis of neural activity revealed an attenuation of the signal associated with negative reward prediction errors in the striatum following execution failures. These results converge with other lines of evidence suggesting that prediction errors in the mesostriatal dopamine system integrate high-level information during the evaluation of instantaneous reward outcomes.
Zhou, J., Gardner, M. P. H., Stalnaker, T. A., Ramus, S. J., Wikenheiser, A. M., Niv, Y., & Schoenbaum, G. (2019). Rat Orbitofrontal Ensemble Activity Contains Multiplexed but Dissociable Representations of Value and Task Structure in an Odor Sequence Task. Current Biology, 29(6), 897–907.e3.
The orbitofrontal cortex (OFC) has long been implicated in signaling information about expected outcomes to facilitate adaptive or flexible behavior. Current proposals focus on signaling of expected value versus the representation of a value-agnostic cognitive map of the task. While often suggested as mutually exclusive, these alternatives may represent extreme ends of a continuum determined by task complexity and experience. As learning proceeds, an initial, detailed cognitive map might be acquired, based largely on external information. With more experience, this hypothesized map can then be tailored to include relevant abstract hidden cognitive constructs. The map would default to an expected value in situations where other attributes are largely irrelevant, but, in richer tasks, a more detailed structure might continue to be represented, at least where relevant to behavior. Here, we examined this by recording single-unit activity from the OFC in rats navigating an odor sequence task analogous to a spatial maze. The odor sequences provided a mappable state space, with 24 unique “positions” defined by sensory information, likelihood of reward, or both. Consistent with the hypothesis that the OFC represents a cognitive map tailored to the subjects' intentions or plans, we found a close correspondence between how subjects were using the sequences and the neural representations of the sequences in OFC ensembles. Multiplexed with this value-invariant representation of the task, we also found a representation of the expected value at each location. Thus, the value and task structure co-existed as dissociable components of the neural code in OFC.
Langdon, A. J., Hathaway, B. A., Zorowitz, S., Harris, C. B. W., & Winstanley, C. A. (2019). Relative insensitivity to time-out punishments induced by win-paired cues in a rat gambling task. Psychopharmacology, 236(8), 2543–2556.
Rationale. Pairing rewarding outcomes with audiovisual cues in simulated gambling games increases risky choice in both humans and rats. However, the cognitive mechanism through which this sensory enhancement biases decision-making is unknown. Objectives. To assess the computational mechanisms that promote risky choice during gambling, we applied a series of reinforcement learning models to a large dataset of choices acquired from rats as they each performed one of two variants of a rat gambling task (rGT), in which rewards on “win” trials were delivered either with or without salient audiovisual cues. Methods. We used a sampling technique based on Markov chain Monte Carlo to obtain posterior estimates of model parameters for a series of RL models of increasing complexity, in order to assess the relative contribution of learning about positive and negative outcomes to the latent valuation of each choice option on the cued and uncued rGT. Results. Rats which develop a preference for the risky options on the rGT substantially down-weight the equivalent cost of the time-out punishments during these tasks. For each model tested, the reduction in learning from the negative time-outs correlated with the degree of risk preference in individual rats. We found no apparent relationship between risk preference and the parameters that govern learning from the positive rewards. Conclusions. The emergence of risk-preferring choice on the rGT derives from a relative insensitivity to the cost of the time-out punishments, as opposed to a relative hypersensitivity to rewards. This hyposensitivity to punishment is more likely to be induced in individual rats by the addition of salient audiovisual cues to rewards delivered on win trials.
Cai, M. B., Schuck, N. W., Pillow, J. W., & Niv, Y. (2019). Representational structure or task structure? Bias in neural representational similarity analysis and a Bayesian method for reducing bias. PLoS Computational Biology.
The activity of neural populations in the brains of humans and animals can exhibit vastly different spatial patterns when faced with different tasks or environmental stimuli. The degrees of similarity between these neural activity patterns in response to different events are used to characterize the representational structure of cognitive states in a neural population. The dominant methods of investigating this similarity structure first estimate neural activity patterns from noisy neural imaging data using linear regression, and then examine the similarity between the estimated patterns. Here, we show that this approach introduces spurious bias structure in the resulting similarity matrix, in particular when applied to fMRI data. This problem is especially severe when the signal-to-noise ratio is low and in cases where experimental conditions cannot be fully randomized in a task. We propose Bayesian Representational Similarity Analysis (BRSA), an alternative method for computing representational similarity, in which we treat the covariance structure of neural activity patterns as a hyper-parameter in a generative model of the neural data. By marginalizing over the unknown activity patterns, we can directly estimate this covariance structure from imaging data. This method offers significant reductions in bias and allows estimation of neural representational similarity with previously unattained levels of precision at low signal-to-noise ratio, without losing the possibility of deriving an interpretable distance measure from the estimated similarity. The method is closely related to Pattern Component Model (PCM), but instead of modeling the estimated neural patterns as in PCM, BRSA models the imaging data directly and is suited for analyzing data in which the order of task conditions is not fully counterbalanced. The probabilistic framework allows for jointly analyzing data from a group of participants. The method can also simultaneously estimate a signal-to-noise ratio map that shows where the learned representational structure is supported more strongly. Both this map and the learned covariance matrix can be used as a structured prior for maximum a posteriori estimation of neural activity patterns, which can be further used for fMRI decoding. Our method therefore paves the way towards a more unified and principled analysis of neural representations underlying fMRI signals. We make our tool freely available in Brain Imaging Analysis Kit (BrainIAK).
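The bias described above can be reproduced in a few lines. The simulation below (all settings are arbitrary assumptions, not the paper's analysis) gives two temporally overlapping task regressors truly unrelated voxel patterns and shows that the correlation between the point-estimated patterns is pushed systematically away from zero by the design structure; BRSA avoids this by modeling the covariance of the patterns directly rather than correlating noisy point estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
n_time, n_vox, n_sims = 200, 500, 200

# two task regressors that overlap in time (a temporally correlated design)
x1 = np.convolve((rng.random(n_time) < 0.1).astype(float), np.ones(8), mode="same")
x2 = np.roll(x1, 3) + 0.2 * rng.normal(size=n_time)
X = np.column_stack([x1, x2])

corrs = []
for _ in range(n_sims):
    B_true = rng.normal(size=(2, n_vox))                      # truly unrelated patterns
    Y = X @ B_true + 10.0 * rng.normal(size=(n_time, n_vox))  # noisy simulated data
    B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]              # per-condition pattern estimates
    corrs.append(np.corrcoef(B_hat[0], B_hat[1])[0, 1])

# the true pattern correlation is ~0, but the estimated similarity is biased
# in a direction set by the design correlation (here, negative)
print(f"mean estimated pattern correlation: {np.mean(corrs):.2f}")
```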
Radulescu, A., & Niv, Y. (2019). State representation in mental illness. Current Opinion in Neurobiology.
Reinforcement learning theory provides a powerful set of computational ideas for modeling human learning and decision making. Reinforcement learning algorithms rely on state representations that enable efficient behavior by focusing only on aspects relevant to the task at hand. Forming such representations often requires selective attention to the sensory environment, and recalling memories of relevant past experiences. A striking range of psychiatric disorders, including bipolar disorder and schizophrenia, involve changes in these cognitive processes. We review and discuss evidence that these changes can be cast as altered state representation, with the goal of providing a useful transdiagnostic dimension along which mental disorders can be understood and compared.
Bennett, D., Silverstein, S. M., & Niv, Y. (2019). The two cultures of computational psychiatry. JAMA Psychiatry.
Translating advances in neuroscience into benefits for patients with mental illness presents enormous challenges because it involves both the most complex organ, the brain, and its interaction with a similarly complex environment. Dealing with such complexities demands powerful techniques. Computational psychiatry combines multiple levels and types of computation with multiple types of data in an effort to improve understanding, prediction and treatment of mental illness. Computational psychiatry, broadly defined, encompasses two complementary approaches: data driven and theory driven. Data-driven approaches apply machine-learning methods to high-dimensional data to improve classification of disease, predict treatment outcomes or improve treatment selection. These approaches are generally agnostic as to the underlying mechanisms. Theory-driven approaches, in contrast, use models that instantiate prior knowledge of, or explicit hypotheses about, such mechanisms, possibly at multiple levels of analysis and abstraction. We review recent advances in both approaches, with an emphasis on clinical applications, and highlight the utility of combining them.
Langdon, A. J., Song, M., & Niv, Y. (2019). Uncovering the ‘state': Tracing the hidden state representations that structure learning and decision-making. Behavioural Processes, 167, 103891.
We review the abstract concept of a ‘state' – an internal representation posited by reinforcement learning theories to be used by an agent, whether animal, human or artificial, to summarize the features of the external and internal environment that are relevant for future behavior on a particular task. Armed with this summary representation, an agent can make decisions and perform actions to interact effectively with the world. Here, we review recent findings from the neurobiological and behavioral literature to ask: ‘what is a state?' with respect to the internal representations that organize learning and decision making across a range of tasks. We find that state representations include information beyond a straightforward summary of the immediate cues in the environment, providing timing or contextual information from the recent or more distant past, which allows these additional factors to influence decision making and other goal-directed behaviors in complex and perhaps unexpected ways.
2018
Niv, Y. (2018). Deep down, you are a scientist. In Think tank: Forty neuroscientists explore the biological roots of human experience.
You may not know it, but deep down you are a scientist. To be precise, your brain is a scientist—and a good one, too: the kind of scientist that makes clear hypotheses, gathers data from several sources, and then reaches a well-founded conclusion. Although we are not aware of the scientific experimentation occurring in our brain on a momentary basis, the scientific process is fundamental to how our brain works. This scientific process involves three key components. First: hypotheses. Our brain makes hypotheses, or predictions, all the time. The second component of good scientific work is gathering data—testing the hypothesis by comparing it to evidence. Neuroscientists gather data to test theories about how the brain works from several sources—for example, behavior, invasive recordings of the activity of single cells in the brain, and noninvasive imaging of overall activity in large areas of the brain. Finally, after making precise, well-founded predictions and gathering data from all available sources, a scientist must interpret the empirical observations. It is important to realize that perceived reality is subjective—it is interpreted—rather than an objective image of the world out there. And in some cases this interpretation can break down. For instance, in schizophrenia, meaningless events and distractors can take on outsized meaning in subjective interpretation, leading to hallucinations, delusions, and paranoia. Our memories are similarly a reflection of our own interpretations rather than a true record of events.
Rouhani, N., Norman, K. A., & Niv, Y. (2018). Dissociable effects of surprising rewards on learning and memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 44(9), 1430–1443.
The extent to which rewards deviate from learned expectations is tracked by a signal known as a reward prediction error, but it is unclear how this signal interacts with episodic memory. Here, we investigated whether learning in a high-risk environment, with frequent large prediction errors, gives rise to higher fidelity memory traces than learning in a low-risk environment. In Experiment 1, we showed that higher magnitude prediction errors, positive or negative, improved recognition memory for trial-unique items. Participants also increased their learning rate after large prediction errors. In addition, there was an overall higher learning rate in the low-risk environment. Although unsigned prediction errors enhanced memory and increased learning rate, we did not find a relationship between learning rate and memory, suggesting that these two effects were due to separate underlying mechanisms. In Experiment 2, we replicated these results with a longer task that posed stronger memory demands and allowed for more learning. We also showed improved source and sequence memory for high-risk items. In Experiment 3, we controlled for the difficulty of learning in the two risk environments, again replicating the previous results. Moreover, equating the range of prediction errors in the two risk environments revealed that learning in a high-risk context enhanced episodic memory above and beyond the effect of prediction errors to individual items. In summary, our results across three studies showed that (absolute) prediction error magnitude boosted both episodic memory and incremental learning, but the two effects were not correlated, suggesting distinct underlying systems.
Sharpe, M. J., Chang, C. Y., Liu, M. A., Batchelor, H. M., Mueller, L. E., Jones, J. L., Niv, Y., et al. (2018). Dopamine transients are sufficient and necessary for acquisition of model-based associations. Nature Neuroscience, 21(10), 1493.
Learning to predict reward is thought to be driven by dopaminergic prediction errors, which reflect discrepancies between actual and expected value. Here the authors show that learning to predict neutral events is also driven by prediction errors and that such value-neutral associative learning is also likely mediated by dopaminergic error signals.
Sharpe, M. J., Stalnaker, T. A., Schuck, N. W., Killcross, S., Schoenbaum, G., & Niv, Y. (2018). An Integrated Model of Action Selection: Distinct Modes of Cortical Control of Striatal Decision Making. Annual Review of Psychology.
Making decisions in environments with few choice options is easy. We select the action that results in the most valued outcome. Making decisions in more complex environments, where the same action can produce different outcomes in different conditions, is much harder. In such circumstances, we propose that accurate action selection relies on top-down control from the prelimbic and orbitofrontal cortices over striatal activity through distinct thalamostriatal circuits. We suggest that the prelimbic cortex exerts direct influence over medium spiny neurons in the dorsomedial striatum to represent the state space relevant to the current environment. Conversely, the orbitofrontal cortex is argued to track a subject's position within that state space, likely through modulation of cholinergic interneurons.
Langdon, A. J., Sharpe, M. J., Schoenbaum, G., & Niv, Y. (2018). Model-based predictions for dopamine. Current Opinion in Neurobiology, 49, 1–7.
Phasic dopamine responses are thought to encode a prediction-error signal consistent with model-free reinforcement learning theories. However, a number of recent findings highlight the influence of model-based computations on dopamine responses, and suggest that dopamine prediction errors reflect more dimensions of an expected outcome than scalar reward value. Here, we review a selection of these recent results and discuss the implications and complications of model-based predictions for computational theories of dopamine and learning.
Hermsdorff, G. B., Pereira, T., & Niv, Y. (2018). Quantifying Humans' Priors Over Graphical Representations of Tasks. In Springer Proceedings in Complexity (pp. 281–290).
Some new tasks are trivial to learn while others are almost impossible; what determines how easy it is to learn an arbitrary task? Similar to how our prior beliefs about new visual scenes color our perception of new stimuli, our priors about the structure of new tasks shape our learning and generalization abilities [2]. While quantifying visual priors has led to major insights on how our visual system works [5,10,11], quantifying priors over tasks remains a formidable goal, as it is not even clear how to define a task [4]. Here, we focus on tasks that have a natural mapping to graphs. We develop a method to quantify humans' priors over these “task graphs”, combining new modeling approaches with Markov chain Monte Carlo with people, MCMCP (a process whereby an agent learns from data generated by another agent, recursively [9]). We show that our method recovers priors more accurately than a standard MCMC sampling approach. Additionally, we propose a novel low-dimensional “smooth” parametrization of probability distributions over graphs (in the sense that graphs that differ by fewer edges are given similar probabilities), which allows for more accurate recovery of the prior and better generalization. We have also created an online experiment platform that gamifies our MCMCP algorithm and allows subjects to interactively draw the task graphs. We use this platform to collect human data on several navigation and social interaction tasks. We show that priors over these tasks have non-trivial structure, deviating significantly from null models that are insensitive to the graphical information. The priors also notably differ between the navigation and social domains, showing fewer differences between cover stories within the same domain. Finally, we extend our framework to the more general case of quantifying priors over exchangeable random structures.
Schuck, N. W., Wilson, R. C., & Niv, Y. (2018). A State Representation for Reinforcement Learning and Decision-Making in the Orbitofrontal Cortex. In Goal-Directed Decision Making.
Despite decades of research, the exact ways in which the orbitofrontal cortex (OFC) influences cognitive function have remained mysterious. Anatomically, the OFC is characterized by remarkably broad connectivity to sensory, limbic and subcortical areas, and functional studies have implicated the OFC in a plethora of functions ranging from facial processing to value-guided choice. Notwithstanding such diversity of findings, much research suggests that one important function of the OFC is to support decision making and reinforcement learning. Here, we describe a novel theory that posits that OFC's specific role in decision-making is to provide an up-to-date representation of task-related information, called a state representation. This representation reflects a mapping between distinct task states and sensory as well as unobservable information. We summarize evidence supporting the existence of such state representations in rodent and human OFC and argue that forming these state representations provides a crucial scaffold that allows animals to efficiently perform decision making and reinforcement learning in high-dimensional and partially observable environments. Finally, we argue that our theory offers an integrating framework for linking the diversity of functions ascribed to OFC and is in line with its wide ranging connectivity.
2017
Cohen, J. D., Daw, N. D., Engelhardt, B., Hasson, U., Li, K., Niv, Y., Norman, K. A., et al. (2017). Computational approaches to fMRI analysis. Nature Neuroscience, 20(3), 304–313.
Gershman, S. J., Monfils, M.-H., Norman, K. A., & Niv, Y. (2017). The computational nature of memory modification. eLife, 6.
Retrieving a memory can modify its influence on subsequent behavior. We develop a computational theory of memory modification, according to which modification of a memory trace occurs through classical associative learning, but which memory trace is eligible for modification depends on a structure learning mechanism that discovers the units of association by segmenting the stream of experience into statistically distinct clusters (latent causes). New memories are formed when the structure learning mechanism infers that a new latent cause underlies current sensory observations. By the same token, old memories are modified when old and new sensory observations are inferred to have been generated by the same latent cause. We derive this framework from probabilistic principles, and present a computational implementation. Simulations demonstrate that our model can reproduce the major experimental findings from studies of memory modification in the Pavlovian conditioning literature.
DuBrow, S., Rouhani, N., Niv, Y., & Norman, K. A. (2017). Does mental context drift or shift? Current Opinion in Behavioral Sciences, 17, 141–146.
Theories of episodic memory have proposed that individual memory traces are linked together by a representation of context that drifts slowly over time. Recent data challenge the notion that contextual drift is always slow and passive. In particular, changes in one's external environment or internal model induce discontinuities in memory that are reflected in sudden changes in neural activity, suggesting that context can shift abruptly. Furthermore, context change effects are sensitive to top-down goals, suggesting that contextual drift may be an active process. These findings call for revising models of the role of context in memory, in order to account for abrupt contextual shifts and the controllable nature of context change.
Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic Interaction between Reinforcement Learning and Attention in Multidimensional Environments. Neuron, 93(2), 451–463.
Little is known about the relationship between attention and learning during decision making. Using eye tracking and multivariate pattern analysis of fMRI data, we measured participants' dimensional attention as they performed a trial-and-error learning task in which only one of three stimulus dimensions was relevant for reward at any given time. Analysis of participants' choices revealed that attention biased both value computation during choice and value update during learning. Value signals in the ventromedial prefrontal cortex and prediction errors in the striatum were similarly biased by attention. In turn, participants' focus of attention was dynamically modulated by ongoing learning. Attentional switches across dimensions correlated with activity in a frontoparietal attention network, which showed enhanced connectivity with the ventromedial prefrontal cortex between switches. Our results suggest a bidirectional interaction between attention and learning: attention constrains learning to relevant dimensions of the environment, while we learn what to attend to via trial and error.
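A minimal sketch of the kind of attention-weighted learning rule discussed above (the dimension and feature labels, and all parameter values, are assumptions for illustration): the value of a compound stimulus is an attention-weighted sum of its feature values, and the same attention weights gate the prediction-error update.

```python
import numpy as np

n_dims, n_feats = 3, 3                  # e.g., color, shape, texture; three features each
alpha = 0.3                             # learning rate (assumed)
w = np.full((n_dims, n_feats), 0.5)     # learned feature values
attn = np.ones(n_dims) / n_dims         # attention weights over dimensions

def stimulus_value(features, w, attn):
    # value of a compound stimulus = attention-weighted sum of its feature values
    return sum(attn[d] * w[d, f] for d, f in enumerate(features))

features = (0, 2, 1)                    # the chosen stimulus: one feature per dimension
reward = 1.0
rpe = reward - stimulus_value(features, w, attn)
for d, f in enumerate(features):
    w[d, f] += alpha * attn[d] * rpe    # attention also biases the value update

print(f"updated value of the chosen stimulus: {stimulus_value(features, w, attn):.2f}")
```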
Sharpe, M. J., Marchant, N. J., Whitaker, L. R., Richie, C. T., Zhang, Y. J., Campbell, E. J., Koivula, P. P., et al. (2017). Lateral Hypothalamic GABAergic Neurons Encode Reward Predictions that Are Relayed to the Ventral Tegmental Area to Regulate Learning. Current Biology, 27(14), 2089–2100.e5.
Eating is a learned process. Our desires for specific foods arise through experience. Both electrical stimulation and optogenetic studies have shown that increased activity in the lateral hypothalamus (LH) promotes feeding. Current dogma is that these effects reflect a role for LH neurons in the control of the core motivation to feed, and their activity comes under control of forebrain regions to elicit learned food-motivated behaviors. However, these effects could also reflect the storage of associative information about the cues leading to food in LH itself. Here, we present data from several studies that are consistent with a role for LH in learning. In the first experiment, we use a novel GAD-Cre rat to show that optogenetic inhibition of LH γ-aminobutyric acid (GABA) neurons restricted to cue presentation disrupts the rats' ability to learn that a cue predicts food without affecting subsequent food consumption. In the second experiment, we show that this manipulation also disrupts the ability of a cue to promote food seeking after learning. Finally, we show that inhibition of the terminals of the LH GABA neurons in the ventral tegmental area (VTA) facilitates learning about reward-paired cues. These results suggest that the LH GABA neurons are critical for storing and later disseminating information about reward-predictive cues.
Auchter, A., Cormack, L. K., Niv, Y., Gonzalez-Lima, F., & Monfils, M.-H. (2017). Reconsolidation-Extinction Interactions in Fear Memory Attenuation: The Role of Inter-Trial Interval Variability. Frontiers in Behavioral Neuroscience, 11.
2016
Eldar, E., Cohen, J. D., & Niv, Y. (2016). Amplified selectivity in cognitive processing implements the neural gain model of norepinephrine function. Behavioral and Brain Sciences, 39, e206.
Previous work has suggested that an interaction between local selective (e.g., glutamatergic) excitation and global gain modulation (via norepinephrine) amplifies selectivity in information processing. Mather et al. extend this existing theory by suggesting that localized gain modulation may further mediate this effect – an interesting prospect that invites new theoretical and experimental work.
Cai, M. B., & Schuck, N. W. (2016). A Bayesian method for reducing bias in neural representational similarity analysis. In D. D. Lee, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 4952–4960). Curran Associates, Inc.
Kurth-Nelson, Z., O'Doherty, J. P., Barch, D. M., Denève, S., Durstewitz, D., Frank, M. J., Gordon, J. A., et al. (2016). Computational Approaches for Studying Mechanisms of Psychiatric Disorders. In Computational Psychiatry. The MIT Press.
Vast spectra of biological and psychological processes are potentially involved in the mechanisms of psychiatric illness. Computational neuroscience brings a diverse toolkit to bear on understanding these processes. This chapter begins by organizing the many ways in which computational neuroscience may provide insight to the mechanisms of psychiatric illness. It then contextualizes the quest for deep mechanistic understanding through the perspective that even partial or nonmechanistic understanding can be applied productively. Finally, it questions the standards by which these approaches...
Eldar, E., Niv, Y., & Cohen, J. D. (2016). Do You See the Forest or the Tree? Neural Gain and Breadth Versus Focus in Perceptual Processing. Psychological Science, 27(12), 1632–1643.
When perceiving rich sensory information, some people may integrate its various aspects, whereas other people may selectively focus on its most salient aspects. We propose that neural gain modulates the trade-off between breadth and selectivity, such that high gain focuses perception on those aspects of the information that have the strongest, most immediate influence, whereas low gain allows broader integration of different aspects. We illustrate our hypothesis using a neural-network model of ambiguous-letter perception. We then report an experiment demonstrating that, as predicted by the model, pupil-diameter indices of higher gain are associated with letter perception that is more selectively focused on the letter's shape or, if primed, its semantic content. Finally, we report a recognition-memory experiment showing that the relationship between gain and selective processing also applies when the influence of different stimulus features is voluntarily modulated by task demands.
Arkadir, D., Radulescu, A., Raymond, D., Lubarr, N., Bressman, S. B., Mazzoni, P., & Niv, Y. (2016). DYT1 dystonia increases risk taking in humans. eLife, 5.
It has been difficult to link synaptic modification to overt behavioral changes. Rodent models of DYT1 dystonia, a motor disorder caused by a single gene mutation, demonstrate increased long-term potentiation and decreased long-term depression in corticostriatal synapses. Computationally, such asymmetric learning predicts risk taking in probabilistic tasks. Here we demonstrate abnormal risk taking in DYT1 dystonia patients, which is correlated with disease severity, thereby supporting striatal plasticity in shaping choice behavior in humans.
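To see why asymmetric plasticity predicts risk taking, consider a toy delta-rule learner with separate learning rates for positive and negative prediction errors (the rates and payoffs below are assumed for illustration, not the fitted model): when gains are learned from more strongly than losses, the variable option ends up valued above an equally rewarding sure option.

```python
import numpy as np

rng = np.random.default_rng(4)

def learned_value(rewards, alpha_pos, alpha_neg):
    """Delta-rule learning with separate rates for positive and negative RPEs."""
    q, history = 0.0, []
    for r in rewards:
        rpe = r - q
        q += (alpha_pos if rpe > 0 else alpha_neg) * rpe
        history.append(q)
    return np.mean(history[len(history) // 2:])   # steady-state average value

n = 5000
risky = rng.choice([2.0, 0.0], size=n)            # 50/50 gamble, expected value 1.0
safe = np.full(n, 1.0)                            # sure option, same expected value

# asymmetric updating (analogous to enhanced LTP / reduced LTD) inflates the
# learned value of the variable option relative to the sure one
q_risky = learned_value(risky, alpha_pos=0.4, alpha_neg=0.1)
q_safe = learned_value(safe, alpha_pos=0.4, alpha_neg=0.1)
print(f"learned value, risky: {q_risky:.2f}   safe: {q_safe:.2f}")
```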
Radulescu, A., Daniel, R., & Niv, Y. (2016). The effects of aging on the interaction between reinforcement learning and attention. Psychology and Aging, 31(7), 747–757.
Schuck, N. W., Cai, M. B., Wilson, R. C., & Niv, Y. (2016). Human Orbitofrontal Cortex Represents a Cognitive Map of State Space. Neuron, 91(6), 1402–1412.
Although the orbitofrontal cortex (OFC) has been studied intensely for decades, its precise functions have remained elusive. We recently hypothesized that the OFC contains a “cognitive map” of task space in which the current state of the task is represented, and this representation is especially critical for behavior when states are unobservable from sensory input. To test this idea, we apply pattern-classification techniques to neuroimaging data from humans performing a decision-making task with 16 states. We show that unobservable task states can be decoded from activity in OFC, and decoding accuracy is related to task performance and the occurrence of individual behavioral errors. Moreover, similarity between the neural representations of consecutive states correlates with behavioral accuracy in corresponding state transitions. These results support the idea that OFC represents a cognitive map of task space and establish the feasibility of decoding state representations in humans using non-invasive neuroimaging.
Eldar, E., Rutledge, R. B., Dolan, R. J., & Niv, Y. (2016). Mood as Representation of Momentum. Trends in Cognitive Sciences, 20(1), 15–24.
Experiences affect mood, which in turn affects subsequent experiences. Recent studies suggest two specific principles. First, mood depends on how recent reward outcomes differ from expectations. Second, mood biases the way we perceive outcomes (e.g., rewards), and this bias affects learning about those outcomes. We propose that this two-way interaction serves to mitigate inefficiencies in the application of reinforcement learning to real-world problems. Specifically, we propose that mood represents the overall momentum of recent outcomes, and its biasing influence on the perception of outcomes 'corrects' learning to account for environmental dependencies. We describe potential dysfunctions of this adaptive mechanism that might contribute to the symptoms of mood disorders.
Chan, S. C. Y., Niv, Y., & Norman, K. A. (2016). A probability distribution over latent causes, in the orbitofrontal cortex. Journal of Neuroscience, 36(30), 7817–7828.
The orbitofrontal cortex (OFC) has been implicated in both the representation of "state," in studies of reinforcement learning and decision making, and also in the representation of "schemas," in studies of episodic memory. Both of these cognitive constructs require a similar inference about the underlying situation or "latent cause" that generates our observations at any given time. The statistically optimal solution to this inference problem is to use Bayes' rule to compute a posterior probability distribution over latent causes. To test whether such a posterior probability distribution is represented in the OFC, we tasked human participants with inferring a probability distribution over four possible latent causes, based on their observations. Using fMRI pattern similarity analyses, we found that BOLD activity in the OFC is best explained as representing the (log-transformed) posterior distribution over latent causes. Furthermore, this pattern explained OFC activity better than other task-relevant alternatives, such as the most probable latent cause, the most recent observation, or the uncertainty over latent causes.
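The inference being probed is simply Bayes' rule over a small discrete set of causes. A toy version (the observation likelihoods and observation sequence below are made up for illustration) looks like this:

```python
import numpy as np

# P(observation | cause) for four hypothetical latent causes and two observations (A, B)
likelihood = np.array([
    [0.8, 0.2],   # cause 1
    [0.6, 0.4],   # cause 2
    [0.4, 0.6],   # cause 3
    [0.2, 0.8],   # cause 4
])
posterior = np.ones(4) / 4            # uniform prior over the four causes

for obs in [0, 0, 1, 0]:              # a short observation sequence (0 = A, 1 = B)
    posterior = posterior * likelihood[:, obs]
    posterior /= posterior.sum()      # Bayes' rule
    print(np.round(posterior, 2))

# the fMRI analysis asks whether OFC patterns track this distribution (log-transformed)
log_posterior = np.log(posterior)
```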
Niv, Y., & Langdon, A. J. (2016). Reinforcement learning with Marr. Current Opinion in Behavioral Sciences, 11, 67–73.
To many, the poster child for David Marr's famous three levels of scientific inquiry is reinforcement learning – a computational theory of reward optimization, which readily prescribes algorithmic solutions that evidence striking resemblance to signals found in the brain, suggesting a straightforward neural implementation. Here we review questions that remain open at each level of analysis, concluding that the path forward to their resolution calls for inspiration across levels, rather than a focus on mutual constraints.
Takahashi, Y. K., Langdon, A. J., Niv, Y., & Schoenbaum, G. (2016). Temporal Specificity of Reward Prediction Errors Signaled by Putative Dopamine Neurons in Rat VTA Depends on Ventral Striatum. Neuron, 91(1), 182–193.
Dopamine neurons signal reward prediction errors. This requires accurate reward predictions. It has been suggested that the ventral striatum provides these predictions. Here we tested this hypothesis by recording from putative dopamine neurons in the VTA of rats performing a task in which prediction errors were induced by shifting reward timing or number. In controls, the neurons exhibited error signals in response to both manipulations. However, dopamine neurons in rats with ipsilateral ventral striatal lesions exhibited errors only to changes in number and failed to respond to changes in timing of reward. These results, supported by computational modeling, indicate that predictions about the temporal specificity and the number of expected reward are dissociable and that dopaminergic prediction-error signals rely on the ventral striatum for the former but not the latter.
2015
Gershman, S. J., Norman, K. A., & Niv, Y. (2015). Discovering latent causes in reinforcement learning. Current Opinion in Behavioral Sciences , 5, 43–50. PDFAbstract
Effective reinforcement learning hinges on having an appropriate state representation. But where does this representation come from? We argue that the brain discovers state representations by trying to infer the latent causal structure of the task at hand, and assigning each latent cause to a separate state. In this paper, we review several implications of this latent cause framework, with a focus on Pavlovian conditioning. The framework suggests that conditioning is not the acquisition of associations between cues and outcomes, but rather the acquisition of associations between latent causes and observable stimuli. A latent cause interpretation of conditioning enables us to begin answering questions that have frustrated classical theories: Why do extinguished responses sometimes return? Why do stimuli presented in compound sometimes summate and sometimes do not? Beyond conditioning, the principles of latent causal inference may provide a general theory of structure learning across cognitive domains.
Niv, Y., Langdon, A. J., & Radulescu, A. (2015). A free-choice premium in the basal ganglia. Trends in Cognitive Sciences , 19 (1), 4–5. PDFAbstract
Apparently, the act of free choice confers value: when selecting between an item that you had previously chosen and an identical item that you had been forced to take, the former is often preferred. What could be the neural underpinnings of this free-choice bias in decision making? An elegant study recently published in Neuron suggests that enhanced reward learning in the basal ganglia may be the culprit.
Daniel, R., Schuck, N. W., & Niv, Y. (2015). How to divide and conquer the world, one step at a time. Proceedings of the National Academy of Sciences , 112 (10), 2929–2930. PDF
Eldar, E., & Niv, Y. (2015). Interaction between emotional state and learning underlies mood instability. Nature Communications , 6 (1), 6149. PDFAbstract
Intuitively, good and bad outcomes affect our emotional state, but whether the emotional state feeds back onto the perception of outcomes remains unknown. Here, we use behaviour and functional neuroimaging of human participants to investigate this bidirectional interaction, by comparing the evaluation of slot machines played before and after an emotion-impacting wheel-of-fortune draw. Results indicate that self-reported mood instability is associated with a positive-feedback effect of emotional state on the perception of outcomes. We then use theoretical simulations to demonstrate that such positive feedback would result in mood destabilization. Taken together, our results suggest that the interaction between emotional state and learning may play a significant role in the emergence of mood instability.
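The bidirectional loop described here lends itself to a simple illustration. The sketch below is not the model fitted in the paper; it is a minimal simulation under assumed functional forms (delta-rule value learning, mood as a leaky integral of prediction errors, and a linear bias of mood on perceived outcomes) of how stronger positive feedback from mood onto outcome perception amplifies mood fluctuations.

```python
import numpy as np

def simulate_mood(feedback_gain, n_trials=2000, alpha=0.1, decay=0.9, seed=0):
    """Toy simulation (illustrative forms, not the paper's fitted model):
    mood integrates prediction errors and, in turn, biases perceived outcomes."""
    rng = np.random.default_rng(seed)
    value, mood = 0.0, 0.0
    moods = np.empty(n_trials)
    for t in range(n_trials):
        reward = rng.normal(0.0, 1.0)              # objective outcome
        perceived = reward + feedback_gain * mood  # mood-biased perception
        rpe = perceived - value                    # reward prediction error
        value += alpha * rpe                       # delta-rule value update
        mood = decay * mood + (1 - decay) * rpe    # mood as a leaky integral of RPEs
        moods[t] = mood
    return moods.std()

for gain in (0.0, 0.8, 1.6):
    print(f"feedback gain {gain}: mood fluctuations (s.d.) = {simulate_mood(gain):.2f}")
```

With these hypothetical parameters, mood fluctuations grow as the feedback gain increases, consistent with the paper's argument that positive feedback between emotional state and learning destabilizes mood.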
Wilson, R. C., & Niv, Y. (2015). Is Model Fitting Necessary for Model-Based fMRI? PLoS Comput Biol , 11 (6), e1004237. PDFAbstract
Model-based analysis of fMRI data is an important tool for investigating the computational role of different brain regions. With this method, theoretical models of behavior can be leveraged to find the brain structures underlying variables from specific algorithms, such as prediction errors in reinforcement learning. One potential weakness with this approach is that models often have free parameters and thus the results of the analysis may depend on how these free parameters are set. In this work we asked whether this hypothetical weakness is a problem in practice. We first developed general closed-form expressions for the relationship between results of fMRI analyses using different regressors, e.g., one corresponding to the true process underlying the measured data and one a model-derived approximation of the true generative regressor. Then, as a specific test case, we examined the sensitivity of model-based fMRI to the learning rate parameter in reinforcement learning, both in theory and in two previously-published datasets. We found that even gross errors in the learning rate lead to only minute changes in the neural results. Our findings thus suggest that precise model fitting is not always necessary for model-based fMRI. They also highlight the difficulty in using fMRI data for arbitrating between different models or model parameters. While these specific results pertain only to the effect of learning rate in simple reinforcement learning models, we provide a template for testing for effects of different parameters in other models.
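One intuition for this robustness can be shown in a few lines: prediction-error regressors generated with different learning rates on the same outcome sequence tend to be highly correlated, so swapping one for another changes the fMRI analysis little. The sketch below is only an illustration of that point (a simple delta-rule learner on a drifting Bernoulli reward sequence; all parameters are assumptions, not the datasets analyzed in the paper).

```python
import numpy as np

def rpe_series(rewards, alpha):
    """Prediction errors from a simple delta-rule learner."""
    v, pes = 0.0, []
    for r in rewards:
        pe = r - v
        v += alpha * pe
        pes.append(pe)
    return np.array(pes)

rng = np.random.default_rng(1)
# Bernoulli rewards whose underlying probability drifts across 500 trials
p = np.clip(0.5 + np.cumsum(rng.normal(0, 0.03, 500)), 0, 1)
rewards = (rng.random(500) < p).astype(float)

true_pe = rpe_series(rewards, alpha=0.3)   # hypothetical "true" learning rate
for wrong_alpha in (0.1, 0.5, 0.9):
    r = np.corrcoef(true_pe, rpe_series(rewards, alpha=wrong_alpha))[0, 1]
    print(f"alpha = {wrong_alpha}: correlation with the alpha = 0.3 regressor = {r:.2f}")
```

The correlations are typically high, which illustrates both why a mis-set learning rate produces nearly the same neural regressor and why fMRI data struggle to arbitrate between such parameter values.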
Gershman, S. J., & Niv, Y. (2015). Novelty and Inductive Generalization in Human Reinforcement Learning. Topics in Cognitive Science , 7 (3), 391–415. PDFAbstract
In reinforcement learning (RL), a decision maker searching for the most rewarding option is often faced with the question: What is the value of an option that has never been tried before? One way to frame this question is as an inductive problem: How can I generalize my previous experience with one set of options to a novel option? We show how hierarchical Bayesian inference can be used to solve this problem, and we describe an equivalence between the Bayesian model and temporal difference learning algorithms that have been proposed as models of RL in humans and animals. According to our view, the search for the best option is guided by abstract knowledge about the relationships between different options in an environment, resulting in greater search efficiency compared to traditional RL algorithms previously applied to human cognition. In two behavioral experiments, we test several predictions of our model, providing evidence that humans learn and exploit structured inductive knowledge to make predictions about novel options. In light of this model, we suggest a new interpretation of dopaminergic responses to novelty.
Niv, Y., Daniel, R., Geana, A., Gershman, S. J., Leong, Y. C., Radulescu, A., & Wilson, R. C. (2015). Reinforcement Learning in Multidimensional Environments Relies on Attention Mechanisms. Journal of Neuroscience , 35 (21), 8145–8157. PDFAbstract
In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this "representation learning" process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the "curse of dimensionality" in reinforcement learning.
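To make the hypothesized interaction between attention and reinforcement learning concrete, the following sketch implements one simplified member of this model family (not the specific models fit in the paper): values are learned per feature, a softmax attention weight over dimensions scales both the value estimate and the update, and over trials attention concentrates on the reward-relevant dimension. The task structure, functional forms, and parameters are illustrative assumptions, and the task is reduced to prediction only (no choice between stimuli).

```python
import numpy as np

rng = np.random.default_rng(2)
n_dims, n_feats, n_trials = 3, 3, 1000
target_dim, target_feat = 0, 1      # hypothetical reward-relevant feature
alpha, eta = 0.3, 5.0               # learning rate and attention sharpness (assumed)

v = np.zeros((n_dims, n_feats))     # one learned value per feature
attn = np.ones(n_dims) / n_dims
for t in range(n_trials):
    stim = rng.integers(n_feats, size=n_dims)          # one feature per dimension
    # attention: softmax over how informative each dimension currently looks
    informativeness = v.max(axis=1) - v.mean(axis=1)
    attn = np.exp(eta * informativeness)
    attn /= attn.sum()
    value = (attn * v[np.arange(n_dims), stim]).sum()  # attention-weighted prediction
    p_reward = 0.75 if stim[target_dim] == target_feat else 0.25
    reward = float(rng.random() < p_reward)
    rpe = reward - value
    v[np.arange(n_dims), stim] += alpha * attn * rpe   # attention-gated update

print("final attention over dimensions:", np.round(attn, 2))
```

In this toy version, attention typically ends up highest on the dimension that actually predicts reward, which is the sense in which attention and trial-and-error learning jointly reduce the task's effective dimensionality.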
Dunsmoor, J. E., Niv, Y., Daw, N. D., & Phelps, E. A. (2015). Rethinking Extinction. Neuron , 88 (1), 47–63. PDFAbstract
Extinction serves as the leading theoretical framework and experimental model to describe how learned behaviors diminish through absence of anticipated reinforcement. In the past decade, extinction has moved beyond the realm of associative learning theory and behavioral experimentation in animals and has become a topic of considerable interest in the neuroscience of learning, memory, and emotion. Here, we review research and theories of extinction, both as a learning process and as a behavioral technique, and consider whether traditional understandings warrant a re-examination. We discuss the neurobiology, cognitive factors, and major computational theories, and revisit the predominant view that extinction results in new learning that interferes with expression of the original memory. Additionally, we reconsider the limitations of extinction as a technique to prevent the relapse of maladaptive behavior and discuss novel approaches, informed by contemporary theoretical advances, that augment traditional extinction methods to target and potentially alter maladaptive memories.
Sharpe, M. J., Wikenheiser, A. M., Niv, Y., & Schoenbaum, G. (2015). The State of the Orbitofrontal Cortex. Neuron , 88 (6), 1075–1077. PDFAbstract
State representation is fundamental to behavior. However, identifying the true state of the world is challenging when explicit cues are ambiguous. Here, Bradfield and colleagues show that the medial OFC is critical for using associative information to discriminate ambiguous states.
2014
Geana, A., & Niv, Y. (2014). Causal model comparison shows that human representation learning is not Bayesian. Cold Spring Harbor Symposia on Quantitative Biology , 79, 161–168. PDFAbstract
How do we learn what features of our multidimensional environment are relevant in a given task? To study the computational process underlying this type of "representation learning," we propose a novel method of causal model comparison. Participants played a probabilistic learning task that required them to identify one relevant feature among several irrelevant ones. To compare between two models of this learning process, we ran each model alongside the participant during task performance, making predictions regarding the values underlying the participant's choices in real time. To test the validity of each model's predictions, we used the predicted values to try to perturb the participant's learning process: We crafted stimuli to either facilitate or hinder comparison between the most highly valued features. A model whose predictions coincide with the learned values in the participant's mind is expected to be effective in perturbing learning in this way, whereas a model whose predictions stray from the true learning process should not. Indeed, we show that in our task a reinforcement-learning model could help or hurt participants' learning, whereas a Bayesian ideal observer model could not. Beyond informing us about the notably suboptimal (but computationally more tractable) substrates of human representation learning, our manipulation suggests a sensitive method for model comparison, which allows us to change the course of people's learning in real time.
Soto, F. A., Gershman, S. J., & Niv, Y. (2014). Explaining compound generalization in associative and causal learning through rational principles of dimensional generalization. Psychological Review , 121 (3), 526–558. PDFAbstract
How do we apply learning from one situation to a similar, but not identical, situation? The principles governing the extent to which animals and humans generalize what they have learned about certain stimuli to novel compounds containing those stimuli vary depending on a number of factors. Perhaps the best studied among these factors is the type of stimuli used to generate compounds. One prominent hypothesis is that different generalization principles apply depending on whether the stimuli in a compound are similar or dissimilar to each other. However, the results of many experiments cannot be explained by this hypothesis. Here, we propose a rational Bayesian theory of compound generalization that uses the notion of consequential regions, first developed in the context of rational theories of multidimensional generalization, to explain the effects of stimulus factors on compound generalization. The model explains a large number of results from the compound generalization literature, including the influence of stimulus modality and spatial contiguity on the summation effect, the lack of influence of stimulus factors on summation with a recovered inhibitor, the effect of spatial position of stimuli on the blocking effect, the asymmetrical generalization decrement in overshadowing and external inhibition, and the conditions leading to a reliable external inhibition effect. By integrating rational theories of compound and dimensional generalization, our model provides the first comprehensive computational account of the effects of stimulus factors on compound generalization, including spatial and temporal contiguity between components, which have posed long-standing problems for rational theories of associative and causal learning.
Solway, A., Diuk, C., Córdova, N., Yee, D., Barto, A. G., Niv, Y., & Botvinick, M. M. (2014). Optimal Behavioral Hierarchy. PLoS Computational Biology , 10 (8), e1003779. PDFAbstract
Human behavior has long been recognized to display hierarchical structure: actions fit together into subtasks, which cohere into extended goal-directed activities. Arranging actions hierarchically has well established benefits, allowing behaviors to be represented efficiently by the brain, and allowing solutions to new tasks to be discovered easily. However, these payoffs depend on the particular way in which actions are organized into a hierarchy, the specific way in which tasks are carved up into subtasks. We provide a mathematical account for what makes some hierarchies better than others, an account that allows an optimal hierarchy to be identified for any set of tasks. We then present results from four behavioral experiments, suggesting that human learners spontaneously discover optimal action hierarchies.
Wilson, R. C., Takahashi, Y. K., Schoenbaum, G., & Niv, Y. (2014). Orbitofrontal Cortex as a Cognitive Map of Task Space. Neuron , 81 (2), 267–279. PDFAbstract
Orbitofrontal cortex (OFC) has long been known to play an important role in decision making. However, the exact nature of that role has remained elusive. Here, we propose a unifying theory of OFC function. We hypothesize that OFC provides an abstraction of currently available information in the form of a labeling of the current task state, which is used for reinforcement learning (RL) elsewhere in the brain. This function is especially critical when task states include unobservable information, for instance, from working memory. We use this framework to explain classic findings in reversal learning, delayed alternation, extinction, and devaluation as well as more recent findings showing the effect of OFC lesions on the firing of dopaminergic neurons in ventral tegmental area (VTA) in rodents performing an RL task. In addition, we generate a number of testable experimental predictions that can distinguish our theory from other accounts of OFC function. ©2014 Elsevier Inc.
Gershman, S. J., Radulescu, A., Norman, K. A., & Niv, Y. (2014). Statistical Computations Underlying the Dynamics of Memory Updating. PLoS Computational Biology , 10 (11), e1003939. PDFAbstract
Psychophysical and neurophysiological studies have suggested that memory is not simply a carbon copy of our experience: Memories are modified or new memories are formed depending on the dynamic structure of our experience, and specifically, on how gradually or abruptly the world changes. We present a statistical theory of memory formation in a dynamic environment, based on a nonparametric generalization of the switching Kalman filter. We show that this theory can qualitatively account for several psychophysical and neural phenomena, and present results of a new visual memory experiment aimed at testing the theory directly. Our experimental findings suggest that humans can use temporal discontinuities in the structure of the environment to determine when to form new memory traces. The statistical perspective we offer provides a coherent account of the conditions under which new experience is integrated into an old memory versus forming a new memory, and shows that memory formation depends on inferences about the underlying structure of our experience.
2013
Diuk, C., Schapiro, A., Córdova, N., Ribas-Fernandes, J. J. F., Niv, Y., & Botvinick, M. M. (2013). Divide and conquer: Hierarchical reinforcement learning and task decomposition in humans. Computational and Robotic Models of the Hierarchical Organization of Behavior (Vol. 9783642398, pp. 271–291). PDFAbstract
The field of computational reinforcement learning (RL) has proved extremely useful in research on human and animal behavior and brain function. However, the simple forms of RL considered in most empirical research do not scale well, making their relevance to complex, real-world behavior unclear. In computational RL, one strategy for addressing the scaling problem is to introduce hierarchical structure, an approach that has intriguing parallels with human behavior. We have begun to investigate the potential relevance of hierarchical RL (HRL) to human and animal behavior and brain function. In the present chapter, we first review two results that show the existence of neural correlates to key predictions from HRL. Then, we focus on one aspect of this work, which deals with the question of how action hierarchies are initially established. Work in HRL suggests that hierarchy learning is accomplished by identifying useful subgoal states, and that this might in turn be accomplished through a structural analysis of the given task domain. We review results from a set of behavioral and neuroimaging experiments, in which we have investigated the relevance of these ideas to human learning and decision making.
Eldar, E., Cohen, J. D., & Niv, Y. (2013). The effects of neural gain on attention and learning. Nature Neuroscience , 16 (8), 1146–1153. PDFAbstract
Nature Neuroscience 16, 1146 (2013). doi:10.1038/nn.3428
Gershman, S. J., Jones, C. E., Norman, K. A., Monfils, M.-H., & Niv, Y. (2013). Gradual extinction prevents the return of fear: implications for the discovery of state. Front Behav Neurosci , 7, 164. PDFAbstract
Fear memories are notoriously difficult to erase, often recovering over time. The longstanding explanation for this finding is that, in extinction training, a new memory is formed that competes with the old one for expression but does not otherwise modify it. This explanation is at odds with traditional models of learning such as Rescorla-Wagner and reinforcement learning. A possible reconciliation that was recently suggested is that extinction training leads to the inference of a new state that is different from the state that was in effect in the original training. This solution, however, raises a new question: under what conditions are new states, or new memories formed? Theoretical accounts implicate persistent large prediction errors in this process. As a test of this idea, we reasoned that careful design of the reinforcement schedule during extinction training could reduce these prediction errors enough to prevent the formation of a new memory, while still decreasing reinforcement sufficiently to drive modification of the old fear memory. In two Pavlovian fear-conditioning experiments, we show that gradually reducing the frequency of aversive stimuli, rather than eliminating them abruptly, prevents the recovery of fear. This finding has important implications for theories of state discovery in reinforcement learning.
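The core design idea, that gradually thinning reinforcement keeps prediction errors small enough to avoid inferring a new state, can be illustrated with a simple delta-rule calculation. This is a sketch of the logic only, not the latent-cause model or the fear-conditioning data themselves; the schedule lengths and learning rate are arbitrary assumptions.

```python
import numpy as np

def max_abs_pe(reinforcement_probs, alpha=0.3):
    """Delta-rule associative strength; returns the largest prediction error
    encountered during extinction (expected outcomes used, for illustration)."""
    v, worst = 1.0, 0.0                    # start fully conditioned
    for p in reinforcement_probs:
        pe = p - v                         # expected outcome minus prediction
        worst = max(worst, abs(pe))
        v += alpha * pe
    return worst

abrupt = np.zeros(24)                      # reinforcement removed all at once
gradual = np.linspace(1.0, 0.0, 24)        # reinforcement phased out over trials
print(f"largest |PE|, abrupt extinction:  {max_abs_pe(abrupt):.2f}")
print(f"largest |PE|, gradual extinction: {max_abs_pe(gradual):.2f}")
```

Under the abrupt schedule the very first extinction trial produces a maximal prediction error, whereas the gradual schedule keeps every prediction error small, which is the condition the paper argues prevents the inference of a new latent state.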
Diuk, C., Tsai, K., Wallis, J., Botvinick, M. M., & Niv, Y. (2013). Hierarchical Learning Induces Two Simultaneous, But Separable, Prediction Errors in Human Basal Ganglia. Journal of Neuroscience , 33 (13), 5797–5805. PDFAbstract
Studies suggest that dopaminergic neurons report a unitary, global reward prediction error signal. However, learning in complex real-life tasks, in particular tasks that show hierarchical structure, requires multiple prediction errors that may coincide in time. We used functional neuroimaging to measure prediction error signals in humans performing such a hierarchical task involving simultaneous, uncorrelated prediction errors. Analysis of signals in a priori anatomical regions of interest in the ventral striatum and the ventral tegmental area indeed evidenced two simultaneous, but separable, prediction error signals corresponding to the two levels of hierarchy in the task. This result suggests that suitably designed tasks may reveal a more intricate pattern of firing in dopaminergic neurons. Moreover, the need for downstream separation of these signals implies possible limitations on the number of different task levels that we can learn about simultaneously.
Schoenbaum, G., Stalnaker, T. A., & Niv, Y. (2013). How Did the Chicken Cross the Road? With Her Striatal Cholinergic Interneurons, Of Course. Neuron , 79 (1), 3–6. PDFAbstract
Recognizing when the world changes is fundamental for normal learning. In this issue of Neuron, Bradfield et al. (2013) show that cholinergic interneurons in dorsomedial striatum are critical to the process whereby new states of the world are appropriately registered and retrieved during associative learning.
Niv, Y. (2013). Neuroscience: Dopamine ramps up. Nature , 500 (7464), 533–535. PDFAbstract
We thought we had figured out dopamine, a neuromodulator involved in everything from learning to addiction. But the finding that dopamine levels ramp up as rats navigate to a reward may overthrow current theories. See Letter p.575
Gershman, S. J., & Niv, Y. (2013). Perceptual estimation obeys Occam's razor. Frontiers in Psychology , 4 (SEP), 623. PDFAbstract
Theoretical models of unsupervised category learning postulate that humans “invent” categories to accommodate new patterns, but tend to group stimuli into a small number of categories. This “Occam's razor” principle is motivated by normative rules of statistical inference. If categories influence perception, then one should find effects of category invention on simple perceptual estimation. In a series of experiments, we tested this prediction by asking participants to estimate the number of colored circles on a computer screen, with the number of circles drawn from a color-specific distribution. When the distributions associated with each color overlapped substantially, participants' estimates were biased toward values intermediate between the two means, indicating that subjects ignored the color of the circles and grouped different-colored stimuli into one perceptual category. These data suggest that humans favor simpler explanations of sensory inputs. In contrast, when the distributions associated with each color overlapped minimally, the bias was reduced (i.e., the estimates for each color were closer to the true means), indicating that sensory evidence for more complex explanations can override the simplicity bias. We present a rational analysis of our task, showing how these qualitative patterns can arise from Bayesian computations.
2012
Gershman, S. J., & Niv, Y. (2012). Exploring a latent cause theory of classical conditioning. Learn Behav , 40 (3), 255–268. PDFAbstract
We frame behavior in classical conditioning experiments as the product of normative statistical inference. According to this theory, animals learn an internal model of their environment from experience. The basic building blocks of this internal model are latent causes: explanatory constructs inferred by the animal that partition observations into coherent clusters. Generalization of conditioned responding from one cue to another arises from the animal's inference that the cues were generated by the same latent cause. Through a wide range of simulations, we demonstrate where the theory succeeds and where it fails as a general account of classical conditioning.
Lucantonio, F., Stalnaker, T. A., Shaham, Y., Niv, Y., & Schoenbaum, G. (2012). The impact of orbitofrontal dysfunction on cocaine addiction. Nature Neuroscience , 15 (3), 358–366. PDFAbstract
Cocaine addiction is characterized by poor judgment and maladaptive decision-making. Here we review evidence implicating the orbitofrontal cortex in such behavior. This evidence suggests that cocaine-induced changes in orbitofrontal cortex disrupt the representation of states and transition functions that form the basis of flexible and adaptive 'model-based' behavioral control. By impairing this function, cocaine exposure leads to an overemphasis on less flexible, maladaptive 'model-free' control systems. We propose that such an effect accounts for the complex pattern of maladaptive behaviors associated with cocaine addiction.
Wilson, R. C., & Niv, Y. (2012). Inferring Relevance in a Changing World. Frontiers in Human Neuroscience , 5 (JANUARY 2012), 189. PDFAbstract
Reinforcement learning models of human and animal learning usually concentrate on how we learn the relationship between different stimuli or actions and rewards. However, in real-world situations "stimuli" are ill-defined. On the one hand, our immediate environment is extremely multidimensional. On the other hand, in every decision making scenario only a few aspects of the environment are relevant for obtaining reward, while most are irrelevant. Thus a key question is how do we learn these relevant dimensions, that is, how do we learn what to learn about? We investigated this process of "representation learning" experimentally, using a task in which one stimulus dimension was relevant for determining reward at each point in time. As in real life situations, in our task the relevant dimension can change without warning, adding ever-present uncertainty engendered by a constantly changing environment. We show that human performance on this task is better described by a suboptimal strategy based on selective attention and serial-hypothesis-testing rather than a normative strategy based on probabilistic inference. From this, we conjecture that the problem of inferring relevance in general scenarios is too computationally demanding for the brain to solve optimally. As a result the brain utilizes approximations, employing these even in simplified scenarios in which optimal representation learning is tractable, such as the one in our experiment.
Niv, Y., Edlund, J. A., Dayan, P. D., & O'Doherty, J. P. (2012). Neural Prediction Errors Reveal a Risk-Sensitive Reinforcement-Learning Process in the Human Brain. Journal of Neuroscience , 32 (2), 551–562. PDFAbstract
Humans and animals are exquisitely, though idiosyncratically, sensitive to risk or variance in the outcomes of their actions. Economic, psychological, and neural aspects of this are well studied when information about risk is provided explicitly. However, we must normally learn about outcomes from experience, through trial and error. Traditional models of such reinforcement learning focus on learning about the mean reward value of cues and ignore higher order moments such as variance. We used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk. Our analysis focused on anatomically delineated regions of a priori interest in the nucleus accumbens, where blood oxygenation level-dependent (BOLD) signals have been suggested as correlating with quantities derived from reinforcement learning. We first provide unbiased evidence that the raw BOLD signal in these regions corresponds closely to a reward prediction error. We then derive from this signal the learned values of cues that predict rewards of equal mean but different variance and show that these values are indeed modulated by experienced risk. Moreover, a close neurometric-psychometric coupling exists between the fluctuations of the experience-based evaluations of risky options that we measured neurally and the fluctuations in behavioral risk aversion. This suggests that risk sensitivity is integral to human learning, illuminating economic models of choice, neuroscientific models of affective learning, and the workings of the underlying neural mechanisms.
2011
Niv, Y., & Chan, S. (2011). On the value of information and other rewards. Nature Neuroscience , 14 (9), 1095–1097. PDFAbstract
Knowledge is not just power. Even if advance information cannot influence an upcoming event, people (and animals) prefer to know ahead of time what the outcome will be. According to the firing patterns of neurons in the lateral habenula, from the brain's perspective, knowledge is also water—or at least its equivalent in terms of reward.
Takahashi, Y. K., Roesch, M. R., Wilson, R. C., Toreson, K., O'Donnell, P., Niv, Y., & Schoenbaum, G. (2011). Expectancy-related changes in firing of dopamine neurons depend on orbitofrontal cortex. Nature Neuroscience , 14 (12), 1590–1597. PDFAbstract
Nature Neuroscience 14, 1590 (2011). doi:10.1038/nn.2957
Eldar, E., Morris, G., & Niv, Y. (2011). The effects of motivation on response rate: A hidden semi-Markov model analysis of behavioral dynamics. Journal of Neuroscience Methods , 201 (1), 251–261. PDFAbstract
A central goal of neuroscience is to understand how neural dynamics bring about the dynamics of behavior. However, neural and behavioral measures are noisy, requiring averaging over trials and subjects. Unfortunately, averaging can obscure the very dynamics that we are interested in, masking abrupt changes and artificially creating gradual processes. We develop a hidden semi-Markov model for precisely characterizing dynamic processes and their alteration due to experimental manipulations. This method takes advantage of multiple trials and subjects without compromising the information available in individual events within a trial. We apply our model to studying the effects of motivation on response rates, analyzing data from hungry and sated rats trained to press a lever to obtain food rewards on a free-operant schedule. Our method can accurately account for punctate changes in the rate of responding and for sequential dependencies between responses. It is ideal for inferring the statistics of underlying response rates and the probability of switching from one response rate to another. Using the model, we show that hungry rats have more distinct behavioral states that are characterized by high rates of responding and they spend more time in these high-press-rate states. Moreover, hungry rats spend less time in, and have fewer distinct states that are characterized by a lack of responding (Waiting/Eating states). These results demonstrate the utility of our analysis method, and provide a precise quantification of the effects of motivation on response rates. ©2011 Elsevier B.V.
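As a rough illustration of the kind of data-generating process the model is built around (not the inference algorithm or the fitted parameters from the paper), the sketch below simulates lever presses from a toy two-state semi-Markov process in which each state has its own press rate and dwell-time distribution; the "hungry" parameterization simply spends longer in the high-press-rate state. All rates, dwell times, and distributional forms are assumptions for illustration.

```python
import numpy as np

def simulate_presses(rate_high, rate_low, dwell_high, dwell_low,
                     total_time=600.0, seed=3):
    """Lever-press times from a toy two-state semi-Markov process:
    each state has its own press rate and gamma-distributed dwell time."""
    rng = np.random.default_rng(seed)
    t, state, presses = 0.0, "high", []
    while t < total_time:
        mean_dwell = dwell_high if state == "high" else dwell_low
        rate = rate_high if state == "high" else rate_low
        dwell = rng.gamma(shape=4.0, scale=mean_dwell / 4.0)   # state duration
        n = rng.poisson(rate * dwell)                          # presses in this state
        presses.extend(np.sort(t + rng.random(n) * dwell))
        t += dwell
        state = "low" if state == "high" else "high"
    return np.array(presses)

# Hypothetical parameters: "hungry" dwells longer in the high-press-rate state
hungry = simulate_presses(1.5, 0.05, dwell_high=40.0, dwell_low=10.0)
sated = simulate_presses(1.5, 0.05, dwell_high=10.0, dwell_low=40.0)
print(f"presses in 10 min: hungry {len(hungry)}, sated {len(sated)}")
```

Averaging such data over trials would smear the abrupt switches between states into an apparently gradual change in rate, which is the problem the hidden semi-Markov analysis is designed to avoid.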
Ribas-Fernandes, J. J. F., Solway, A., Diuk, C., McGuire, J. T., Barto, A. G., Niv, Y., & Botvinick, M. M. (2011). A neural signature of hierarchical reinforcement learning. Neuron , 71 (2), 370–379. PDFAbstract
Human behavior displays hierarchical structure: simple actions cohere into subtask sequences, which work together to accomplish overall task goals. Although the neural substrates of such hierarchy have been the target of increasing research, they remain poorly understood. We propose that the computations supporting hierarchical behavior may relate to those in hierarchical reinforcement learning (HRL), a machine-learning framework that extends reinforcement-learning mechanisms into hierarchical domains. To test this, we leveraged a distinctive prediction arising from HRL. In ordinary reinforcement learning, reward prediction errors are computed when there is an unanticipated change in the prospects for accomplishing overall task goals. HRL entails that prediction errors should also occur in relation to task subgoals. In three neuroimaging studies we observed neural responses consistent with such subgoal-related reward prediction errors, within structures previously implicated in reinforcement learning. The results reported support the relevance of HRL to the neural processes underlying hierarchical behavior.
McDannald, M. A., Lucantonio, F., Burke, K. A., Niv, Y., & Schoenbaum, G. (2011). Ventral Striatum and Orbitofrontal Cortex Are Both Required for Model-Based, But Not Model-Free, Reinforcement Learning. Journal of Neuroscience , 31 (7), 2700–2705. PDFAbstract
In many cases, learning is thought to be driven by differences between the value of rewards we expect and rewards we actually receive. Yet learning can also occur when the identity of the reward we receive is not as expected, even if its value remains unchanged. Learning from changes in reward identity implies access to an internal model of the environment, from which information about the identity of the expected reward can be derived. As a result, such learning is not easily accounted for by model-free reinforcement learning theories such as temporal difference reinforcement learning (TDRL), which predicate learning on changes in reward value, but not identity. Here, we used unblocking procedures to assess learning driven by value- versus identity-based prediction errors. Rats were trained to associate distinct visual cues with different food quantities and identities. These cues were subsequently presented in compound with novel auditory cues and the reward quantity or identity was selectively changed. Unblocking was assessed by presenting the auditory cues alone in a probe test. Consistent with neural implementations of TDRL models, we found that the ventral striatum was necessary for learning in response to changes in reward value. However, this area, along with orbitofrontal cortex, was also required for learning driven by changes in reward identity. This observation requires that existing models of TDRL in the ventral striatum be modified to include information about the specific features of expected outcomes derived from model-based representations, and that the role of orbitofrontal cortex in these models be clearly delineated.
2010
Wilson, R. C., Takahashi, Y. K., Roesch, M. R., Stalnaker, T. A., Schoenbaum, G., & Niv, Y. (2010). A Computational Model of the Role of Orbitofrontal Cortex and Ventral Striatum in Signalling Reward Expectancy in Reinforcement Learning. In Society for Neuroscience Abstracts. PDFAbstract
Both the orbitofrontal cortex (OFC) and ventral striatum (VS) have been implicated in signalling reward expectancies. However, exactly what roles these two disparate structures play, and how they are different is very much an open question. Recent results from the Schoenbaum lab (Takahashi et al., this meeting) describing the detailed effect of OFC lesions on putative reward prediction error signalling by midbrain dopaminergic neurons of rats, point to one possible delineation. Here we describe a reinforcement learning (RL) model of the Takahashi et al. results that suggests related, but slightly different, roles for the OFC and VS in signalling reward expectancies. We present an actor/critic model with one actor (putatively the dorsal striatum) and two critics (OFC and VS). We hypothesise that the VS critic learns state values relatively slowly and in a model-free way, while OFC learns state values faster and in a model-based way, using one-step look-ahead. Both areas contribute to a single prediction error signal, computed in ventral tegmental area (VTA), that is used to teach both critics and the actor. As they receive the same teaching signal, the critics, OFC and VS, essentially compete for the value of each state. Our model makes a number of predictions regarding the effects of OFC and VS lesions on the response properties of dopaminergic (putatively prediction error encoding) neurons in VTA. The model predicts that lesions to either VS or OFC result in persistent prediction errors to predictable rewards and diminished prediction errors on the omission of predictable rewards. At the time of a reward-predicting cue, the model predicts that these lesions cause both positive and negative prediction errors to be diminished. When the animal is free to choose between a high and low valued option, we predict a difference between the effects of OFC and VS lesions. In the “unlesioned” model, because of the proposed look-ahead abilities of OFC, the model predicts differential signals at the time at which the decision is made, corresponding to whether the high or low valued option has been chosen. When the model-OFC is “lesioned”, however, these differential signals disappear as the model is no longer aware of the decision that will be made. This is not the case when model-VS is “lesioned”, in which case the difference between high and low valued options persists. These predictions regarding OFC lesions are borne out in Takahashi et al.'s experiments on rats.
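A stripped-down version of the shared-teaching-signal idea can be written out directly. The sketch below keeps only the two critics and a single combined TD error; it omits the actor and the OFC's one-step look-ahead, and the task and parameters are illustrative rather than those of the model described above.

```python
import numpy as np

# Toy task: a two-step chain 0 -> 1 -> 2, with reward delivered on entering state 2.
n_states, reward_state = 3, 2
alpha_vs, alpha_ofc = 0.05, 0.4        # VS critic learns slowly, OFC critic quickly (assumed)
v_vs = np.zeros(n_states)              # model-free critic (putative ventral striatum)
v_ofc = np.zeros(n_states)             # faster critic (putative OFC); look-ahead omitted

def combined(s):
    # Both critics contribute to the value estimate behind a single VTA prediction error
    return v_vs[s] + v_ofc[s]

for episode in range(300):
    for s in range(reward_state):
        s_next = s + 1
        r = 1.0 if s_next == reward_state else 0.0
        delta = r + combined(s_next) - combined(s)   # one shared TD error
        v_vs[s] += alpha_vs * delta                  # the same teaching signal trains
        v_ofc[s] += alpha_ofc * delta                # both critics

print("VS values: ", np.round(v_vs[:reward_state], 2))
print("OFC values:", np.round(v_ofc[:reward_state], 2))
```

Because both critics are trained by the same error, the faster learner absorbs most of each state's value, which is one concrete sense in which critics sharing a teaching signal "compete" for state value.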
Gershman, S. J., Blei, D. M., & Niv, Y. (2010). Context, Learning, and Extinction. Psychological Review , 117 (1), 197–209. PDFAbstract
A. Redish et al. (2007) proposed a reinforcement learning model of context-dependent learning and extinction in conditioning experiments, using the idea of "state classification" to categorize new observations into states. In the current article, the authors propose an interpretation of this idea in terms of normative statistical inference. They focus on renewal and latent inhibition, 2 conditioning paradigms in which contextual manipulations have been studied extensively, and show that online Bayesian inference within a model that assumes an unbounded number of latent causes can characterize a diverse set of behavioral results from such manipulations, some of which pose problems for the model of Redish et al. Moreover, in both paradigms, context dependence is absent in younger animals, or if hippocampal lesions are made prior to training. The authors suggest an explanation in terms of a restricted capacity to infer new causes.
Diuk, C., Botvinick, M. M., Barto, A. G., & Niv, Y. (2010). Hierarchical reinforcement learning: an fMRI study of learning in a two-level gambling task. In Neuroscience Meeting Planner.
Todd, M. T., Cohen, J. D., & Niv, Y. (2010). Identifying internal representations of context in fMRI. In Society for Neuroscience Abstracts.
Dayan, P. D., Daw, N. D., & Niv, Y. (2010). Learning, Action, Inference and Neuromodulation. Encyclopedia of Neuroscience (pp. 455–462). PDFAbstract
Neuromodulators such as dopamine, serotonin, norepinephrine, and acetylcholine play a critical role in controlling the learning about the association between stimuli and actions and rewards and punishments, and also in regulating the way that networks process information and effect decisions. Precise characterization of at least one part of their function in these respects comes from theories originating in statistics, operations research, and engineering, and computational theory ties together neural, psychological, ethological, and even economic notions. ©2009 Elsevier Ltd All rights reserved.
Gershman, S. J., & Niv, Y. (2010). Learning latent structure: carving nature at its joints. Curr Opin Neurobiol , 20 (2), 251–256. PDFAbstract
Reinforcement learning (RL) algorithms provide powerful explanations for simple learning and decision-making behaviors and the functions of their underlying neural substrates. Unfortunately, in real-world situations that involve many stimuli and actions, these algorithms learn pitifully slowly, exposing their inferiority in comparison to animal and human learning. Here we suggest that one reason for this discrepancy is that humans and animals take advantage of structure that is inherent in real-world tasks to simplify the learning problem. We survey an emerging literature on 'structure learning'–using experience to infer the structure of a task–and how this can be of service to RL, with an emphasis on structure in perception and action.
Gershman, S. J., Cohen, J. D., & Niv, Y. (2010). Learning to selectively attend. In 32nd Annual Conference of the Cognitive Science Society. PDFAbstract
How is reinforcement learning possible in a high-dimensional world? Without making any assumptions about the structure of the state space, the amount of data required to effectively learn a value function grows exponentially with the state space’s dimensionality. However, humans learn to solve high-dimensional problems much more rapidly than would be expected under this scenario. This suggests that humans employ inductive biases to guide (and accelerate) their learning. Here we propose one particular bias—sparsity—that ameliorates the computational challenges posed by high-dimensional state spaces, and present experimental evidence that humans can exploit sparsity information when it is available.
Niv, Y., & Gershman, S. J. (2010). Representation Learning and Reinforcement Learning : An fMRI study of learning to selectively attend. In Society for Neuroscience Abstracts.
2009
Niv, Y., & Montague, R. P. (2009). Theoretical and Empirical Studies of Learning. Neuroeconomics , 331–351 . Elsevier. PDFAbstract
This chapter introduces the reinforcement learning framework and gives a brief background to the origins and history of reinforcement learning models of decision-making. Reinforcement learning provides a normative framework, within which conditioning can be analyzed. That is, it suggests a means by which optimal prediction and action selection can be achieved, and exposes explicitly the computations that must be realized in the service of these. In contrast to descriptive models that describe behavior as it is, normative models study behavior from the point of view of its hypothesized function; that is, they study behavior as it should be if it were to accomplish specific goals in an optimal way. The appeal of normative models derives from several sources. Historically, the core ideas in reinforcement learning arose from two separate and parallel lines of research. One axis is mainly associated with Richard Sutton, formerly an undergraduate psychology major, and his PhD advisor, Andrew Barto, a computer scientist. Interested in artificial intelligence and agent-based learning, Sutton and Barto developed algorithms for reinforcement learning that were inspired by the psychological literature on Pavlovian and instrumental conditioning. ©2009 Elsevier Inc. All rights reserved.
Botvinick, M. M., Niv, Y., & Barto, A. G. (2009). Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition , 113 (3), 262–280. PDFAbstract
Research on human and animal behavior has long emphasized its hierarchical structure-the divisibility of ongoing behavior into discrete tasks, which are comprised of subtask sequences, which in turn are built of simple actions. The hierarchical structure of behavior has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. In this paper, we reexamine behavioral hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning. Specifically, we consider a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills. A close look at the components of hierarchical reinforcement learning suggests how they might map onto neural structures, in particular regions within the dorsolateral and orbital prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement learning might provide a complement to existing psychological models of hierarchically structured behavior. A particularly important question that hierarchical reinforcement learning brings to the fore is that of how learning identifies new action routines that are likely to provide useful building blocks in solving a wide range of future problems. Here and at many other points, hierarchical reinforcement learning offers an appealing framework for investigating the computational and neural underpinnings of hierarchically structured behavior.
Todd, M. T., Niv, Y., & Cohen, J. D. (2009). Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Ed.), Advances in Neural Information Processing Systems 21 (pp. 1689–1696). PDFAbstract
Working memory is a central topic of cognitive neuroscience because it is critical for solving real-world problems in which information from multiple temporally distant sources must be combined to generate appropriate behavior. However, an often-neglected fact is that learning to use working memory effectively is itself a difficult problem. The Gating framework is a collection of psychological models that show how dopamine can train the basal ganglia and prefrontal cortex to form useful working memory representations in certain types of problems. We unite Gating with machine learning theory concerning the general problem of memory-based optimal control. We present a normative model that learns, by online temporal difference methods, to use working memory to maximize discounted future reward in partially observable settings. The model successfully solves a benchmark working memory problem, and exhibits limitations similar to those observed in humans. Our purpose is to introduce a concise, normative definition of high-level cognitive concepts such as working memory and cognitive control in terms of maximizing discounted future rewards.
Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology , 53 (3), 139–154. PDFAbstract
A wealth of research focuses on the decision-making processes that animals and humans employ when selecting actions in the face of reward and punishment. Initially such work stemmed from psychological investigations of conditioned behavior, and explanations of these in terms of computational models. Increasingly, analysis at the computational level has drawn on ideas from reinforcement learning, which provide a normative framework within which decision-making can be analyzed. More recently, the fruits of these extensive lines of research have made contact with investigations into the neural basis of decision making. Converging evidence now links reinforcement learning to specific neural substrates, assigning them precise computational roles. Specifically, electrophysiological recordings in behaving animals and functional imaging of human decision-making have revealed in the brain the existence of a key reinforcement learning signal, the temporal difference reward prediction error. Here, we first introduce the formal reinforcement learning framework. We then review the multiple lines of evidence linking reinforcement learning to the function of dopaminergic neurons in the mammalian midbrain and to more recent data from human imaging experiments. We further extend the discussion to aspects of learning not associated with phasic dopamine signals, such as learning of goal-directed responding that may not be dopamine-dependent, and learning about the vigor (or rate) with which actions should be performed that has been linked to tonic aspects of dopaminergic signaling. We end with a brief discussion of some of the limitations of the reinforcement learning framework, highlighting questions for future research.
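For readers new to the framework, the central quantity discussed in the review, the temporal-difference reward prediction error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), can be seen at work in a few lines. This is a generic textbook sketch, not an analysis from the review; the task timing and parameters are arbitrary. Before learning the error occurs at the time of reward, and after learning it has moved back to the earliest reliable predictor, mirroring the dopamine recordings discussed above.

```python
import numpy as np

gamma, alpha, n_steps = 0.95, 0.1, 5     # cue at t=0, reward at t=4 (arbitrary timing)
V = np.zeros(n_steps + 1)                # V[n_steps] is the post-reward state (value 0)

def run_trial(learn=True):
    deltas = []
    for t in range(n_steps):
        r = 1.0 if t == n_steps - 1 else 0.0
        delta = r + gamma * V[t + 1] - V[t]   # TD reward prediction error
        if learn:
            V[t] += alpha * delta
        deltas.append(round(delta, 2))
    return deltas

print("prediction errors, first trial:", run_trial())
for _ in range(500):
    run_trial()
print("prediction errors, after training:", run_trial(learn=False))
print("value at cue onset (unpredicted, so itself an error):", round(V[0], 2))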
2008
Niv, Y., & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences , 12 (7), 265–272. PDFAbstract
The recognition that computational ideas from reinforcement learning are relevant to the study of neural circuits has taken the cognitive neuroscience community by storm. A central tenet of these models is that discrepancies between actual and expected outcomes can be used for learning. Neural correlates of such prediction-error signals have been observed now in midbrain dopaminergic neurons, striatum, amygdala and even prefrontal cortex, and models incorporating prediction errors have been invoked to explain complex phenomena such as the transition from goal-directed to habitual behavior. Yet, like any revolution, the fast-paced progress has left an uneven understanding in its wake. Here, we provide answers to ten simple questions about prediction errors, with the aim of exposing both the strengths and the limitations of this active area of neuroscience research. ©2008 Elsevier Ltd. All rights reserved.
Schiller, D., Levy, I., Niv, Y., LeDoux, J. E., & Phelps, E. A. (2008). From Fear to Safety and Back: Reversal of Fear in the Human Brain. Journal of Neuroscience , 28 (45), 11517–11525. PDFAbstract
Fear learning is a rapid and persistent process that promotes defense against threats and reduces the need to relearn about danger. However, it is also important to flexibly readjust fear behavior when circumstances change. Indeed, a failure to adjust to changing conditions may contribute to anxiety disorders. A central, yet neglected aspect of fear modulation is the ability to flexibly shift fear responses from one stimulus to another if a once-threatening stimulus becomes safe or a once-safe stimulus becomes threatening. In these situations, the inhibition of fear and the development of fear reactions co-occur but are directed at different targets, requiring accurate responding under continuous stress. To date, research on fear modulation has focused mainly on the shift from fear to safety by using paradigms such as extinction, resulting in a reduction of fear. The aim of the present study was to track the dynamic shifts from fear to safety and from safety to fear when these transitions occur simultaneously. We used functional neuroimaging in conjunction with a fear-conditioning reversal paradigm. Our results reveal a unique dissociation within the ventromedial prefrontal cortex between a safe stimulus that previously predicted danger and a "naive" safe stimulus. We show that amygdala and striatal responses tracked the fear-predictive stimuli, flexibly flipping their responses from one predictive stimulus to another. Moreover, prediction errors associated with reversal learning correlated with striatal activation. These results elucidate how fear is readjusted to appropriately track environmental changes, and the brain mechanisms underlying the flexible control of fear.
Dayan, P. D., & Niv, Y. (2008). Reinforcement learning: The Good, The Bad and The Ugly. Current Opinion in Neurobiology , 18 (2), 185–196. PDFAbstract
Reinforcement learning provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making in the face of rewards and punishments. Here we review the latest dispatches from the forefront of this field, and map out some of the territories where lie monsters. ©2008 Elsevier Ltd. All rights reserved.
Takahashi, Y. K. (2008). Silencing the critics: understanding the effects of cocaine sensitization on dorsolateral and ventral striatum in the context of an actor/critic model. Frontiers in Neuroscience , 2 (1), 86–99. PDFAbstract
A critical problem in daily decision making is how to choose actions now in order to bring about rewards later. Indeed, many of our actions have long-term consequences, and it is important to not be myopic in balancing the pros and cons of different options, but rather to take into account both immediate and delayed consequences of actions. Failures to do so may be manifest as persistent, maladaptive decision-making, one example of which is addiction where behavior seems to be driven by the immediate positive experiences with drugs, despite the delayed adverse consequences. A recent study by Takahashi et al. (2007) investigated the effects of cocaine sensitization on decision making in rats and showed that drug use resulted in altered representations in the ventral striatum and the dorsolateral striatum, areas that have been implicated in the neural instantiation of a computational solution to optimal long-term actions selection called the Actor/Critic framework. In this Focus article we discuss their results and offer a computational interpretation in terms of drug-induced impairments in the Critic. We first survey the different lines of evidence linking the subparts of the striatum to the Actor/Critic framework, and then suggest two possible scenarios of breakdown that are suggested by Takahashi et al.'s (2007) data. As both are compatible with the current data, we discuss their different predictions and how these could be empirically tested in order to further elucidate (and hopefully inch towards curing) the neural basis of drug addiction.
2007
Niv, Y. (2007). Cost, benefit, tonic, phasic: What do response rates tell us about dopamine and motivation? Annals of the New York Academy of Sciences , 1104, 357–376. PDFAbstract
The role of dopamine in decision making has received much attention from both the experimental and computational communities. However, because reinforcement learning models concentrate on discrete action selection and on phasic dopamine signals, they are silent as to how animals decide upon the rate of their actions, and they fail to account for the prominent effects of dopamine on response rates. We suggest an extension to reinforcement learning models in which response rates are optimally determined by balancing the tradeoff between the cost of fast responding and the benefit of rapid reward acquisition. The resulting behavior conforms well with numerous characteristics of free-operant responding. More importantly, this framework highlights a role for a tonic signal corresponding to the net rate of rewards, in determining the optimal rate of responding. We hypothesize that this critical quantity is conveyed by tonic levels of dopamine, explaining why dopaminergic manipulations exert a global effect on response rates. We further suggest that the effects of motivation on instrumental rates of responding are mediated through its influence on the net reward rate, implying a tight coupling between motivational states and tonic dopamine. The relationships between phasic and tonic dopamine signaling, and between directing and energizing effects of motivation, as well as the implications for motivational control of habitual and goal-directed instrumental action selection, are discussed. ©2007 New York Academy of Sciences.
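The cost-benefit tradeoff can be made concrete with a back-of-the-envelope calculation. The functional forms below are hypothetical stand-ins, not the full model in the paper: responding with latency tau incurs a vigor cost that grows as 1/tau, while every unit of time spent carries an opportunity cost equal to the net reward rate. Under these assumptions the best latency shrinks as the net reward rate rises, which is the "energizing" effect attributed here to tonic dopamine.

```python
import numpy as np

def optimal_latency(vigor_cost, net_reward_rate):
    """Latency that minimizes cost per response under assumed forms:
    vigor cost ~ vigor_cost / tau, opportunity cost ~ net_reward_rate * tau."""
    taus = np.linspace(0.05, 10.0, 2000)
    total = vigor_cost / taus + net_reward_rate * taus
    return taus[np.argmin(total)]

for rbar in (0.2, 1.0, 5.0):   # illustrative net reward rates (e.g., higher when hungry)
    print(f"net reward rate {rbar}: optimal latency = {optimal_latency(1.0, rbar):.2f}")
```

Analytically the minimum lies at tau = sqrt(vigor_cost / net_reward_rate), so a higher net reward rate shortens all latencies regardless of which action is chosen, in line with the global effect described above.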
Niv, Y. (2007). The effects of motivation on behavior. Doctoral dissertation, The Hebrew University of Jerusalem. PDFAbstract
This thesis provides a normative computational analysis of how motivation affects decision making. More specifically, we provide a reinforcement learning model of optimal self-paced (free-operant) learning and behavior, and use it to address three broad classes of questions: (1) Why do animals work harder in some instrumental tasks than in others? (2) How do motivational states affect responding in such tasks, particularly in those cases in which behavior is habitual, that is, when responding is insensitive to changes in the specific worth of its goals, such as a higher value of food when hungry rather than sated? and (3) Why do dopaminergic manipulations cause global changes in the vigor of responding, and how is this related to prominent accounts of the role of dopamine in providing basal ganglia and frontal cortical areas with a reward prediction error signal that can be used for learning to choose between actions? A fundamental question in behavioral neuroscience concerns the decision-making processes by which animals and humans select actions in the face of reward and punishment. In Chapter 1 we provide a brief overview of the current status of this research, focused on three themes: behavior, computation and neural substrates. In behavioral psychology, this question has been investigated through the paradigms of Pavlovian (classical) and instrumental (operant) conditioning, and much evidence has accumulated regarding the associations that control different aspects of learned behavior. The computational field of reinforcement learning has provided a normative framework within which conditioned behavior can be understood. In this, optimal action selection is based on predictions of long-run future consequences, such that decision making is aimed at maximizing rewards and minimizing punishment. Neuroscientific evidence from lesion studies, pharmacological manipulations and electrophysiological recordings in behaving animals have further provided tentative links to neural structures underlying key computational constructs in these models. Most notably, much evidence suggests that the neuromodulator dopamine provides basal ganglia target structures with a reward prediction error that can influence learning and action selection, particularly in stimulus-driven habitual instrumental behavior. However, although reinforcement learning models have long promised to unify computational, psychological and neural accounts of appetitively conditioned behavior, we claim here that they suffer from a large theoretical oversight. While a bulk of data on animal conditioning comes from free-operant experiments measuring how fast animals will work for reinforcement, existing reinforcement learning models lack any notion of vigor or response rate, focusing instead only on competition between different responses, and so they are silent about these tasks. In Chapter 2 we first review the basic characteristics of free-operant behavior, illustrating the effects of reinforcement schedules on rates of responding. We then develop a reinforcement learning model in which vigor selection is optimized together with response selection. The model suggests that subjects choose how vigorously to perform selected actions by optimally balancing the costs and benefits of different speeds of responding.
Importantly, we show that this model accounts normatively for effects of reinforcement schedules on response rates, such as the fact that responding on ratio schedules is faster than responding on interval schedules that yield the same rate of reinforcement. Finally, the model highlights the importance of the net rate of rewards in quantifying the opportunity cost of time, and thus in determining response vigor. In Chapter 3 we flesh out the implications of this model for the motivational control of habitual behavior. In general, understanding the effects of motivation on instrumental action selection is fundamental to the study of decision making. Recent work has shown that motivational control can be used to divide instrumental behavior into two classes: 'goal-directed' behavior is immediately sensitive to motivation-induced changes in the values of its specific consequences, while 'habitual' behavior is not. Because habitual behavior constitutes a large proportion of our daily activities, it is thus important to ask: how does motivation affect habitual behavior? That is, how can habitual behavior be performed so as to achieve motivationally relevant goals? We start by defining motivation as a mapping from outcomes to utilities. Incorporating this into the computational framework of optimal response rates, we show that, in general, the optimal effects of motivation on behavior should be two-fold: On the one hand, motivation should affect the choice between actions such that actions leading to those outcomes that are more highly valued are more probable. This corresponds to the traditional directing effect of motivation. On the other hand, by influencing the opportunity cost of time, motivation should affect the response rates of all chosen actions, irrespective of their specific outcomes, as in the decades-old (but controversial) idea that motivation energizes behavior. This global effect of motivation explains not only why hungry rats work harder for food, but also sheds light on the counterintuitive observation that they will sometimes work harder for water. Based on the computational view of habitual behavior as arising from cached values summarizing long-run reward predictions, we suggest that habitual action selection can direct responding properly only in those motivational states which pertained during behavioral training. However, this does not imply insensitivity to novel motivational states. In these, we propose that the outcome-independent, global effects of motivation can 'energize' habitual actions, as a well-founded approximation to the optimal solution in a trained situation.
That is, habitual response rates can be adjusted to the current motivational state, in a way that is optimal given the computational limitations of the habitual system. Our computational framework suggests that the energizing function of motivation is mediated by the expected net rate of rewards. In Chapter 4, we put forth the hypothesis that this important quantity is reported by tonic levels of dopamine. Dopamine neurotransmission has long been known to exert a powerful influence over the vigor, strength or rate of responding. However, there exists no clear understanding of the computational foundation for this effect. Previous reinforcement learning models of habitual behavior have indeed suggested an interpretation of the function of dopaminergic signals in the brain. However, these have concentrated only on the role of precisely timed phasic dopaminergic signals in learning the predictive value of different actions, and have ignored both tonic dopamine transmission and response vigor. Our tonic dopamine hypothesis focuses on the involvement of dopamine in the control of vigor, explaining why higher levels of dopamine are associated with globally higher response rates, i.e., why, like motivation, dopamine 'energizes' behavior. In this way, through the computational framework of optimal choice of response rates, we suggest an explanation of the motivational control of habitual behavior, on both the behavioral and the neural levels. Reinforcement learning models of animal learning are appealing not only because they provide a normative basis for decision-making, but also because they show that optimal action selection can be learned through online incremental experience with the environment, using only locally available information. To complete the picture of how dopamine influences free-operant learning and behavior, in Chapter 5 we describe an online algorithm of the type usually associated with instrumental learning and decision-making, which is suitable for learning to select actions and latencies according to our new framework. There are two major differences between learning in our model and previous online reinforcement learning algorithms: First, most prior applications have dealt with discounted reinforcement learning, while we use average-reward reinforcement learning. Second, unlike previous models that have focused on discrete action selection, the action space in our model is inherently continuous, as it includes a choice of response latency. We thus propose a new online learning algorithm that is specifically suitable for our needs. In this algorithm, building on the experimental characteristics of response latencies, we suggest a functional parameterization of the action space that drastically reduces the complexity of learning. Moreover, we suggest a formulation of online action selection in which response rates are directly affected by the net reward rate. We show that our algorithm learns to respond appropriately, and with nearly optimal latencies, and discuss its implications for the differences between the learning of interval and ratio schedules. In Chapter 6, the last of the main chapters, we deviate from the theoretical analysis of behavior to describe two instrumental conditioning experiments.
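As a schematic summary of the two effects distinguished above (the symbols here are introduced for illustration and are not quoted from the thesis), a motivational state \(m\) can be treated as a mapping from outcomes \(o\) to utilities \(u_m(o)\). The directing effect then enters through the choice between actions, while the energizing effect enters through the net reward rate \(\bar{R}_m\) that sets the opportunity cost of time:

\[
a^{*} \;\in\; \arg\max_{a}\; \mathbb{E}\!\left[\,u_m(o)\mid a\,\right],
\qquad
\tau^{*}_a \;\propto\; \frac{1}{\sqrt{\bar{R}_m}} \;\;\text{for every chosen action } a .
\]

On this sketch, shifting from sated to hungry raises \(u_m(\text{food})\), which redirects choice toward food-related actions (the directing effect), and, by raising \(\bar{R}_m\), shortens the optimal latency of all actions regardless of their specific outcomes (the energizing effect), which is the outcome-independent influence proposed for habitual control.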
Niv, Y., & Rivlin-Etzion, M. (2007). Parkinson's Disease: Fighting the Will? Journal of Neuroscience , 27 (44), 11777–11779. PDFAbstract
A phenomenon familiar to clinicians treating patients with Parkinson's disease (PD) is kinesia paradoxica: astonishing displays of sudden mobility and agility by otherwise akinetic PD patients in instances of emergency (for instance,
Niv, Y., Daw, N. D., Joel, D., & Dayan, P. D. (2007). Tonic dopamine: Opportunity costs and the control of response vigor. Psychopharmacology , 191 (3), 507–520. PDFAbstract
RATIONALE: Dopamine neurotransmission has long been known to exert a powerful influence over the vigor, strength, or rate of responding. However, there exists no clear understanding of the computational foundation for this effect; predominant accounts of dopamine's computational function focus on a role for phasic dopamine in controlling the discrete selection between different actions and have nothing to say about response vigor or indeed the free-operant tasks in which it is typically measured. OBJECTIVES: We seek to accommodate free-operant behavioral tasks within the realm of models of optimal control and thereby capture how dopaminergic and motivational manipulations affect response vigor. METHODS: We construct an average reward reinforcement learning model in which subjects choose both which action to perform and also the latency with which to perform it. Optimal control balances the costs of acting quickly against the benefits of getting reward earlier and thereby chooses a best response latency. RESULTS: In this framework, the long-run average rate of reward plays a key role as an opportunity cost and mediates motivational influences on rates and vigor of responding. We review evidence suggesting that the average reward rate is reported by tonic levels of dopamine putatively in the nucleus accumbens. CONCLUSIONS: Our extension of reinforcement learning models to free-operant tasks unites psychologically and computationally inspired ideas about the role of tonic dopamine in striatum, explaining from a normative point of view why higher levels of dopamine might be associated with more vigorous responding.
2006
Daw, N. D., Niv, Y., & Dayan, P. D. (2006). Actions, Policies, Values, and the Basal Ganglia. In E. Bezard (Ed.), Recent Breakthroughs in Basal Ganglia Research (pp. 111–130) . Nova Science Publishers Inc. PDF
Niv, Y., Daw, N. D., & Dayan, P. D. (2006). Choice values. Nature Neuroscience , 9 (8), 987–988. PDFAbstract
Dopaminergic neurons are thought to inform decisions by reporting errors in reward prediction. A new study reports dopaminergic responses as monkeys make choices, supporting one computational theory of appetitive learning.
Dayan, P. D., Niv, Y., Seymour, B., & Daw, N. D. (2006). The misbehavior of value and the discipline of the will. Neural Networks , 19 (8), 1153–1160. PDFAbstract
Most reinforcement learning models of animal conditioning operate under the convenient, though fictive, assumption that Pavlovian conditioning concerns prediction learning whereas instrumental conditioning concerns action learning. However, it is only through Pavlovian responses that Pavlovian prediction learning is evident, and these responses can act against the instrumental interests of the subjects. This can be seen in both experimental and natural circumstances. In this paper we study the consequences of importing this competition into a reinforcement learning context, and demonstrate the resulting effects in an omission schedule and a maze navigation task. The misbehavior created by Pavlovian values can be quite debilitating; we discuss how it may be disciplined. ©2006 Elsevier Ltd. All rights reserved.
Niv, Y., Edlund, J. A., Dayan, P. D., & O'Doherty, J. P. (2006). Neural correlates of risk-sensitivity: An fMRI study of instrumental choice behavior. In Society for Neuroscience Abstracts.
Niv, Y., Joel, D., & Dayan, P. D. (2006). A normative perspective on motivation. Trends in Cognitive Science , 10 (8), 375–381. PDFAbstract
Understanding the effects of motivation on instrumental action selection, and specifically on its two main forms, goal-directed and habitual control, is fundamental to the study of decision making. Motivational states have been shown to 'direct' goal-directed behavior rather straightforwardly towards more valuable outcomes. However, how motivational states can influence outcome-insensitive habitual behavior is more mysterious. We adopt a normative perspective, assuming that animals seek to maximize the utilities they achieve, and viewing motivation as a mapping from outcomes to utilities. We suggest that habitual action selection can direct responding properly only in motivational states which pertained during behavioral training. However, in novel states, we propose that outcome-independent, global effects of the utilities can 'energize' habitual actions.
2005
Niv, Y., Duff, M. O., & Dayan, P. D. (2005). Dopamine, uncertainty and TD learning. Behavioral and Brain Functions , 1, 6. PDFAbstract
Substantial evidence suggests that the phasic activities of dopaminergic neurons in the primate midbrain represent a temporal difference (TD) error in predictions of future reward, with increases above and decreases below baseline consequent on positive and negative prediction errors, respectively. However, dopamine cells have very low baseline activity, which implies that the representation of these two sorts of error is asymmetric. We explore the implications of this seemingly innocuous asymmetry for the interpretation of dopaminergic firing patterns in experiments with probabilistic rewards, which bring about persistent prediction errors. In particular, we show that when averaging the non-stationary prediction errors across trials, a ramping in the activity of the dopamine neurons should be apparent, whose magnitude is dependent on the learning rate. This exact phenomenon was observed in a recent experiment, though it was interpreted there in antipodal terms, as a within-trial encoding of uncertainty.
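The asymmetry argument can be made concrete with a small amount of notation (introduced here for illustration, not quoted from the paper). If positive prediction errors are represented faithfully in dopaminergic firing, but negative errors are compressed by a factor \(d>1\) because of the low baseline, then

\[
\delta_{\mathrm{DA}}(t) \;=\;
\begin{cases}
\delta(t), & \delta(t)\ge 0,\\
\delta(t)/d, & \delta(t)<0,
\end{cases}
\qquad\text{so that}\qquad
\big\langle \delta_{\mathrm{DA}}(t) \big\rangle_{\text{trials}} \;>\; 0
\;\;\text{even when}\;\;
\big\langle \delta(t) \big\rangle_{\text{trials}} = 0 .
\]

Because a nonzero learning rate keeps regenerating prediction errors at times preceding a probabilistic reward, averaging the asymmetrically represented errors across trials yields an apparent ramp toward the time of reward whose magnitude grows with the learning rate, as described in the abstract above.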
Niv, Y., Daw, N. D., & Dayan, P. D. (2005). How fast to work: response vigor, motivation and tonic dopamine. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Neural Information Processing Systems (Vol. 18, pp. 1019–1026) . MIT Press. PDFAbstract
Reinforcement learning models have long promised to unify computational, psychological and neural accounts of appetitively conditioned behavior. However, the bulk of data on animal conditioning comes from free-operant experiments measuring how fast animals will work for reinforcement. Existing reinforcement learning (RL) models are silent about these tasks, because they lack any notion of vigor. They thus fail to address the simple observation that hungrier animals will work harder for food, as well as stranger facts such as their sometimes greater productivity even when working for irrelevant outcomes such as water. Here, we develop an RL framework for free-operant behavior, suggesting that subjects choose how vigorously to perform selected actions by optimally balancing the costs and benefits of quick responding. Motivational states such as hunger shift these factors, skewing the tradeoff. This accounts normatively for the effects of motivation on response rates, as well as many other classic findings. Finally, we suggest that tonic levels of dopamine may be involved in the computation linking motivational state to optimal responding, thereby explaining the complex vigor-related effects of pharmacological manipulation of dopamine.
