Reinforcement Learning Using a Continuous Time Actor-critic Framework with Spiking Neurons

Overview

Journal PLoS Comput Biol

Specialty Biology

Date 2013 Apr 18

PMID 23592970

Citations 52

Authors

Nicolas Fremaux

Henning Sprekeler

Wulfram Gerstner

Abstract

Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
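The continuous TD signal mentioned in the abstract can be illustrated with a small numerical sketch. It assumes the usual continuous-time form of the TD error following Doya (2000), delta(t) = r(t) - V(t)/tau_r + dV/dt, where V is the critic's value estimate and tau_r is the reward discount time constant. The function name, the time step, the constant-reward scenario, and all parameter values are illustrative assumptions. This is a rate-free toy computation, not the spiking implementation described in the paper.

```python
import numpy as np

def continuous_td_error(values, rewards, dt, tau_r):
    """Finite-difference estimate of delta(t) = r(t) - V(t)/tau_r + dV/dt.

    values, rewards: samples of V(t) and r(t) on a uniform time grid.
    dt: grid spacing; tau_r: discount time constant (same units as dt).
    """
    values = np.asarray(values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    dv_dt = np.gradient(values, dt)  # numerical time derivative of V
    return rewards - values / tau_r + dv_dt

# Hypothetical scenario: a constant reward rate r0 delivered forever.
tau_r = 4.0   # assumed discount time constant
dt = 0.01     # assumed time step
r0 = 1.0      # assumed reward rate
t = np.arange(0.0, 10.0, dt)
rewards = np.full_like(t, r0)

# The exponentially discounted future reward is then V = r0 * tau_r,
# so a critic that has converged should produce zero TD error.
true_values = np.full_like(t, r0 * tau_r)
delta_true = continuous_td_error(true_values, rewards, dt, tau_r)

# A naive critic that predicts no reward sees a positive error equal to r0.
naive_values = np.zeros_like(t)
delta_naive = continuous_td_error(naive_values, rewards, dt, tau_r)
```

In this toy setting a positive error tells the critic that it underestimates future reward. In the model described above, the same kind of signal is broadcast as a neuromodulatory factor to both the critic and the actor.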

Citing Articles

An accurate and fast learning approach in the biologically spiking neural network.

Nazari S, Amiri M Sci Rep. 2025; 15(1):6585.

PMID: 39994277 PMC: 11850897. DOI: 10.1038/s41598-025-90113-0.

A spiking neural network for active efficient coding.

Barbier T, Teuliere C, Triesch J Front Robot AI. 2025; 11:1435197.

PMID: 39882552 PMC: 11775837. DOI: 10.3389/frobt.2024.1435197.

Global remapping emerges as the mechanism for renewal of context-dependent behavior in a reinforcement learning model.

Kappel D, Cheng S Front Comput Neurosci. 2025; 18:1462110.

PMID: 39881840 PMC: 11774835. DOI: 10.3389/fncom.2024.1462110.

Brain-inspired learning rules for spiking neural network-based control: a tutorial.

Lee C, Park Y, Yoon S, Lee J, Cho Y, Park C Biomed Eng Lett. 2025; 15(1):37-55.

PMID: 39781065 PMC: 11704115. DOI: 10.1007/s13534-024-00436-6.

Exploring spiking neural networks for deep reinforcement learning in robotic tasks.

Zanatta L, Barchi F, Manoni S, Tolu S, Bartolini A, Acquaviva A Sci Rep. 2024; 14(1):30648.

PMID: 39730367 PMC: 11680704. DOI: 10.1038/s41598-024-77779-8.
