Reinforcement Learning Model With Dynamic State Space Tested on Target Search Tasks for Monkeys: Self-Determination of Previous States Based on Experience Saturation and Decision Uniqueness

Overview

Journal Front Comput Neurosci

Specialty Biology

Date 2022 Feb 21

PMID 35185502

Authors

Tokio Katakura

Mikihiro Yoshida

Haruki Hisano

Hajime Mushiake

Kazuhiro Sakamoto

Abstract

The real world is essentially an indefinite environment in which the probability space, i.e., what can happen, cannot be specified in advance. Conventional reinforcement learning models that learn under uncertain conditions are given the state space as prior knowledge. Here, we developed a reinforcement learning model with a dynamic state space and tested it on a two-target search task previously used for monkeys. In the task, two out of four neighboring spots were alternately correct, and the valid pair was switched after consecutive correct trials in the exploitation phase. The agent was required to find a new pair during the exploration phase, but it could not obtain the maximum reward by referring only to the single previous trial; it needed to select an action based on the two previous trials. To adapt to this task structure without prior knowledge, the model expanded its state space so that it referred to more than one trial as the previous state, based on two explicit criteria for the appropriateness of state expansion: experience saturation and decision uniqueness of action selection. The model not only performed comparably to the ideal model given prior knowledge of the task structure, but also performed well on a task that was not envisioned when the models were developed. Moreover, it learned how to search rationally without falling into the exploration-exploitation trade-off. The method of expanding the state space based on experience saturation and decision uniqueness of action selection, as used by our model, is promising for constructing a learning model that can adapt to an indefinite environment.
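The claim that one previous trial is not enough, while two previous trials are, can be illustrated with a small sketch. This is not the authors' model: it only enumerates the alternating target sequences of the exploitation phase and ignores rewards, exploration, and pair switching. The names `alternating_targets` and `next_targets_by_history` are hypothetical, and allowing any two of the four spots to form a valid pair is an assumption, since the abstract does not specify which pairs were used.

```python
from collections import defaultdict
from itertools import combinations

SPOTS = range(4)  # four spots, labeled 0..3 (labels are an assumption)


def alternating_targets(order, n_trials):
    """Correct target on each trial when the two spots in `order` alternate."""
    return [order[t % 2] for t in range(n_trials)]


def next_targets_by_history(history_len, n_trials=6):
    """Map each history of the last `history_len` correct targets to the
    set of correct targets that can follow it, across all candidate pairs."""
    table = defaultdict(set)
    for pair in combinations(SPOTS, 2):  # assumption: any two spots may pair
        for order in (pair, pair[::-1]):  # either spot may come first
            seq = alternating_targets(order, n_trials)
            for t in range(history_len, n_trials):
                history = tuple(seq[t - history_len:t])
                table[history].add(seq[t])
    return table


one_back = next_targets_by_history(1)
two_back = next_targets_by_history(2)

# With one previous trial, several next targets remain possible.
ambiguous = {h: nxt for h, nxt in one_back.items() if len(nxt) > 1}
# With two previous trials, the next target is determined.
unique = all(len(nxt) == 1 for nxt in two_back.values())

print("ambiguous one-back histories:", len(ambiguous))
print("two-back histories all unique:", unique)
```

In this simplified setting, knowing only the last correct spot leaves the partner spot undetermined, whereas the last two correct spots identify the valid pair and therefore the next target. This is the kind of ambiguity that a decision-uniqueness criterion could detect, although how the published model measures it is not specified in the abstract.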

Citing Articles

Reinforcement Learning Model With Dynamic State Space Tested on Target Search Tasks for Monkeys: Extension to Learning Task Events.

Sakamoto K, Yamada H, Kawaguchi N, Furusawa Y, Saito N, Mushiake H Front Comput Neurosci. 2022; 16:784604.

PMID: 35720772 PMC: 9201426. DOI: 10.3389/fncom.2022.784604.
