Fine-grained Knowledge About Manipulable Objects is Well-predicted by Contrastive Language Image Pre-training

Overview

Journal: iScience

Publisher: Cell Press

Date: 2024 Jul 23

PMID: 39040066

Authors

Jon Walbrin

Nikita Sossounov

Morteza Mahdiani

Igor Vaz

Jorge Almeida


Abstract

Object recognition is an important ability that relies on distinguishing between similar objects (e.g., deciding which utensil(s) to use at different stages of meal preparation). Recent work describes the fine-grained organization of knowledge about manipulable objects via the study of the constituent dimensions that are most relevant to human behavior, for example, vision, manipulation, and function-based properties. A logical extension of this work concerns whether or not these dimensions are uniquely human, or can be approximated by deep learning. Here, we show that behavioral dimensions are generally well-predicted by CLIP-ViT - a multimodal network trained on a large and diverse set of image-text pairs. Moreover, this model outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained object knowledge. We discuss the possible sources of this benefit relative to other models (e.g., multimodal vs. image-only pre-training, dataset size, architecture).
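The abstract does not state how the behavioral dimensions were predicted from network activations. As a minimal sketch of one common approach, assuming cross-validated ridge regression from model features to per-object dimension scores, the example below uses purely synthetic data. The function names, array sizes, and regularization strength are hypothetical choices for illustration, not the authors' pipeline; in practice the feature matrix would hold activations extracted from a network such as CLIP-ViT.

```python
import numpy as np


def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights; X and Y are assumed to be centered by the caller."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def cv_prediction_scores(features, dims, alpha=1.0, n_folds=5, seed=0):
    """Pearson r per dimension between held-out predictions and observed scores."""
    rng = np.random.default_rng(seed)
    n_items = features.shape[0]
    folds = np.array_split(rng.permutation(n_items), n_folds)
    preds = np.zeros_like(dims, dtype=float)

    for test_idx in folds:
        train_mask = np.ones(n_items, dtype=bool)
        train_mask[test_idx] = False
        # Center using training statistics only, so no test information leaks in.
        x_mean = features[train_mask].mean(axis=0)
        y_mean = dims[train_mask].mean(axis=0)
        weights = ridge_fit(features[train_mask] - x_mean,
                            dims[train_mask] - y_mean, alpha)
        preds[test_idx] = (features[test_idx] - x_mean) @ weights + y_mean

    # Column-wise Pearson correlation between predictions and observations.
    pc = preds - preds.mean(axis=0)
    dc = dims - dims.mean(axis=0)
    return (pc * dc).sum(axis=0) / np.sqrt((pc ** 2).sum(axis=0) * (dc ** 2).sum(axis=0))


# Synthetic stand-ins (arbitrary sizes): 60 objects, 20 network features,
# 3 behavioral dimensions generated as a noisy linear function of the features.
rng = np.random.default_rng(42)
features = rng.normal(size=(60, 20))
true_weights = rng.normal(size=(20, 3))
dims = features @ true_weights + 0.1 * rng.normal(size=(60, 3))

scores = cv_prediction_scores(features, dims)
print(scores.round(2))
```

Scoring on held-out objects rather than on the training set is the design choice that matters here: it measures how well the features generalize to unseen objects, which is the sense in which a dimension can be called "well-predicted". Comparing networks would then amount to running the same procedure on each model's features and comparing the per-dimension scores.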
