Fine-grained Knowledge About Manipulable Objects is Well-predicted by Contrastive Language Image Pre-training

Overview
Journal iScience
Publisher Cell Press
Date 2024 Jul 23
PMID 39040066
Abstract

Object recognition is an important ability that relies on distinguishing between similar objects (e.g., deciding which utensil(s) to use at different stages of meal preparation). Recent work describes the fine-grained organization of knowledge about manipulable objects via the study of the constituent dimensions that are most relevant to human behavior, for example, vision, manipulation, and function-based properties. A logical extension of this work concerns whether these dimensions are uniquely human or can be approximated by deep learning. Here, we show that behavioral dimensions are generally well-predicted by CLIP-ViT, a multimodal network trained on a large and diverse set of image-text pairs. Moreover, this model outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained object knowledge. We discuss the possible sources of this benefit relative to other models (e.g., multimodal vs. image-only pre-training, dataset size, architecture).
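
To make concrete what "well-predicted" could mean here, the sketch below shows one common way to test whether a behavioral dimension can be predicted from network features: cross-validated ridge regression, scored by the correlation between held-out predictions and the observed values. The abstract does not specify the authors' analysis, so this is an illustration under stated assumptions and not their method. The data are synthetic, and the names `embeddings`, `dimension_scores`, `ridge_fit`, and `cross_validated_prediction` are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge weights for mean-centered X and y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def cross_validated_prediction(features, scores, alpha=1.0, n_folds=5, seed=0):
    """Pearson r between held-out ridge predictions and observed scores."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(scores))
    folds = np.array_split(order, n_folds)
    predicted = np.empty(len(scores), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)
        # Center using training statistics only, to avoid leaking test data.
        x_mean = features[train_idx].mean(axis=0)
        y_mean = scores[train_idx].mean()
        weights = ridge_fit(
            features[train_idx] - x_mean, scores[train_idx] - y_mean, alpha
        )
        predicted[test_idx] = (features[test_idx] - x_mean) @ weights + y_mean
    return np.corrcoef(predicted, scores)[0, 1]


# Synthetic stand-ins (assumption): one row per object, one column per feature.
# In a real analysis, `embeddings` would hold image features from a network and
# `dimension_scores` would hold each object's value on one behavioral dimension.
rng = np.random.default_rng(42)
n_objects, n_features = 80, 16
embeddings = rng.normal(size=(n_objects, n_features))
true_weights = rng.normal(size=n_features)
dimension_scores = embeddings @ true_weights + 0.1 * rng.normal(size=n_objects)

r = cross_validated_prediction(embeddings, dimension_scores)
print(f"held-out Pearson r = {r:.3f}")
```

Repeating this per dimension and per network would give comparable held-out scores across models. The regularization strength `alpha` and the number of folds are arbitrary choices in this sketch.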
