Unsupervised Neural Network Models of the Ventral Visual Stream
Abstract
This presentation reviews the paper “Unsupervised Neural Network Models of the Ventral Visual Stream” by Zhuang et al. Deep Convolutional Neural Networks (DCNNs) have achieved success in approximating the adult primate ventral visual stream, yielding accurate predictive models of neural responses in early (V1), intermediate (V4), and higher (IT) cortical areas. However, supervised methods requiring millions of semantic labels are implausible as models of biological visual development, since infants do not have access to such labels. This work investigates unsupervised contrastive embedding methods, particularly Local Aggregation (LA), which learns representations by minimizing the distance from each embedding to its “close” neighboring points while maximizing its distance to “background” points. The authors demonstrate that these unsupervised methods achieve neural prediction accuracy equal to or exceeding that of supervised approaches across multiple ventral visual cortical areas. Furthermore, they introduce Video Instance Embedding (VIE), an extension of LA that learns from noisy, realistic video data streams (the SAYCam dataset, recorded from head-mounted cameras worn by children), approaching the neural predictivity of ImageNet-trained models. Finally, semi-supervised methods such as Local Label Propagation (LLP) are shown to leverage small numbers of labeled examples to produce representations with improved behavioral consistency with human perception.
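
To make the LA objective concrete, the sketch below illustrates one plausible simplified form of a local-aggregation-style contrastive loss: each embedding is pulled toward its designated “close” neighbors and pushed away from a larger “background” set via a softmax ratio over cosine similarities. The function name `local_aggregation_loss`, the temperature value, and the toy neighbor sets are illustrative assumptions rather than the authors' exact formulation, which derives the neighbor sets dynamically from clustering and nearest-neighbor lookups in embedding space.

```python
import torch
import torch.nn.functional as F

def local_aggregation_loss(embeddings, close_idx, background_idx, temperature=0.07):
    """Simplified LA-style objective (illustrative sketch, not the paper's exact loss).

    For each embedding, maximize the similarity mass assigned to its 'close'
    neighbors relative to its 'background' neighbors.
    """
    # L2-normalize so dot products are cosine similarities
    z = F.normalize(embeddings, dim=1)
    sims = z @ z.t() / temperature  # pairwise similarities, shape (N, N)

    losses = []
    for i in range(z.size(0)):
        close = torch.exp(sims[i, close_idx[i]]).sum()
        background = torch.exp(sims[i, background_idx[i]]).sum()
        # Minimizing -log(close / background) pulls close points in
        # and pushes background points away.
        losses.append(-torch.log(close / (background + 1e-8)))
    return torch.stack(losses).mean()

# Toy usage: 8 random embeddings with arbitrarily chosen neighbor sets
emb = torch.randn(8, 128, requires_grad=True)
close = [[(i + 1) % 8] for i in range(8)]        # one "close" point per embedding
background = [list(range(8)) for _ in range(8)]  # everything treated as background
loss = local_aggregation_loss(emb, close, background)
loss.backward()
print(float(loss))
```

In the full method, the close and background sets would be recomputed periodically as the embedding function is trained, rather than fixed up front as in this toy example.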