Joint Language-Vision Inference in Machines and Humans

Jeffrey Mark Siskind, School of Electrical and Computer Engineering, Purdue University

Wednesday, October 4, 1:00-2:00pm; MCS 148

Abstract: I will present several frameworks for performing joint inference across vision, language, and motor control. The first is a unified cost function relating video, sentences, and a lexicon. Multidirectional inference supports video captioning (producing sentences from video and a lexicon), video retrieval (searching for video given a sentential query and a lexicon), and language acquisition (learning a lexicon from sententially annotated video). The second is a unified cost function relating mobile robot navigation, sentences, and a lexicon. Multidirectional inference supports language acquisition (learning a lexicon from sententially annotated navigational paths), generation (producing sentential descriptions of mobile robot paths driven under teleoperation), and comprehension (automatically driving a mobile robot given sentential descriptions of route plans). The third uses sentential annotation to assist video object codiscovery. Joint inference between video and language can be used to discover objects, without any pretrained object detector models, from a small number of example videos that have been annotated with sentential descriptions but no object bounding boxes. Finally, I will present an investigation of how the human brain performs joint inference between language and vision. fMRI studies allow training computer models to recover semantic content from brain scans. We can train models solely on subjects watching video and use the models to recover semantic content from brain scans of different subjects reading sentences. We can similarly train models solely on subjects reading sentences and use the models to recover semantic content from brain scans of different subjects watching video. The ability to perform cross-modal and cross-subject decoding, as well as the significant overlap in brain regions used by the models, points to a common semantic representation employed by the human brain across modality and subject.

Bio: Jeffrey M. Siskind received the B.A. degree in computer science from the Technion, Israel Institute of Technology, Haifa, in 1979, the S.M. degree in computer science from the Massachusetts Institute of Technology (M.I.T.), Cambridge, in 1989, and the Ph.D. degree in computer science from M.I.T. in 1992. He did a postdoctoral fellowship at the University of Pennsylvania Institute for Research in Cognitive Science from 1992 to 1993. He was an assistant professor at the University of Toronto Department of Computer Science from 1993 to 1995, a senior lecturer at the Technion Department of Electrical Engineering in 1996, a visiting assistant professor at the University of Vermont Department of Computer Science and Electrical Engineering from 1996 to 1997, and a research scientist at NEC Research Institute, Inc. from 1997 to 2001. He joined the Purdue University School of Electrical and Computer Engineering in 2002, where he is currently an associate professor. His research interests include computer vision, robotics, artificial intelligence, neuroscience, cognitive science, computational linguistics, child language acquisition, automatic differentiation, and programming languages and compilers.
