Dissertation Defense

Towards Improving Deep Vision and Language Navigation

Shurjo Banerjee

As roboticists, we look forward to the days in which robots will be ubiquitous in our lives. For this to happen, robots will have to integrate into our lives seamlessly and easily. Fundamental to this integration is the ability to understand and communicate with us using our language. In this work, we are primarily concerned with embodied approaches to this problem and are specifically interested in Vision and Language Navigation (VLN), where agents are tasked with following instructions in complex environments. Much work in VLN has focused on creating agents that follow human instructions in photo-realistic simulators, with the overall goal being the eventual transfer of learned representations and knowledge to real-world robots.

In this work, we ask three fundamental questions regarding the state of VLN: (a) How do you measure the “navigational” abilities of an embodied agent? (b) Is data collected in simulated environments for the task of VLN reflective of data collected in the real world? (c) Are the evaluation paradigms utilized by VLN satisfactory and reflective of how real-world robots would be expected to operate?

To answer the first question, we introduce an open-source suite of experiments designed to analyze the navigational abilities of embodied agents. Here, we define navigation as a measure of an agent’s ability to exploit information from regions of an environment it has seen before. Alongside the suite, we introduce new metrics to analyze an agent’s exploitative abilities. To answer the second question, we introduce the RobotSLANG benchmark, a real-world dataset of human-generated dialog for robotic control. RobotSLANG stands as an alternative to simulator-driven data collection and highlights fundamental differences between data collected in these two paradigms. We introduce the tasks of Localization from Dialog History and Navigation from Dialog History, learning problems for which models can be trained using this data. To answer the third question, we propose an update to the Vision and Language Navigation training and evaluation paradigm that encourages embodied agents to remember long-term consistencies in their environment and exploit this information as they attempt to follow instructions. The new paradigm, Iterative Vision and Language Navigation in Continuous Environments (I-VLN-CE), encourages the creation of models that operate more like robots would be expected to in real-world scenarios. We conclude with a discussion of how VLN can be improved and extended in future work, with the hope that the days in which robots can communicate with us and be a part of our lives draw ever nearer.

Co-Chairs: Dr. Jason J. Corso and Dr. Andrew Owens