Dissertation Defense

Situated Language Grounding for Multimodal AI Assistant Modeling

Yichi Zhang, Ph.D. Candidate
WHERE:
3941 Beyster Building

Hybrid Event: 3941 BBB / Zoom

Abstract: Building multimodal AI assistants that can perceive the physical world, communicate seamlessly with humans, and help with real-world tasks is a cornerstone of AI research. Situated language grounding, the ability to connect language to rich, multimodal contexts, is fundamental for developing assistants that can interpret human instructions, reason about their environment, and provide timely responses. Despite recent advances in large language models (LLMs), situated language grounding to the physical world remains a significant challenge. In this dissertation, I investigate situated language grounding across multiple dimensions, developing novel approaches for creating contextually aware AI assistants.

First, I address vision-language alignment by developing a multimodal large language model (MLLM) capable of holistic visual grounding, linking generated text phrases to pixel-level semantic regions across various granularities. Through a novel architecture and comprehensive dataset curation with multi-grained annotations, our model outperforms prior approaches while providing interpretable insights into grounding failures.

Second, I explore language grounding for embodied decision-making through a neuro-symbolic deliberative agent for following natural language instructions. By constructing rich semantic representations of 3D environments and employing symbolic reasoning with explicit planning, this approach achieves more efficient task execution with greater transparency and robustness.

Finally, I study proactive assistant modeling with continuous visual perception from the user's first-person perspective, addressing language grounding in dynamic situations. This work introduces a novel formulation where assistants determine both when and how to communicate based on evolving visual contexts. I explore two complementary approaches: a dialogue synthesis method enabling situation-aware assistants trained on large amounts of real-world videos, and a simulation-based framework for learning and evaluating grounded human-AI interactions. These advances provide foundations for training MLLMs to deliver real-time situated assistance across diverse real-world scenarios.

The collective contributions in visual grounding, embodied reasoning, and proactive assistance advance situated language understanding, establishing foundations for AI systems that effectively bridge language with the physical world.

Organizer

CSE Graduate Programs Office

Faculty Host

Prof. Joyce Chai