Communications and Signal Processing Seminar
Sparsity for Efficient Long Sequence Generation of LLMs
This event is free and open to the public.
Abstract: Large language models (LLMs) have sparked a new wave of exciting AI applications, but they are computationally expensive at inference time. Sparsity is a natural approach to reducing this cost, but existing methods either require costly retraining, forgo the LLM's in-context learning ability, or do not yield wall-clock speedup on modern hardware. In this talk, I will show how sparsity can help overcome two major bottlenecks in LLM inference, model I/O and KV cache I/O, and unlock the possibility of handling infinitely long sequences.
First, we present Heavy-Hitter Oracle (H2O), a KV cache eviction policy that drastically reduces the memory footprint of these transient states. Our approach is based on the observation that a small portion of tokens, which we call Heavy-Hitters, contributes most of the value when computing attention scores. H2O improves the throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29x, 29x, and 3x, respectively, on OPT-6.7B and OPT-30B. With the same batch size, H2O can reduce the latency by up to 1.9x.
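The eviction idea can be illustrated with a small sketch. This is not the H2O implementation: the function name, the per-token `accumulated_attention` input, and the split of the cache budget into a heavy-hitter part and a recent-token part are all assumptions made for illustration.

```python
def select_kept_tokens(accumulated_attention, heavy_budget, recent_budget):
    """Return sorted indices of cached tokens to keep (illustrative sketch).

    accumulated_attention[i] is assumed to hold the total attention mass
    that cached token i has received so far. The budget split is a
    hypothetical parameterization, not taken from the talk.
    """
    n = len(accumulated_attention)
    if n <= heavy_budget + recent_budget:
        return list(range(n))  # cache still fits, nothing to evict

    # Always keep the most recent tokens.
    recent_start = n - recent_budget
    recent = list(range(recent_start, n))

    # Among older tokens, keep those with the largest accumulated attention.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: accumulated_attention[i], reverse=True)
    heavy = heavy[:heavy_budget]

    return sorted(heavy) + recent


scores = [0.9, 0.05, 0.6, 0.1, 0.02, 0.3, 0.2, 0.1]
kept = select_kept_tokens(scores, heavy_budget=2, recent_budget=3)
print(kept)  # [0, 2, 5, 6, 7]
```

In this toy run, tokens 0 and 2 survive because they received the most attention, while low-scoring older tokens are evicted to keep the cache within budget.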
Then we present StreamingLLM, a simplification of H2O based on a further finding about heavy hitters called the attention sink: keeping only the KV of the initial tokens largely recovers the LLM's performance. It enables LLMs trained with a finite-length attention window to generalize to infinite sequence lengths without any fine-tuning. Specifically, StreamingLLM enables Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with 4 million tokens and more.
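A minimal sketch of an attention-sink cache policy follows. The abstract mentions only the initial tokens; pairing them with a sliding window of recent tokens is an assumption of this sketch, as are the function name and the default sizes.

```python
def streaming_keep(n_cached, num_sink=4, window=8):
    """Return indices of cached tokens to keep (illustrative sketch).

    Keeps the first `num_sink` tokens (the attention sinks) plus the most
    recent `window` tokens. Both parameters and their defaults are
    hypothetical, chosen only for this example.
    """
    if n_cached <= num_sink + window:
        return list(range(n_cached))  # everything still fits
    sinks = list(range(num_sink))
    recent = list(range(n_cached - window, n_cached))
    return sinks + recent


print(streaming_keep(20))
# [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Under this policy the cache size stays fixed at `num_sink + window` entries no matter how long the sequence grows, which is what would make unbounded streaming feasible.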
Finally, we present Dejavu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given the inputs to each layer, along with an asynchronous and hardware-aware implementation that reduces model weight loading I/O. Dejavu can reduce the inference latency of OPT-175B by over 2x compared to the state-of-the-art FasterTransformer, and by over 6x compared to the widely used Hugging Face implementation, without compromising model quality.
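To make contextual sparsity concrete, here is a toy sketch of an MLP layer that evaluates only the neurons a predictor selects for the current input. It is not Dejavu's algorithm: the `predictor` matrix, the top-k selection rule, and the ReLU activation are assumptions, and in this toy the predictor is as expensive as the layer itself, whereas a real predictor would need to be much cheaper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_mlp(x, w_in, w_out, predictor, k):
    """Evaluate a ReLU MLP using only the k neurons the predictor ranks highest.

    w_in[j] and w_out[j] are the input and output weights of neuron j.
    predictor[j] is a hypothetical scoring vector for neuron j.
    Returns (output vector, sorted list of active neuron indices).
    """
    scores = [dot(p, x) for p in predictor]
    active = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

    out = [0.0] * len(w_out[0])
    for j in active:  # only these neurons' weights need to be loaded
        h = max(0.0, dot(w_in[j], x))
        for d in range(len(out)):
            out[d] += h * w_out[j][d]
    return out, sorted(active)


x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [1.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

# Toy assumption: the predictor scores neurons exactly by their pre-activation.
out, active = sparse_mlp(x, w_in, w_out, predictor=w_in, k=3)
print(out, active)  # [7.0, 2.0] [0, 1, 3]
```

Neuron 2 has a negative pre-activation for this input, so skipping it leaves the output identical to the dense computation while avoiding loading its weights.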
Bio: Beidi Chen is an Assistant Professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University. She is a Visiting Research Scientist at FAIR, Meta. Before that, she was a postdoctoral scholar at Stanford University. She received her Ph.D. from Rice University in 2020 and her B.S. from UC Berkeley in 2015. Her research focuses on efficient machine learning. Specifically, she designs and optimizes algorithms and models on modern hardware to accelerate large machine-learning systems. Her work has won the best paper runner-up at ICML 2022, a best paper award at IISA 2018, and a best paper award at USENIX LISA 2014. She was selected as a Rising Star in EECS by MIT in 2019 and UIUC in 2021.
*** The event will take place in a hybrid format. The location for in-person attendance will be room 3427 EECS. Attendance will also be available via Zoom.
Zoom Passcode information is available upon request to Sher Nickrand ([email protected]).
This seminar will be recorded and posted to the CSP Seminar website.