Large Language Models

Fast LLM Serving with vLLM and PagedAttention

September 19, 2:30 PM - 3:00 PM

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow, even on expensive hardware. To address this problem, we are developing vLLM, an open-source library for fast LLM inference and serving. vLLM is built on PagedAttention, our new attention algorithm that efficiently manages attention keys and values. Equipped with PagedAttention, vLLM achieves up to 24x higher throughput than HuggingFace Transformers without requiring any changes to the model architecture. vLLM was developed at UC Berkeley and has powered Chatbot Arena and the Vicuna Demo for the past three months. In this talk, we will discuss the motivation, features, and implementation of vLLM in depth, and present our future plans.
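
To make the serving workflow described above concrete, here is a minimal offline-inference sketch using vLLM's Python API. The model name, prompts, and sampling settings are illustrative placeholders, not details from the talk.

from vllm import LLM, SamplingParams

# Illustrative prompts and sampling settings (placeholders, not from the talk).
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a model; PagedAttention manages the attention key/value cache internally.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in one batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)

The same library also exposes an OpenAI-compatible API server for online serving, which the talk's deployment examples (Chatbot Arena, Vicuna Demo) correspond to.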

About Woosuk

Woosuk Kwon is a CS PhD student at UC Berkeley, where he is advised by Prof. Ion Stoica. His research interest lies in building practical, flexible, and high-performance software systems for emerging applications such as large language models. He completed his BS at Seoul National University.

About Zhuohan

Zhuohan Li is a CS PhD student at UC Berkeley, where he is advised by Prof. Ion Stoica. He is interested in designing and building efficient machine-learning systems. Recently, he has been focusing on the training and serving of large models, specifically LLMs. His work includes Alpa, AlpaServe, Vicuna, and vLLM (PagedAttention). He completed his BS at Peking University and has interned at Microsoft Research, Anyscale, and Google Brain.

Woosuk Kwon

CS PhD Student, UC Berkeley

Zhuohan Li

CS PhD Student, UC Berkeley

Ready to Register?

Come connect with the global community of thinkers and disruptors who are building and deploying the next generation of AI and ML applications.


Join the Conversation

Ready to get involved in the Ray community before the conference? Ask a question in the forums. Open a pull request. Or share why you’re excited with the hashtag #RaySummit on Twitter.