Ray Serve is the cheapest and easiest way to deploy LLMs, and it has served billions of tokens through Anyscale Endpoints. This talk discusses how Ray Serve reduces cost via fine-grained autoscaling, continuous batching, and model-parallel inference, as well as the work we've done to make it easy to deploy any Hugging Face model with these optimizations.
Takeaways:
• Learn how Ray Serve saves costs by using fewer GPUs with fine-grained autoscaling and by integrating with libraries like vLLM to maximize GPU utilization (see the sketch below).
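To make the autoscaling idea concrete, here is a minimal sketch (not code from the talk) of a Ray Serve deployment wrapping a vLLM engine: each replica holds one GPU, vLLM's continuous batching keeps that GPU busy, and Serve's autoscaler adds or removes replicas, down to zero, as traffic shifts. The model id and autoscaling targets are illustrative, and a production server would use vLLM's async engine rather than the blocking generate call shown here.

```python
# Minimal sketch: a Ray Serve deployment wrapping vLLM, with fine-grained
# autoscaling. Values below are illustrative, not recommendations.
from ray import serve
from vllm import LLM, SamplingParams  # vLLM does continuous batching internally


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
    autoscaling_config={
        "min_replicas": 0,              # scale to zero when idle: no idle GPU cost
        "max_replicas": 4,              # cap GPU spend under load
        # Replicas are added once in-flight requests per replica exceed this
        # target (key name per recent Ray releases; older versions use
        # target_num_ongoing_requests_per_replica).
        "target_ongoing_requests": 8,
    },
)
class LLMServer:
    def __init__(self):
        # Any Hugging Face model id works here; a small model keeps the sketch cheap.
        self.llm = LLM(model="facebook/opt-125m")
        self.params = SamplingParams(max_tokens=128)

    async def __call__(self, request) -> str:
        prompt = (await request.json())["prompt"]
        outputs = self.llm.generate([prompt], self.params)
        return outputs[0].outputs[0].text


app = LLMServer.bind()  # start with serve.run(app), or: serve run module:app
```

With this setup, the replica count follows request load instead of a fixed GPU reservation, which is where the cost savings in the talk come from.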
Tanmay Chordia is an engineer building tools to take ML models to production. He works on Anyscale Services and Ray Serve.
Tanmay writes software to help organizations derive business value from cutting-edge ML by bridging the gap between research and production. He focuses on designing robust and user-friendly systems for real-time serving.
Cade Daniel is a software engineer at Anyscale working on Ray and LLMs. Previously, he helped build the communication engine for training large language models using AWS SageMaker's model parallelism library. Outside of work, he enjoys sipping a good latte while liking hot takes on ML/AI Twitter.
Come connect with the global community of thinkers and disruptors who are building and deploying the next generation of AI and ML applications.