Ray Serve is the cheapest and easiest way to deploy LLMs, and it has served billions of tokens through Anyscale Endpoints. This talk discusses how Ray Serve reduces cost via fine-grained autoscaling, continuous batching, and model-parallel inference, as well as the work we've done to make it easy to deploy any Hugging Face model with these optimizations.
Takeaways:
• Learn how Ray Serve saves costs by using fewer GPUs with fine-grained autoscaling and by integrating with libraries like vLLM to maximize GPU utilization (see the sketch below).
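To make the autoscaling idea concrete, here is a minimal sketch (not code from the talk) of a Ray Serve deployment wrapping a vLLM engine: each replica holds one GPU, vLLM's continuous batching keeps that GPU busy, and Serve's autoscaler adds or removes replicas, down to zero, as traffic shifts. The model id and autoscaling targets are illustrative, and a production server would use vLLM's async engine rather than the blocking generate call shown here.

```python
# Minimal sketch: a Ray Serve deployment wrapping vLLM, with fine-grained
# autoscaling. Values below are illustrative, not recommendations.
from ray import serve
from vllm import LLM, SamplingParams  # vLLM does continuous batching internally


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
    autoscaling_config={
        "min_replicas": 0,              # scale to zero when idle: no idle GPU cost
        "max_replicas": 4,              # cap GPU spend under load
        # Replicas are added once in-flight requests per replica exceed this
        # target (key name per recent Ray releases; older versions use
        # target_num_ongoing_requests_per_replica).
        "target_ongoing_requests": 8,
    },
)
class LLMServer:
    def __init__(self):
        # Any Hugging Face model id works here; a small model keeps the sketch cheap.
        self.llm = LLM(model="facebook/opt-125m")
        self.params = SamplingParams(max_tokens=128)

    async def __call__(self, request) -> str:
        prompt = (await request.json())["prompt"]
        outputs = self.llm.generate([prompt], self.params)
        return outputs[0].outputs[0].text


app = LLMServer.bind()  # start with serve.run(app), or: serve run module:app
```

With this setup, the replica count follows request load instead of a fixed GPU reservation, which is where the cost savings in the talk come from.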
Tanmay Chordia is an engineer building tools to take ML models to production. He works on Anyscale Services and Ray Serve.
Tanmay writes software to help organizations derive business value from cutting-edge ML by bridging the gap between research and production. He focuses on designing robust and user-friendly systems for real-time serving.
Cade Daniel is a software engineer at Anyscale working on Ray and LLMs. Previously, he helped build the communication engine for training large language models using AWS SageMaker's model parallelism library. Outside of work, he enjoys sipping a good latte while liking hot takes on ML/AI Twitter.
Come connect with the global community of thinkers and disruptors who are building and deploying the next generation of AI and ML applications.