Ray Deep Dives

Best Practices for Productionizing Distributed Training with Ray Train

September 19, 1:00 PM - 1:30 PM

With the emergence of large models, multi-node distributed training has become the norm. Given the scale and complexity of workloads like large language model training, distributed training can run into a wide variety of issues, such as out-of-memory (OOM) errors or storage failures. The more nodes a training job involves, the more likely it is that one of them fails, so fault tolerance becomes all the more important for machine learning training. Fault tolerance also offers a way to cut costs, by making spot instances viable and preserving training progress when failures occur. In this tutorial, we will walk through how to enable fault tolerance with Ray Train, covering experiment restoration, recovery from individual node failures, snapshotting experiment state to persistent cloud storage, and large model checkpointing. We will provide a set of simple additions you can incorporate into your Ray Train training application to get all the benefits of fault-tolerant model training.

About Justin

Justin is a software engineer at Anyscale, where he works on Ray AI Libraries. He is interested in making scalable AI more user-friendly and accessible, and he also has a passion for teaching and creating educational content. Prior to Anyscale, Justin graduated with a B.S. from UC Berkeley, where he did research on real-world robotic manipulation with reinforcement learning.

Justin Yu

Software Engineer, Anyscale

Ready to Register?

Come connect with the global community of thinkers and disruptors who are building and deploying the next generation of AI and ML applications.


Join the Conversation

Ready to get involved in the Ray community before the conference? Ask a question in the forums. Open a pull request. Or share why you’re excited with the hashtag #RaySummit on Twitter.