With the emergence of large models, multi-node distributed training has become the norm. Given the scale and complexity of workloads like large language model training, distributed training can run into a wide variety of issues, from out-of-memory (OOM) errors to storage failures. The more nodes involved in a training job, the greater the chance that one of them fails, which makes fault tolerance for machine learning training all the more important. Fault tolerance also offers a way to cut costs by using spot instances while preserving training progress in the event of preemptions or failures. In this tutorial, we will walk through how to enable fault tolerance with Ray Train, covering experiment restoration, recovery from individual node failures, persistent cloud storage for snapshotting experiment state, and large model checkpointing. We will provide a set of simple additions you can incorporate into your Ray Train training application to get all the benefits of fault-tolerant model training.
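To give a flavor of the kind of additions the tutorial discusses, here is a minimal sketch using Ray 2.x Train APIs. The storage bucket path, run name, epoch count, and the toy model are hypothetical stand-ins, and the actual tutorial code may differ.

```python
import os
import tempfile

import torch
import ray.train
from ray.train import (
    Checkpoint, CheckpointConfig, FailureConfig, RunConfig, ScalingConfig
)
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    model = torch.nn.Linear(10, 1)  # placeholder model for illustration
    start_epoch = 0

    # Resume from the latest checkpoint if the run was restarted after a failure.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "model.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["num_epochs"]):
        # ... run one epoch of training here ...

        # Snapshot state so progress survives node failures or preemptions.
        with tempfile.TemporaryDirectory() as tmp_dir:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmp_dir, "model.pt"),
            )
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp_dir),
            )


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=RunConfig(
        name="fault-tolerant-run",
        # Persistent cloud storage for experiment snapshots (hypothetical bucket).
        storage_path="s3://my-bucket/experiments",
        # Automatically retry the run on worker or node failures.
        failure_config=FailureConfig(max_failures=3),
        checkpoint_config=CheckpointConfig(num_to_keep=2),
    ),
)
result = trainer.fit()
```

With a non-zero `max_failures`, Ray Train restarts failed workers, and each worker can pick up from the latest checkpoint persisted to the configured storage path rather than starting from scratch.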
Justin is a software engineer at Anyscale, where he works on Ray AI Libraries. He is interested in making scalable AI more user-friendly and accessible, and he also has a passion for teaching and creating educational content. Prior to Anyscale, Justin graduated with a B.S. from UC Berkeley, where he did research on real-world robotic manipulation with reinforcement learning.