AI/ML Platform & Applications

Heterogeneous Training Cluster with Ray at Netflix

September 18, 1:45 PM - 2:15 PM

At Netflix, machine learning algorithms are at the heart of many use cases, including recommendations, content understanding, content demand modeling, trailer and artwork generation, and other content creation workflows. Scaling these use cases to entertain our members can benefit significantly from deep learning techniques. The Machine Learning Platform team at Netflix builds the infrastructure and tools that make machine learning practitioners across the company more effective. We constantly strive to ensure that our machine learning models are trained and deployed in a reliable, scalable, and robust way.

Deep learning models have grown in complexity, requiring significantly more computational resources to train. In this talk, we explore the benefits of using Ray to build a heterogeneous training cluster and discuss the steps involved in setting one up. We demonstrate how to run distributed training jobs on a cluster with a mix of CPU and GPU instances, and show how Ray's automatic resource allocation and management can facilitate the scheduling of different types of workers. Additionally, we discuss the challenges and considerations that come with building and managing persistent Ray clusters, and provide best practices for effective cluster configuration and management.

About Pablo

Pablo Delgado is a Machine Learning Engineer. He currently works on building the training infrastructure for the Machine Learning Platform at Netflix that powers personalized recommendation algorithms and content/media production. Previously, he worked on the recommendation systems stack for personalized restaurant recommendations at OpenTable. Pablo obtained a degree in Mathematics and Computer Science from University College London, United Kingdom, where he worked on graph-based methods for collaborative filtering.

About Lingyi

Lingyi Liu is a Machine Learning System Tech Lead who works on building and optimizing the performance of large-scale model training and inference platforms at Netflix for recommendation and media applications. Prior to that, he worked on Meta's PyTorch/Caffe2 team, contributing to PyTorch development and optimizing and productionizing a variety of PyTorch deep learning models at large scale. Before that, he worked at Synopsys on electronic design automation software. Lingyi completed his PhD in 2014 at the University of Illinois at Urbana-Champaign, where he worked on applying machine learning to system design and verification.

Pablo Delgado

Machine Learning Engineer, Netflix

Lingyi Liu

Software Engineer, Netflix


