At Netflix, machine learning algorithms are at the heart of use cases such as recommendations, content understanding, content demand modeling, trailer and artwork generation, and other content creation work. Many of these use cases can benefit significantly from deep learning techniques as we scale them to entertain our members. The Machine Learning Platform team at Netflix is tasked with building the infrastructure and tools that make machine learning practitioners across the company more effective. We constantly strive to ensure that our machine learning models are trained and deployed in a reliable, scalable, and robust way.
Deep learning models have grown in complexity, requiring significantly more computational resources to train. In this talk, we explore the benefits of using Ray to build a heterogeneous training cluster and discuss the steps involved in setting one up. We demonstrate how to run distributed training jobs on a cluster with a mix of CPU and GPU instances, and show how Ray's automatic resource allocation and management can facilitate the scheduling of different types of workers. Additionally, we discuss the challenges and considerations that come with building and managing persistent clusters using Ray, and provide best practices for effective cluster configuration and management.
Pablo Delgado is a Machine Learning Engineer. He currently works on building the training infrastructure for the Machine Learning Platform at Netflix, which powers personalized recommendation algorithms and content/media production. Previously, he worked on the recommendation systems stack for personalized restaurant recommendations at OpenTable. Pablo obtained a degree in Mathematics and Computer Science from University College London, United Kingdom, where he worked on graph-based methods for collaborative filtering.
Lingyi Liu is a Machine Learning System Tech Lead who builds and optimizes the performance of large-scale model training and inference platforms at Netflix for recommendation and media applications. Prior to that, he worked on Meta's PyTorch/Caffe2 team, contributing to PyTorch development and optimizing and productionizing a variety of PyTorch deep learning models at large scale. Before that, he worked at Synopsys on electronic design automation software. Lingyi completed his PhD at the University of Illinois at Urbana-Champaign in 2014, working on applying machine learning to system design and verification.