The most common method for deploying Ray on a Kubernetes cluster is KubeRay, which requires installing CRDs as a prerequisite. The KubeRay operator, a critical component of the deployment, manages Ray cluster resources by watching Kubernetes events for their creation, deletion, and updates. While the KubeRay operator can run within a single namespace, the CRDs are cluster-scoped. Many companies, however, need to deploy Ray clusters without Kubernetes admin permissions. Our talk covers how to deploy a Ray cluster for serving and inference on an air-gapped Kubernetes cluster with tight security controls.
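Because this approach needs neither CRDs nor cluster-wide permissions, a plain Deployment and Service inside a single namespace are enough to stand up the Ray head. The following is a minimal sketch on our part, not the exact manifests from the talk; the namespace, names, image tag, and port list are illustrative assumptions.

```yaml
# Namespace-scoped, non-KubeRay Ray head: no CRDs, no cluster-wide RBAC.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-head
  namespace: ray                 # everything stays inside one namespace
spec:
  replicas: 1
  selector:
    matchLabels: {app: ray-head}
  template:
    metadata:
      labels: {app: ray-head}
    spec:
      containers:
        - name: ray-head
          image: rayproject/ray:2.0.0
          command: ["ray", "start", "--head", "--block",
                    "--port=6379", "--dashboard-host=0.0.0.0"]
          ports:
            - containerPort: 6379    # GCS
            - containerPort: 8265    # dashboard
            - containerPort: 10001   # Ray client
---
apiVersion: v1
kind: Service
metadata:
  name: ray-head
  namespace: ray
spec:
  selector: {app: ray-head}
  ports:
    - {name: gcs, port: 6379}
    - {name: dashboard, port: 8265}
    - {name: client, port: 10001}
```

Worker pods are defined the same way, started with `ray start --address=ray-head:6379`.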
Historically, one limitation of Ray was that if the Ray head crashed, the cluster metadata stored in the head's in-memory Global Control Service (GCS) was lost, and the workers lost their connections. This posed a challenge for applications with long-running tasks and high-availability requirements. Ray 2.0.0 addressed this problem with GCS fault tolerance (GCS FT). Available in KubeRay-managed Ray clusters, GCS FT ensures that even if the Ray head fails, the metadata is not lost and the cluster can recover from the failure without losing the workers.
In our talk, we will delve into how to configure the GCS to write through to an external Redis for fault tolerance in a non-KubeRay cluster. We will also cover how to set the readiness and liveness probes for the Ray pods, and demonstrate how to configure them to keep the pods functioning properly, providing an effective and reliable solution for applications with high-availability requirements.
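As a rough sketch of both pieces, the fragment below extends the head container from the earlier manifest: it sets RAY_REDIS_ADDRESS so the GCS writes its metadata through to an external Redis, and points the probes at the health endpoints exposed by the Ray dashboard and its agent. The Redis address and probe timings are our assumptions, and Redis authentication is omitted.

```yaml
# Fragment of the head pod spec (illustrative values throughout).
containers:
  - name: ray-head
    image: rayproject/ray:2.0.0
    command: ["ray", "start", "--head", "--block"]
    env:
      # Write GCS metadata through to an external Redis so it survives
      # a head crash (this address is an assumption for the sketch).
      - name: RAY_REDIS_ADDRESS
        value: "redis.ray.svc.cluster.local:6379"
    readinessProbe:
      httpGet:
        path: /api/gcs_healthz           # GCS health, served by the dashboard
        port: 8265
      initialDelaySeconds: 10
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /api/local_raylet_healthz  # raylet health, via the dashboard agent
        port: 52365
      initialDelaySeconds: 30
      periodSeconds: 5
      failureThreshold: 120
```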
Using Ray for internal model training is one thing, but deploying it in a production environment, particularly in a customer's on-prem environment, is an entirely different story. In such environments the Kubernetes network policy is often "deny-all", making it challenging to set up communication between the Ray head and the workers. In our talk, we will discuss how to overcome this challenge and configure the network policies for the Ray head and workers effectively.
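As a minimal sketch of one possible starting point (the labels and namespace are our assumptions), a NetworkPolicy that lets the Ray pods reach each other on any port sidesteps the fact that the head and workers also communicate over dynamically assigned object-manager, node-manager, and worker ports:

```yaml
# Under a default deny-all policy: allow Ray pods to talk to each other.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ray-internal
  namespace: ray
spec:
  podSelector:
    matchLabels:
      ray-cluster: example        # label carried by both head and worker pods
  ingress:
    - from:
        - podSelector:
            matchLabels:
              ray-cluster: example
  egress:
    - to:
        - podSelector:
            matchLabels:
              ray-cluster: example
    # A real deny-all setup also needs an egress rule to kube-dns for
    # name resolution (omitted here).
  policyTypes: [Ingress, Egress]
```

If the security team requires an explicit port list instead, the dynamic ports can be pinned with `ray start` flags such as `--node-manager-port`, `--object-manager-port`, and `--min-worker-port`/`--max-worker-port`, then enumerated in the policy.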
Furthermore, companies are required to encrypt communications between services to meet compliance standards. We will share how to set up the mesh manager to build encrypted TLS connections between the Ray cluster and other services, ensuring the secure transmission of data and compliance with regulatory requirements.
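As a sketch, assuming the mesh in question is Istio (our assumption; the talk's "mesh manager" may differ), a namespace-level PeerAuthentication resource enforces mutual TLS for all sidecar-injected Ray pods:

```yaml
# Reject any plaintext traffic to pods in the Ray namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: ray-mtls
  namespace: ray
spec:
  mtls:
    mode: STRICT
```

Terminating TLS in the sidecars keeps the Ray processes themselves free of certificate handling, which is convenient in an air-gapped environment where the mesh manages certificate issuance and rotation.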
Yiqing Wang is a machine learning infrastructure engineer with four years of experience. Previously, he worked as a senior software engineer on the LogMeIn ML platform team, contributing to the development of the machine learning platform and feature store. He is currently a software engineer at Instabase, leading the project to adopt Ray for model training and serving. He has also contributed to several open-source projects, including:
MLflow: integrated SageMaker Batch Transform with MLflow and enabled pushing models from MLflow to the AWS SageMaker Model Registry
Airbyte: added the AWS DynamoDB connector for the data platform
Ruoyu has a broad background in machine learning systems. His experience includes building an Automatic Speech Recognition (ASR) service, constructing and running the training pipeline for machine translation models, and establishing an end-to-end machine learning platform. His eagerness to keep expanding his machine learning knowledge and skills reflects his passion for applying them to difficult real-world problems.
Past Experience:
Built a speech recognition service at AWS
Built a machine learning platform at Uber