In this presentation, we discuss the implementation of a computer vision model using the Ray open-source framework and compare its training performance against another popular distributed training framework, Kubeflow. Our goal was to evaluate the cost-effectiveness, training speed, GPU utilization, and throughput of training the machine learning model with both frameworks.
Ray is an open-source framework designed to make it easy to scale and optimize distributed Python applications. We chose to evaluate Ray for this project because it has built-in support for distributed training of machine learning models and can read training data from inexpensive object storage, Amazon S3, which can be more cost-effective than Amazon Elastic File System (EFS).
To compare the performance of Ray and Kubeflow, we trained our computer vision model on a large production dataset using both frameworks. We measured the training time, cost, and throughput for each framework and found that Ray was faster and more cost-effective than Kubeflow, even though the Ray setup read data from S3 while the Kubeflow setup used EFS.
One of the key reasons for Ray's performance improvement is its use of distributed training. By distributing the workload across multiple nodes, Ray was able to complete the training process faster and at a lower cost than Kubeflow. We also used inexpensive computational optimizations to process data quickly from S3.
We used Ray Dataset to parallelize data I/O, preprocessing, and ingestion into GPU trainers at scale. In our experiments, trainer GPUs reached the desired saturation with image data stored on commodity S3 storage, with the data workload running on heterogeneous, commodity CPU workers for the best cost efficiency.
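The idea of fanning data loading and preprocessing out to CPU workers, then handing ordered batches to a trainer, can be illustrated with a small standard-library sketch. This is not Ray's API or the pipeline used in the talk: `load_and_preprocess`, `parallel_batches`, and the fake "image" data are hypothetical stand-ins chosen so the example runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Sequence


def load_and_preprocess(key: str) -> List[float]:
    """Hypothetical stand-in for fetching one image from object storage
    and normalizing it. Here the 'pixels' are just the key's character codes."""
    raw = [ord(c) for c in key]        # pretend these bytes came from S3
    return [v / 255.0 for v in raw]    # scale values into [0, 1]


def parallel_batches(
    keys: Sequence[str], batch_size: int, num_workers: int = 4
) -> Iterator[List[List[float]]]:
    """Run load_and_preprocess on a pool of workers and yield fixed-size batches."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        batch: List[List[float]] = []
        # pool.map returns results in input order, even if workers finish out of order.
        for sample in pool.map(load_and_preprocess, keys):
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:                      # final, possibly smaller, batch
            yield batch


if __name__ == "__main__":
    keys = [f"img_{i}.jpg" for i in range(10)]
    for batch in parallel_batches(keys, batch_size=4, num_workers=3):
        print(len(batch))              # a trainer would consume the batch here
```

In a real deployment the workers would be separate CPU nodes rather than threads in one process, which is the part a framework such as Ray takes care of.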
Ray Dataset also has built-in support for full streaming execution, which lowers the memory and compute requirements for training data at larger scale. With optimizations such as vectorized preprocessing UDFs, batching, and prefetching, we show that Ray AIR can effectively handle and scale computer vision workloads with large amounts of training data without sacrificing performance.
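As a rough illustration of how streaming, batching, and prefetching fit together, the sketch below uses only the Python standard library. It is not how Ray implements streaming execution; `batched`, `normalize_batch`, `prefetch`, and `stream` are hypothetical names, and the batch-level function stands in for what would normally be a vectorized array operation.

```python
import queue
import threading
from typing import Iterable, Iterator, List

_DONE = object()  # sentinel marking the end of the stream


def batched(items: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a stream of items into lists of at most batch_size elements."""
    batch: List[int] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def normalize_batch(batch: List[int]) -> List[float]:
    """Batch-level UDF: called once per batch instead of once per item."""
    return [v / 255.0 for v in batch]


def prefetch(batches: Iterable[List[float]], depth: int = 2) -> Iterator[List[float]]:
    """Prepare up to `depth` batches in a background thread while the consumer works."""
    buf: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        try:
            for b in batches:
                buf.put(b)         # blocks when the buffer is full, bounding memory
        finally:
            buf.put(_DONE)         # always signal the end; errors are not re-raised here

    threading.Thread(target=producer, daemon=True).start()
    while True:
        b = buf.get()
        if b is _DONE:
            return
        yield b


def stream(source: Iterable[int], batch_size: int, depth: int = 2) -> Iterator[List[float]]:
    """Lazily batch, preprocess, and prefetch; the full dataset is never held in memory."""
    return prefetch((normalize_batch(b) for b in batched(source, batch_size)), depth)


if __name__ == "__main__":
    for batch in stream(iter(range(10)), batch_size=4):
        print(len(batch))
```

Because the buffer is bounded, only `depth` preprocessed batches are buffered ahead of the consumer (plus the one the producer is preparing), which is the property that keeps memory use flat as the dataset grows.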
In conclusion, our project demonstrates certain benefits of using Ray for computer vision applications. By combining its distributed training capabilities with cost-effective S3 data storage, we achieved better cost-effectiveness and training speed than a Kubeflow setup backed by more expensive EFS storage.
These findings can have important implications for businesses and organizations that rely on computer vision models to drive their operations.
David is a Machine Learning DevOps Engineer on the Caper team at Instacart, where he supports computer vision systems, including infrastructure and software development. Previously, David was a team lead at Acronis, where he helped implement machine learning infrastructure projects. Prior to that, he worked at startups. He holds an M.Sc. in Computer Process Control Engineering from the University of Alberta.
Outside of work, you can find David preparing for his next marathon or studying Chinese, Spanish, and Korean, working towards his goal of becoming a polyglot and helping educate others via his YouTube channel.