Data loading and preprocessing can easily become the performance bottleneck in ML training pipelines. Preprocessing requirements are also growing more complex as the types of data being processed become more diverse. With Ray Data, data loading can be fast, flexible, and scalable. In this talk, we’ll dive into the performance of different open-source data loader solutions. We’ll show how Ray Data can match PyTorch DataLoader and tf.data in performance on a single node, while also providing advanced features necessary for scale, such as in-memory streaming, automatic recovery from out-of-memory failures, and support for heterogeneous clusters.
Takeaway:
• Ray Data provides a combination of speed, scale, and flexibility unmatched by other open-source data loaders.
Stephanie is a software engineer at Anyscale, a Ray committer, and an author of Ray core. She is working on problems related to data processing and distributed execution with Ray. In fall ’24, she will join the computer science faculty at the University of Washington.
Scott Lee is a software engineer at Anyscale, currently working on the Ray Data team. Prior to joining Anyscale, he explored problems in growth engineering at Lyft and pursued a Master's degree at UC Berkeley.
Come connect with the global community of thinkers and disruptors who are building and deploying the next generation of AI and ML applications.