Ray Deep Dives

Fast, Flexible, and Scalable Data Loading for ML Training with Ray Data

September 18, 2:30 PM - 3:00 PM

Data loading and preprocessing can easily become the performance bottleneck in ML training pipelines. Preprocessing requirements are also growing more complex as the types of data being processed become more diverse. With Ray Data, data loading can be fast, flexible, and scalable. In this talk, we’ll compare the performance of different open-source data loader solutions. We’ll show how Ray Data can match PyTorch DataLoader and tf.data in performance on a single node, while also providing advanced features necessary for scale, such as in-memory streaming, automatic recovery from out-of-memory failures, and support for heterogeneous clusters.

Takeaway:

• Ray Data provides a combination of speed, scale, and flexibility unmatched by other open-source data loaders.

About Stephanie

Stephanie is a software engineer at Anyscale, a Ray committer, and an author of Ray core. She is working on problems related to data processing and distributed execution with Ray. In fall ’24, she will join the computer science faculty at the University of Washington.

About Scott

Scott Lee is a software engineer at Anyscale, currently working on the Ray Data team. Prior to joining Anyscale, he explored problems in growth engineering at Lyft and pursued a Master's degree at UC Berkeley.

Stephanie Wang

Software Engineer, Anyscale

Scott Lee

Software Engineer, Anyscale


