Lightning Talks

Building an Instant-On Serverless Platform for Large-Scale Data Processing Using Ray

September 18, 5:45 PM - 6:00 PM

AWS Glue has pioneered the automation of ETL processes by providing a fully managed, serverless data integration service. The service gives customers a simple and cost-effective way to categorize, clean, enrich, and move their data swiftly and reliably between various data stores. AWS Glue is made up of a Data Catalog (i.e., a metadata store), sophisticated ETL engines with automated code generation, and visual interfaces that let every persona perform ETL tasks. AWS Glue customers use Apache Spark and Python engines for data integration.

Our Python customers have asked for a way to scale their Python workloads over large datasets. To enable these use cases, AWS Glue added support for Ray.io (http://ray.io/) and launched AWS Glue for Ray. AWS Glue for Ray gives data engineers a distributed, Pythonic data analytics platform for performing data integration at scale with Ray Core APIs. Using Ray's powerful task and actor abstractions, we were able to horizontally scale Python workloads. The simple distributed collection APIs provided by Ray Datasets helped our Python customers perform ETL operations efficiently on very large datasets. Notably, we launch Ray clusters on ARM-based platforms, with workers that use IPv6 addressing. Data engineers are comfortable with Pandas and, given that popularity, we integrated Modin at scale with Ray. We will cover our experiences with Ray Datasets and distributed Pandas at scale. We will also talk about the innovations we made while integrating with Ray's robust cluster manager and demand-based autoscaler to offer an instant-on, interactive, easy-to-use serverless Ray platform for our customers.

About Japson

Japson Jeyasekaran is a senior software engineer in AWS Glue, building the serverless infrastructure platform for various data analytics engines.

About Harish

Harish Sitaraman is an engineering manager at AWS. His teams own AWS Glue for Ray and the serverless platform that powers AWS Glue workloads. Prior to AWS Glue, Harish held multiple management and technical leadership roles at AWS Networking and Juniper Networks.

Japson Jeyasekaran

Sr Software Engineer, AWS

Harish Sitaraman

Software Development Manager, AWS


