In this talk I will describe Quokka, a distributed fault-tolerant pipelined SQL query engine built on top of Ray. Quokka uses Ray as a cluster manager similar to how Spark uses Yarn. Quokka delegates single node compute kernels to Polars and Apache Arrow, while relying on Redis for global coordination. To learn more about Quokka, please see Robert's tweet: https://twitter.com/robertnishihara/status/1609014781540798466?s=20&t=Lv5iOWYroW-ZKal_qOUTBQ, and the Github repo linked.
I will first describe the motivations of Quokka, and how it's designed to support feature engineering workloads on time-series data that current query engines don't support very well. These include esoteric window computations and special streaming joins typically seen in real-time systems for finance, recommendation or cybersecurity.
Typically, testing these real-time applications on historical data is difficult: people run historical data as streams through the streaming engine (e.g. Flink), which is not optimized for good performance on static data. Quokka speeds up these computations on historical data to allow rapid iteration of streaming pipelines, feature backfilling, and fast streaming failure recovery.
I will focus my talk on how Quokka uses Ray. I will showcase how the simplicity and stability of Ray allows Quokka's core execution engine to be only less than 1000 lines of Python. I will also describe alternative Ray designs that we considered and why we ended up on our current architecture, where long running Ray actors handle user-defined Quokka tasks.
I will end with a list of feature requests for Ray (which might be addressed before the talk even!), for example automatic swapping out of actors to disk instead of running out of memory.
Key takeaways:
Ray is amazing. Its flexibility allows you to build distributed applications limited only by your imagination. Even if your current design doesn't quite work, you will eventually be able to use Ray to find a working design.
Ray offers a lot of functionalities out of the box, like object communication and fault tolerance. However depending on your application you might want to build your own components while using the rest of Ray.
Stream processing is hard -- the hardest part is testing your application on historical data. By offering a unified stream/batch API with custom optimizations, Quokka speeds up common stream processing primitives on historical data.
I'm a PhD student at Stanford University. To see what I am working on, please check out Dr. Robert Nishihara's tweet: https://twitter.com/robertnishihara/status/1609014781540798466?s=20&t=fm1Rea0g0xiKqI7haBhJ0g.
Come connect with the global community of thinkers and disruptors who are building and deploying the next generation of AI and ML applications.