During the COVID pandemic, many governments around the world turned to AI to make sense of the uncertainty posed by successive waves of infections.
In the UK, the National Health Service (NHS) was particularly concerned that hospitals in any of the 219 trusts around the country might run out of capacity.
This led to the Early Warning System (EWS) project, run by Faculty, a leading UK artificial intelligence company, which developed a state-of-the-art Bayesian hierarchical model to capture the range of possible short-term risks facing hospitals.
The Early Warning System ran hundreds of parallel batch processes nightly, with an additional set of jobs to explain and calibrate model predictions using thousands of cores. This was critical to AI safety but posed serious scaling challenges.
During the first year of the Early Warning System's operation, there were two catastrophic outages relating to the overuse of Kubernetes jobs.
Faculty reached out to us at Treebeardtech to resolve this MLOps issue, resulting in a multi-year partnership.
An ambitious re-architecture of the system ultimately led us to adopt Ray Core and KubeRay, resulting in a highly stable architecture and a set of guidelines that we can share with you.
This talk gives an overview of how we safely integrated Ray into a business-critical, real-world system and complemented it with the AWS Karpenter autoscaler to build a powerful platform for Bayesian modelling.
Alex Remedios is a site reliability engineer and founder of Treebeardtech, an MLOps company based in the UK.
Alex has a background in distributed systems, working as an engineer at Microsoft, Amazon, and Improbable.
Alex currently focuses on helping AI teams deploy models reliably in the cloud and on supporting open-source communities in the data science and machine learning space.