Millions of critical real-time decisions are made each day by online machine learning models at Lyft, shaping how riders move and how drivers earn. To enable these decisions efficiently at scale, we grappled with several technical challenges: How do we design a serving system that performs model inference within single-digit millisecond latencies at a throughput of 1,000,000+ requests per second? How do we make such a system support model sizes from a few kilobytes to gigabytes, with model update periods as short as a couple of minutes? How do we empower 40+ teams, with use cases spanning fraud detection, pricing, safety, ETAs, and more, to use any modeling library they choose so they can ship effective models fast, without constraints? We built LyftLearn Serving, a scalable, flexible, distributed online model serving system, to overcome these challenges. In this talk, we give an overview of the online model serving requirements at Lyft that drove us to build LyftLearn Serving. We showcase the techniques we used to tackle these challenges and achieve a low-latency, high-throughput model serving system powering products across 40+ teams. We also present the design decisions we made in LyftLearn Serving for efficiently versioning, deploying, testing, and monitoring ML models, and describe tradeoffs that can help and inspire MLOps practitioners building similar systems.
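One common way to support "any modeling library" behind a single serving system is a thin, library-agnostic model interface that each team implements for their framework of choice. The sketch below is purely illustrative: the `ServableModel` class, its method names, and the toy model are hypothetical and are not LyftLearn Serving's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class ServableModel(ABC):
    """Hypothetical library-agnostic contract a serving system could
    require from every model, regardless of the underlying framework
    (scikit-learn, XGBoost, PyTorch, custom code, etc.)."""

    @abstractmethod
    def load(self, artifact_path: str) -> None:
        """Load model weights/artifacts from storage."""

    @abstractmethod
    def predict(self, features: List[Any]) -> List[Any]:
        """Run inference on a batch of feature rows."""


class DoublingModel(ServableModel):
    """Toy implementation used only to show the contract in action."""

    def load(self, artifact_path: str) -> None:
        # A real model would deserialize weights here; the toy model
        # has nothing to load.
        self.loaded = True

    def predict(self, features: List[Any]) -> List[Any]:
        # Trivial "model": double every input value.
        return [2 * f for f in features]


model = DoublingModel()
model.load("s3://models/toy/v1")  # hypothetical artifact path
print(model.predict([1, 2, 3]))   # → [2, 4, 6]
```

Because the serving layer only depends on the interface, model versioning, deployment, and monitoring can be built once and reused across frameworks, which is the kind of tradeoff the talk discusses.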
Session Summary
Powering Millions of Real-time Decisions with Distributed Model Serving
MLconf 2023 New York City
Hakan Baba
Lyft
Staff Software Engineer
Mihir Mathur
Lyft
Product Manager, ML Platform