Democratizing Production Machine Learning
By Vid Jain, CEO & Founder at Wallaroo.AI
When companies start their AI/ML journeys, they often focus on gathering and cleansing data and developing ML models around an initial use case or two. The strategy for actually taking a model from prototype to production, and then driving it to a business outcome with measurable ROI, is often left to the last moment. This lack of planning and preparation, and the ad-hoc approach to operationalizing ML that results, is responsible for poor ROI even when organizations have high-quality data and model ideas.
The model itself is only a small part of embedding AI into the business. A large amount of process, tooling, and infrastructure “plumbing” is required to deploy ML models and to create a feedback loop that keeps them relevant as the world changes. Too often this is tackled in a non-repeatable, non-scalable way for each individual model, with weeks spent reengineering a data scientist’s Jupyter notebook into a production-grade environment.
Only a fraction of a production ML system is the ML code itself (the small box in the middle). The rest is the complex infrastructure and operations required to run it correctly in production and provide feedback as the environment changes. From Hidden Technical Debt in Machine Learning Systems, by D. Sculley, et al.
In 2006 I joined a newly formed automated and high-frequency trading group inside a major Wall Street firm. Our goal was both to renovate the “high-touch” manual trading business and to build a large high-frequency trading business on data and algorithms. Our success would be measured by profitability. That experience made me realize that good ideas and algorithms were not enough. They often died en route to production, or while running in it. Some were not viable simply because of the compute cost of running them. There was also inherent business risk in everything we did that we had to understand and monitor. Finally, we couldn’t scale the business without scaling the number and variety of models, and that brought its own management overhead.
Treating each individual model as a pet, requiring weeks or months of manual labor to get into production and its own ongoing special care, would not let us scale our business, diversify our risk, or deliver a high ROI. We needed industrial processes and tools to deploy, monitor, and optimize our many algorithms at scale.
This is the same challenge facing enterprises across industries: while 85% of CEOs identify AI as a strategic imperative, only 10% of AI initiatives generate any substantial ROI. In fact, only about half get past the prototype stage, and the few that do reach production eventually become obsolete as the world changes. The traditional approach to ML operations makes deploying and managing each model highly time-intensive and manual. The largest enterprises often throw headcount at the problem, hiring ever more ML engineers to create custom deployment solutions built on open source software like Spark.
So while companies of all sizes and maturities are investing in AI to build predictive models from their troves of proprietary data, what determines the AI/ML winners and losers is not headcount or legacy vs. startup. Rather, it is the ability of their data teams to build repeatable, scalable, and measurable processes to deploy and manage models in production.
Repeatability and scalability are often misconstrued to mean that data scientists should use only a certain set of tools or platforms. We believe the opposite: when it comes to AI, different frameworks and tools suit different kinds of use cases (e.g., a computer vision problem versus time series demand forecasting). Additionally, latency, regulatory, and cost requirements can mean models need to be deployed in environments different from where they were developed.
So is standardization even possible in production ML? It is, but it requires Chief Data & Analytics Officers to design their AI/ML programs with production in mind, choosing platforms that can answer questions like:
- How do I know the model inference is meeting the cost, latency, or throughput requirements needed by the business or other downstream systems?
- How do I easily make an update to a live model without impacting the business?
- How do I test a new model version so I can roll it out with confidence?
- How do I monitor the ongoing performance of a model?
- How do I scale the production environment for more data or use cases?
- How do I provide an audit trail for compliance and fit into an overall governance scheme?
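To make the second and third questions concrete, here is a minimal sketch of one common answer: shadow-testing, where a candidate model sees live traffic but only the incumbent model's answers reach the business. The model functions, tolerance threshold, and report fields below are all hypothetical stand-ins, not any particular platform's API; in practice the models would be loaded from serialized artifacts and the divergence report fed into a rollout decision.

```python
import statistics

# Hypothetical stand-ins for a deployed model and its proposed replacement.
def live_model(x):
    return 2.0 * x + 1.0

def candidate_model(x):
    return 2.1 * x + 0.9

def shadow_run(inputs, tolerance=0.5):
    """Serve the live model's predictions while logging the candidate's.

    Only the live model's outputs are returned to callers; the candidate
    runs silently alongside it. The summary report lets the team decide
    offline whether the candidate is safe to promote.
    """
    live_preds, deltas = [], []
    for x in inputs:
        live = live_model(x)          # this answer goes to the business
        shadow = candidate_model(x)   # this one is only logged
        live_preds.append(live)
        deltas.append(abs(live - shadow))
    report = {
        "mean_delta": statistics.mean(deltas),
        "max_delta": max(deltas),
        "within_tolerance": max(deltas) <= tolerance,
    }
    return live_preds, report

preds, report = shadow_run([0.0, 1.0, 2.0, 3.0])
```

Because the candidate never affects a live decision, this pattern answers the "update without impacting the business" and "roll out with confidence" questions at the same time; the same logged deltas can also feed the ongoing monitoring the next question asks about.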
The data science ecosystem will never be simple. But this doesn’t mean AI/ML in production will only be for a few winners. CDAOs can build sustainable, profitable programs by focusing on automation, standardization, and simplicity when it comes to operationalizing models.
Bio:
Vid Jain holds a Ph.D. in Theoretical Physics from UC Berkeley, is the author of three internet tracking patents, and has spent the last 20 years pushing the technology envelope with data-driven applications. Vid is currently the CEO and founder of Wallaroo.AI, a company hyper-focused on making it easy for AI teams to deploy, observe, and manage machine learning models in production at scale.
