Looking Beyond the Algorithm — Operationalizing Data Science
I lead the engineering team at Alpine Data, an enterprise machine learning startup headquartered in San Francisco. We build an enterprise-grade data science platform that provides an end-to-end solution, taking data scientists from problem definition through to model operationalization, where models can be seamlessly deployed to scoring engines running across the enterprise.
It’s this operationalization process that I want to discuss today. Data science is now a critical function of every enterprise, and there are significant expectations about the measurable value it should deliver. However, for an enterprise to benefit from the work of its data scientists, there needs to be an efficient mechanism for conveying these insights to the broader enterprise in a way that ensures the findings are acted upon and the benefits realized.
While this sounds obvious, this requirement is frequently neglected. Too often the attention of the data science team is focused solely on leveraging ever more sophisticated algorithms and improving model accuracy, with little thought given to how the resulting insights can change behavior across the enterprise.
While this operationalization can take many forms, including direct integration into CRM solutions, there is typically a general requirement to apply these models to real-time sources that are not necessarily integrated with the platform on which the models were trained. For instance, for predictive maintenance models trained on a wealth of historical data on a Hadoop cluster, the models need to be applied to the real-time streams emitted by the monitored machines. Similarly, for medical claims approval, network intrusion detection, or fraud detection, models trained on historical data need to be applied to streams of incoming data so that decisions can be made. Even in instances that don’t require real-time streaming support, data that needs to be scored on a daily, weekly, or even monthly basis often resides on a different data platform than the one holding the historical training data.
In many instances this is achieved by language-specific exports. Models developed in Python/scikit-learn can be exported and deployed to Python-based scoring engines. Similarly, models created in R can be exported in an R-specific format and imported into R-based scoring engines. While this works, it requires a scoring engine paired with every possible model training environment. That’s fine for R and Python, but what about Spark, MADlib, Flink, or TensorFlow? There’s a wide variety of environments used by data scientists to train models, and it’s important to avoid unnecessary proliferation in scoring engine environments.
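As a concrete illustration of a language-specific export, here is a minimal sketch of the common scikit-learn pattern (the toy data and model are invented for illustration): the model is serialized with pickle, so the resulting bytes can only be reloaded into another Python process running a compatible scikit-learn version.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# "Training environment": fit a toy model on synthetic data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Language-specific export: the pickle bytes are only usable from Python.
blob = pickle.dumps(model)

# "Scoring engine" side: deserialize and score incoming records.
scorer = pickle.loads(blob)
predictions = scorer.predict(X[:5])
print(predictions)
```

An R model would instead be saved with `saveRDS` and reloaded only in R, which is exactly the pairing problem described above.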
Ideally, we would leverage a standardized serialization format for our models that can be widely shared. This has been tried, with limited success, using PMML. PMML (the Predictive Model Markup Language) was developed for exactly this purpose, and provides a platform-independent way of expressing a wide variety of important machine learning models; most ML toolkits provide support for exporting models to PMML.
So why isn’t the use of PMML more prevalent? There are a number of different arguments that can be made, but I would argue that one of the strongest is that it represents only a partial solution. Any scoring flow must specify not only the details of the trained model, but also the entire sequence of transformations that must be applied to the incoming data stream, as illustrated below.
Most scoring flows don’t simply involve applying the ML model directly to the incoming data to be scored. Rather, significant refinement of the incoming features (e.g. transformation, normalization, combination, and expansion) is required. Basic feature engineering can be expressed in PMML. However, more often than not, the operations required cannot be, and helper code has to be developed and deployed in conjunction with the PMML specification. This introduces the requirement for a Java, Python, or C companion script, and eliminates many of the benefits associated with the use of PMML.
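To make the gap concrete, here is a hypothetical scikit-learn scoring flow (the log-ratio feature is invented for illustration): the scaler and the model map cleanly onto PMML constructs, but the arbitrary Python in the `FunctionTransformer` has no standard PMML representation, so it would have to ship as companion code alongside the PMML file.

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_log_ratio(X):
    # Custom feature engineering: append a log-ratio of two raw columns.
    # Arbitrary code like this falls outside what PMML can express.
    ratio = np.log1p(np.abs(X[:, 0])) - np.log1p(np.abs(X[:, 1]))
    return np.column_stack([X, ratio])

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

flow = Pipeline([
    ("engineer", FunctionTransformer(add_log_ratio)),  # needs helper code
    ("scale", StandardScaler()),                       # expressible in PMML
    ("model", LogisticRegression()),                   # expressible in PMML
]).fit(X, y)

print(flow.predict(X[:5]))
```

Deploying this flow via PMML would mean exporting the last two stages and separately packaging `add_log_ratio` for the scoring engine, which is exactly the companion-script problem described above.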
Happily, the Data Mining Group has been working to develop a new standard, called PFA (the Portable Format for Analytics). This standard benefits from the 20 years of experience gained with PMML, and ensures that a wide variety of transformation operations can be readily expressed, allowing the entire end-to-end scoring flow to be encapsulated in a single PFA document.
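For a flavor of the format, a minimal PFA document, in the spirit of the hello-world example from the specification, is just JSON describing an input type, an output type, and a sequence of actions; real scoring flows chain many such actions and carry model parameters alongside them:

```json
{
  "input": "double",
  "output": "double",
  "action": [
    {"+": ["input", 100]}
  ]
}
```

Because the transformations and the model live in one declarative document, any conformant PFA engine can execute the whole flow without companion scripts.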
For folks interested in more details about PFA, the specification can be found here. Additionally, the Open Data Group has kindly provided a complete, Apache-licensed PFA implementation coded in Java with Python wrappers, and I would encourage folks to download it and experiment with PFA. Finally, a large part of the effort behind driving broader adoption of PFA is ensuring that the most widely used OSS packages (e.g. scikit-learn or MLlib) provide support for it; please let us know if you are interested in getting involved in this effort!
Lawrence Spracklen leads engineering at Alpine Data. He is tasked with the development of Alpine’s advanced analytics platform. Prior to joining Alpine, Lawrence worked at Ayasdi as both VP of Engineering and Chief Architect. Before this, Lawrence spent over a decade working at Sun Microsystems, Nvidia, and VMware, where he led teams focused on the intersection of hardware architecture and software performance and scalability. Lawrence holds a Ph.D. in Electronic Engineering from the University of Aberdeen and a B.Sc. in Computational Physics from the University of York, and has been issued over 50 US patents.