Spark DataFrames and ML Pipelines: In this talk, we discuss two recent efforts in Spark to scale up data science: distributed DataFrames and Machine Learning (ML) Pipelines. These components let users manipulate distributed datasets and handle complex ML workflows through intuitive APIs in Python, Java, and Scala, with R support in development.
Data frames in R and Python have become standard tools for data science, yet they do not scale well to Big Data. Inspired by R and Pandas, Spark DataFrames provide concise, powerful interfaces for structured data manipulation. DataFrames support rich data types, a variety of data sources and storage systems, and state-of-the-art optimization via the Spark SQL Catalyst optimizer.
On top of DataFrames, we have built a new ML Pipeline API. ML workflows often involve a complex sequence of processing and learning stages, including data cleaning, feature extraction and transformation, training, and hyperparameter tuning. With most current tools for ML, it is difficult to set up practical pipelines. Inspired by scikit-learn, we built simple APIs to help users quickly assemble and tune practical ML pipelines.
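To make the idea of assembling a pipeline from stages concrete, here is a minimal sketch in plain Python. It is not Spark's API: every class name here (Tokenizer, WordCounts, KeywordClassifier, KeywordModel, Pipeline, PipelineModel) is a hypothetical toy written for this illustration, and rows are plain dictionaries rather than a distributed DataFrame. It only mimics the scikit-learn-style pattern in which fitting a chain of stages yields a fitted model that can then transform new data.

```python
from collections import Counter


class Tokenizer:
    """Transformer stage: splits each row's text into lowercase words."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class WordCounts:
    """Transformer stage: turns the word list into bag-of-words features."""

    def transform(self, rows):
        return [dict(row, features=Counter(row["words"])) for row in rows]


class KeywordClassifier:
    """Estimator stage (toy): fit() learns words seen only in positive rows."""

    def fit(self, rows):
        pos, neg = set(), set()
        for row in rows:
            (pos if row["label"] == 1 else neg).update(row["features"])
        return KeywordModel(pos - neg)


class KeywordModel:
    """Fitted model: a transformer that appends a 0/1 prediction."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            dict(row, prediction=int(any(w in self.keywords for w in row["features"])))
            for row in rows
        ]


class Pipeline:
    """Chains stages; fit() fits each estimator on the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):  # estimator: replace it with its fitted model
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """Result of Pipeline.fit(): applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Made-up training data, for illustration only.
train = [
    {"text": "spark makes pipelines simple", "label": 1},
    {"text": "the weather is nice", "label": 0},
]
model = Pipeline([Tokenizer(), WordCounts(), KeywordClassifier()]).fit(train)
new_rows = [{"text": "Spark is nice"}, {"text": "nice weather"}]
predictions = [row["prediction"] for row in model.transform(new_rows)]
```

In this sketch, feature extraction and training are fit with a single call, and the resulting model reapplies the same steps to new rows, so preprocessing cannot drift between training and prediction.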