Large-scale machine learning offers considerable advantages, but the challenge with big data is often not the learning itself but the preprocessing and cleaning needed to get data into a learnable form. Users are forced to mix a variety of tools (e.g. Hadoop, SQL, Pig) for preprocessing, which not only increases complexity but can also take longer to execute than the learning algorithm itself. Apache Spark is a general-purpose cluster computing engine that addresses this problem by efficiently supporting *both* data processing tasks and machine learning. Spark runs a generalization of the MapReduce model that supports both iterative algorithms (and hence most common ML tasks) and multi-stage data processing (e.g. SQL), outperforming MapReduce by as much as 100x in both cases. In addition, these computations can be combined efficiently, sharing data through memory. Using Spark, it’s possible to write an end-to-end ML workflow in one program that will not only outperform a traditional multi-tool pipeline but also be easier to iterate on and maintain. Finally, Spark offers high-level APIs in Scala, Java and Python and supports interactive use, making it possible to use one tool for both initial prototyping and exploration and large-scale deployment on a cluster.
Session Summary
MLconf 2013
Matei Zaharia
Databricks
CTO