Fault-tolerable Deep Learning on General-purpose Clusters: Researchers are used to running deep learning jobs on clusters. In industrial applications, however, AI is built on top of big data, and deep learning is only one stage of the data pipeline. MPI-based clusters are not enough there: a general-purpose cluster management system is needed to run Web servers like Nginx, log collectors like fluentd and Kafka, data processors built on Hadoop, Spark, and Storm, and the deep learning jobs that improve Web service quality. This talk explains how we integrate PaddlePaddle and Kubernetes to provide an open source, fault-tolerable, large-scale deep learning platform.
Session Summary
Fault-tolerable Deep Learning on General-purpose Clusters
MLconf 2017 New York City
Yi Wang
Baidu
Tech Lead of AI Platform