“Training Compute Platform” powers 100% of Pinterest's ML training workloads and is the underlying infrastructure on which ML innovations are iterated on and productionized. We have worked closely with our partners in Ads and Organic Surfaces to power step-function innovations that have made ML training faster and more cost efficient. We enable more than 400 ML engineers to seamlessly iterate and experiment on their ML training without having to worry about operational details like NVIDIA driver upgrades, CUDA versions, dependency management, host provisioning, and instance upgrades. By providing a platform that centralizes and solves these problems, we let ML engineers focus on their strengths and gain massive leverage. Reducing the ops burden speeds ML developer velocity tremendously and allows ML engineers to iterate on next-generation models, enabling rapid innovation at Pinterest. We solved this problem by building on top of Kubernetes and its ecosystem to support single-node and multi-node training. Using this compute framework, we have also been able to power features like Multi-Node Multi-GPU Training, Hyperparameter Tuning, and data-privacy-aware ephemeral developer instances on GPUs.
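To illustrate the kind of Kubernetes building block such a platform rests on, here is a minimal, hypothetical sketch that generates a Kubernetes Indexed Job manifest for a multi-node, multi-GPU training run. The field names follow the standard `batch/v1` Job API (Indexed completion mode gives each pod a stable rank via the `JOB_COMPLETION_INDEX` environment variable); the image name, command, and helper function are illustrative assumptions, not Pinterest's actual implementation.

```python
def make_training_job(name: str, nodes: int, gpus_per_node: int) -> dict:
    """Build a Kubernetes Job manifest (as a plain dict) that runs `nodes`
    pod replicas, each requesting `gpus_per_node` NVIDIA GPUs.

    Hypothetical example: the image and command are placeholders.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # One completion per participating node; Indexed mode assigns
            # each pod a stable index (0..nodes-1) usable as its rank.
            "completions": nodes,
            "parallelism": nodes,
            "completionMode": "Indexed",
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": "example.com/ml/trainer:latest",  # assumed image
                        "command": ["python", "train.py"],
                        "resources": {
                            # GPU count must be a string in Kubernetes manifests.
                            "limits": {"nvidia.com/gpu": str(gpus_per_node)},
                        },
                    }],
                },
            },
        },
    }

job = make_training_job("ranking-model-train", nodes=4, gpus_per_node=8)
```

In a real platform, a manifest like this would be serialized to YAML (or submitted via the Kubernetes API) and paired with a headless Service so the ranked pods can discover each other for collective communication.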
Session Summary
The Infrastructure that Powers ML Training at Pinterest
MLconf 2023 New York City
Karthik Anantha Padmanabhan
Pinterest
Engineering Manager, Tech Lead