Guest Post by Dr. Joonyoung Kim: Orchestrating the Next Generation of Machine Learning

Throughout the history of computing, software engineers have been constrained from innovating to their full potential by the limits of compute hardware, despite remarkable growth in hardware performance. The bottlenecks include slowing CPU performance gains, memory bandwidth and capacity, I/O speed, and network throughput and latency, to name a few.


Figure 1

From 1970 to 2008, computing performance improved roughly 10,000-fold, essentially doubling every two years in the pattern widely known as Moore’s Law (see Figure 1). From 2008 onward, however, the industry has seen only an 8x to 10x performance increase, just as machine learning and deep learning applications arrived with extensive data sets and computationally intensive models. Throughout the history of computing, hardware has never been fast enough for software engineers, and machine learning engineers now face the same challenge. At times, certain hardware metrics (CPU clock frequency, DRAM bandwidth, HDD and flash capacity) have seemed sufficient for existing, well-understood workloads, but the industry keeps inventing new ways to solve problems that used to be considered intractable. And innovation means that the next great thing tends to overwhelm current hardware capacity.

Deep learning models are a case in point. Looking at the workloads of hyper-scale data center operators, who develop and deploy more and more machine learning models every day, these models require terabytes or even petabytes of data to train, and in production they must serve hundreds of thousands of requests per second from billions of users with response times of tens of milliseconds or less.

There are other issues as well. With so many different (heterogeneous) processing units in the deep learning era, the ubiquitous CPU is no longer dominant; other silicon processors and accelerators have emerged alongside it. Heterogeneous computing has been discussed in academia for a long time, but with deep learning going mainstream and growing rapidly, we are finally seeing these various chips (FPGAs, ASICs, and GPUs, to name a few) fulfill the requirements of different workloads.

While integrated circuits have been in development since the 1950s, the Intel 4004, a 4-bit central processing unit (CPU), was released by Intel Corporation in 1971. It was Intel’s first commercially available microprocessor and the first in a long line of Intel CPUs. The initial ASICs, built with gate-array technology, were introduced in 1981 and 1982. FPGAs, led by Xilinx, followed in 1985. GPUs, specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images, were popularized by Nvidia in 1999, although the term had been in use since at least the 1980s.


All of these solutions have been carving out niches and driving innovation in various markets. That’s the good news. The bad news is that machine learning engineers, who are tasked with developing their models on top of frameworks such as Caffe, TensorFlow, and PyTorch, have neither the expertise nor the time to decide which hardware platform their workloads should run on.

This is why machine learning engineers need a layer of abstraction over the different platforms, and this is where the concept of orchestration comes into play. ML engineers should be able to simply launch their jobs, with an orchestration layer handling the rest by picking the best hardware resource for each job. Overall efficiency and utilization would be maximized and total cost of ownership (TCO) would be reduced.
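To make the idea concrete, here is a minimal sketch of what such an orchestration layer might look like. Everything in it (the `Job` fields, the device catalog, and the scoring heuristic) is a hypothetical illustration, not NVXL’s implementation or any particular scheduler’s API.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    kind: str            # e.g. "convolution", "compression", "training"
    batch_size: int
    latency_budget_ms: float

# Hypothetical device catalog: which workload kinds each device class
# handles well, and a rough relative throughput score.
DEVICES = {
    "cpu":  {"kinds": {"preprocessing", "compression"},             "throughput": 1},
    "gpu":  {"kinds": {"training", "convolution"},                  "throughput": 8},
    "fpga": {"kinds": {"convolution", "compression", "encryption"}, "throughput": 6},
}

def place(job: Job) -> str:
    """Pick a device class for a job: prefer devices suited to the
    workload kind, then break ties by throughput. Fall back to CPU."""
    candidates = [d for d, info in DEVICES.items() if job.kind in info["kinds"]]
    if not candidates:
        return "cpu"
    return max(candidates, key=lambda d: DEVICES[d]["throughput"])

if __name__ == "__main__":
    for job in [Job("resize-images", "preprocessing", 256, 50.0),
                Job("cnn-inference", "convolution", 32, 10.0),
                Job("log-compress", "compression", 1024, 500.0)]:
        print(f"{job.name:14s} -> {place(job)}")
```

A production scheduler would of course also weigh current queue depth, data locality, and the latency budget; the point is only that the engineer submits a job and the placement decision lives in the orchestration layer, not in the model code.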

With respect to efficiency, many applications have had to rely on traditional scaling, which is structured simply and sized for maximum demand. This approach keeps the application running all the time at maximum capacity, which is inefficient. Time-based scaling is an improvement because capacity is provisioned around peak times of the day, but it still carries a fairly high TCO.

Real-time scaling, however, is provisioned around minimum demand. With this approach, different algorithms can be dynamically loaded and provisioned on demand, so the system is free for other tasks until it is needed. This keeps dedicated resource usage low, delivers much higher efficiency and a lower TCO, and allows ‘spare’ compute power to be used by other applications when it is not needed. Please see Figure 2.


Figure 2
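As a back-of-envelope illustration of why this matters, the sketch below compares provisioned machine-hours under the three approaches for a made-up daily demand curve. The demand numbers, the 12-hour “peak window”, and the capacity rules are all assumptions chosen purely for illustration.

```python
# Toy hourly demand over one day, in "accelerator units" (made-up numbers).
demand = [2, 1, 1, 1, 2, 3, 5, 8, 10, 12, 12, 11,
          10, 11, 12, 12, 10, 9, 8, 6, 5, 4, 3, 2]

peak = max(demand)
peak_hours = range(7, 19)   # assumed 12-hour peak window

# Traditional scaling: provision for peak demand around the clock.
traditional = peak * len(demand)

# Time-based scaling: peak capacity during peak hours, half of peak otherwise.
time_based = sum(peak if h in peak_hours else peak // 2
                 for h in range(len(demand)))

# Real-time scaling: provision only what each hour actually needs.
real_time = sum(demand)

for name, hours in [("traditional", traditional),
                    ("time-based", time_based),
                    ("real-time", real_time)]:
    print(f"{name:12s} {hours:4d} unit-hours "
          f"({real_time / hours:.0%} of provisioned capacity doing useful work)")
```

Under these assumed numbers, real-time scaling needs a bit more than half the capacity of traditional scaling; the freed capacity is exactly the ‘spare’ compute that can be handed to other workloads.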

Now I want to discuss the benefits of FPGA-based acceleration for machine learning. First, many problems benefit from FPGA acceleration compared with running on a CPU. These are generally data-crunching algorithms and include, but are not limited to, compression, encryption, convolution, and de-duplication. Many machine learning accelerators have already been built on FPGAs, and the innovation is just starting. Second, FPGAs offer a high degree of customization. For instance, unlike CPUs and GPUs, where the data-path width is fixed to a power of 2 (32 bits, 16 bits, 8 bits), a custom precision such as FP11 (11-bit floating point) can easily be supported on an FPGA and consumes just enough resources to support that precision. Third, with relatively new technology such as partial reconfiguration, it is now possible to swap the contents of an FPGA in about a second from a host application. This allows the orchestration layer to dynamically provision the FPGA with different images for different problem domains depending on workload requirements, as sketched below. Contrast this with building a custom ASIC: even though an ASIC can deliver better performance for a specific problem, FPGA-based acceleration offers far more flexibility and does not require the multi-year investment in designing and manufacturing the ASIC, let alone a development cost that easily runs into tens of millions of dollars.
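Here is a minimal sketch of what that dynamic provisioning could look like from the host side. The `Fpga` class, the bitstream names, and the one-image-per-domain policy are all hypothetical; a real deployment would go through a vendor’s partial-reconfiguration or OpenCL runtime rather than this toy interface.

```python
import time

class Fpga:
    """Toy stand-in for a host-side FPGA handle. A real system would call
    into a vendor runtime to load a (partial) bitstream."""
    def __init__(self):
        self.loaded_image = None

    def load(self, image: str) -> None:
        if image == self.loaded_image:
            return                      # already configured for this domain
        time.sleep(1.0)                 # stand-in for ~1 s reconfiguration
        self.loaded_image = image

# Hypothetical mapping from workload domain to accelerator image.
IMAGES = {
    "convolution": "conv_fp11.bit",
    "compression": "gzip_accel.bit",
    "encryption":  "aes_gcm.bit",
}

def run_on_fpga(fpga: Fpga, domain: str, payload: bytes) -> None:
    fpga.load(IMAGES[domain])           # reconfigure only when the domain changes
    print(f"running {domain!r} with image {fpga.loaded_image} "
          f"on {len(payload)} bytes")

if __name__ == "__main__":
    fpga = Fpga()
    for domain in ["convolution", "convolution", "compression", "encryption"]:
        run_on_fpga(fpga, domain, b"\x00" * 4096)
```

Because the swap takes on the order of a second, the orchestration layer can treat accelerator images roughly the way an operating system treats processes, amortizing the reconfiguration cost across many requests in the same domain.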

Also, recent advances in design methodology such as OpenCL and HLS (high-level synthesis), coupled with traditional RTL design, make it possible to quickly design and deploy the right acceleration solution on a given FPGA platform based on customer needs. This methodology lets us explore different points in the design space, helps us arrive at an optimal architecture for the problem at hand, and requires far fewer verification steps than an ASIC design flow. If you look at the time and effort spent in a modern ASIC design flow, the majority goes not into architecting and designing the chip but into verifying it, because the cost of a silicon escape and the resulting re-spin is so onerous.
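For a sense of what the OpenCL path looks like from the host, here is a small sketch using PyOpenCL with a trivial vector-add kernel. This is only an illustration of the general flow: on a real FPGA board the kernel would be compiled offline into a board-specific binary and loaded rather than built from source at runtime, and the kernel itself is a placeholder, not one of NVXL’s accelerator designs.

```python
import numpy as np
import pyopencl as cl

KERNEL_SRC = """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c) {
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
"""

ctx = cl.create_some_context()             # picks an available OpenCL device
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, KERNEL_SRC).build()  # FPGA flows load a prebuilt binary here

a = np.random.rand(1 << 20).astype(np.float32)
b = np.random.rand(1 << 20).astype(np.float32)
c = np.empty_like(a)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, c.nbytes)

prg.vadd(queue, a.shape, None, a_buf, b_buf, c_buf)  # enqueue the kernel
cl.enqueue_copy(queue, c, c_buf)                     # read back the result

assert np.allclose(c, a + b)
```

The same host-side structure carries over whether the device underneath is a GPU used for development or an FPGA card in production, which is part of what makes the orchestration approach above practical.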

In closing, I hope this blog helps you understand the massive infrastructure challenge in machine learning acceleration. A well-designed machine learning acceleration system, including an orchestration layer and the underlying FPGA hardware and images, can significantly increase the performance and capability available to machine learning engineers. This, in turn, will help them design and deploy next-generation machine learning algorithms and advance the state of the art in machine learning.


Dr. Kim is currently working on building FPGA-based accelerators at NVXL Technology. He has been working on chip design since he started at Intel in 2001. His focus area has been front-end design and design methodology for Intel’s flagship client and server CPUs. In 2013, he joined SK Hynix as the Director of System Architecture where he was the founding member and team leader of Solutions Lab and worked on future memory and interface definition. In 2015, he returned to Intel and worked as the logic design manager of key server IPs for Xeon CPUs until the end of 2017. Joon holds a BS from Seoul National University, MS and Ph.D. from the University of Michigan, and an MBA from Berkeley Haas School of Business.