Guest Blog by Rommel Garcia, Director, Solution Engineering, Southeast, Kinetica

Why GPU-Optimized Databases Are Such a Game Changer for Machine Learning and AI

GPUs are well-suited for the types of vector and matrix operations found in machine learning and deep learning, and they can dramatically reduce the amount of time it takes to train a model.

A distributed, in-memory database accelerated by GPUs leverages the parallel processing power of GPUs to converge ML, deep learning, AI, and classic OLAP-type workloads on one powerful platform. In addition, a database like Kinetica, with its open framework, makes it possible for machine learning and artificial intelligence libraries such as TensorFlow, BIDMach, Caffe, Torch, and others to work directly on the data.

Figure 1. GPU Database For OLAP, Geospatial and Data Science

GPUs typically accelerate simulations to around 100x the performance of traditional CPU-based databases, on roughly 1/10th of the hardware.

Here are several reasons why you should choose a GPU-accelerated database for your ML and AI workloads:

1) One single platform. With a GPU-accelerated database, you can use both simple and complex machine learning and deep learning algorithms on one platform. It simplifies your data pipelines by fusing the speed and batch processing layers. Parallel ingest and reduced reliance on indexes mean data is available for query the moment it arrives. Additionally, with UDFs (user-defined functions), data exploration and analytics can all be performed on a single compute-heavy platform: custom code runs directly on the data within the database, so you can perform complex queries on billions of rows in well under a second without moving data between systems, all on the same platform you use for traditional analytics and data exploration. Why custom code? Canned algorithms from existing tools no longer cut it, and organizations are investing heavily in data science teams to improve their business results.

2) Say goodbye to traditional ML modeling. With a GPU-accelerated database, you don't have to manage flat files in your file system, and keep in mind that flat files offer no security. Multiple replicas of the data eliminate single points of failure and provide an eventually consistent data store with recovery capability. You also don't need to write the intermediate file formats frequently seen with Monte Carlo simulations; a GPU database can automate that process.

3) Built-in geospatial capability. If your machine learning/deep learning project requires location-based analytics, a GPU-accelerated database can provide that capability. GPU databases come with a native geospatial and visualization pipeline for rendering large volumes of data over maps. A built-in geometry engine combines native geospatial functionality, filtering, and GPU acceleration to deliver a database that can work with massive geospatial datasets, all within a single system. Typical ML use cases on spatial data include predicting optimal hotel sites, identifying where landslides will occur, assessing flood susceptibility, and understanding disease outbreaks; a generic sketch of this kind of location-based computation follows this list.

4) Open framework = multiple options. With an open UDF framework, you can easily incorporate other machine learning platforms. Some GPU databases include a native REST API and associated connectors that provide data scientists with robust tools for interacting with Spark ML, Python, TensorFlow, Node.js, Java, C++, and more.

Figure 2. User Defined Function (UDF) Framework – Bring Your Own Code

5) Significant cost savings. Since a GPU-accelerated database can provide everything on one platform, you will enjoy significant cost savings. There's no need to purchase a separate database or server as a machine learning/deep learning/geospatial platform; you simply pay for one license to get access to the entire platform.

6) Increased performance/collaboration. Since the data has already been cleansed and structured in tables, data scientists and analysts can collaborate effectively, simplifying data discovery, data processing, and predictive modeling.

7) Simulations are screaming fast. GPU databases are highly distributed and very different from legacy architectures: they are built from the ground up to use the thousands of cores in each GPU as their compute framework.

8) Focus on data value, not ETL. With a GPU database, data scientists don't need to perform ETL on the data once it's in the database. This frees up precious time so that data scientists can focus on finding value and insights in the data. In-database AI processing can help data scientists discover patterns and uncover hidden insights in sub-second time. You'll be able to run customized, GPU-accelerated algorithms to achieve your objective, whether you're detecting fraud, serving online recommendations, or analyzing data streams.
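To make the geospatial point in item 3 concrete, here is a minimal, generic sketch of the kind of location-based feature such workloads compute: haversine distances from candidate sites to a point of interest. It is plain NumPy, not tied to any particular database, and the coordinates are made up for illustration.

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between points, in kilometers
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 \
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# hypothetical candidate hotel sites (lat, lon) and a nearby airport
site_lat = np.array([33.749, 33.640, 34.052])
site_lon = np.array([-84.388, -84.428, -84.297])
print(haversine_km(site_lat, site_lon, 33.6407, -84.4277))  # km to the airport

In a GPU database, a feature like this would typically be computed by the native geometry engine rather than in client-side Python, which is what keeps massive geospatial datasets inside a single system.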

Code Examples:
Here's an example of how to construct UDF (user-defined function) code, including a TensorFlow variant. After registration, the code below runs in parallel across multiple threads, and getting a database table into a Python dataframe (or writing one back out to a table) takes only a few lines.
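Before a UDF can run, it has to be registered with the database; the comments in the code below reference a registration script, lci_create_proc.py. For orientation, here is a minimal sketch of what such a script might look like with Kinetica's Python API; the proc name, source file name, and connection details are assumptions for illustration.

# sketch of a UDF registration script (assumed names and connection details)
import gpudb

# connect to the database
db = gpudb.GPUdb(encoding='BINARY', host='127.0.0.1', port='9191')

# read the UDF source so it can be shipped to every rank in the cluster
file_name = 'lci_simulation.py'  # hypothetical file holding the UDF code below
files = {file_name: open(file_name, 'rb').read()}

# register the proc to run in distributed mode, one instance per rank
db.create_proc(proc_name='lci_proc',
               execution_mode='distributed',
               files=files,
               command='python',
               args=[file_name],
               options={})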

Example using CPUs:
# The following is the Python source for a distributed UDF (user-defined function)
# This code runs in parallel across multiple threads

# import libraries.
from kinetica_proc import ProcData
import pandas as pd
import numpy as np

# get references to the input tables and the output table
proc_data = ProcData()

# there is only one input table: lci_input, defined in lci_create_proc.py
input_data = proc_data.input_data[0]

# create a dataframe from the input table
mydf = pd.DataFrame({
    'serialNum' : input_data["serialNum"],
    'dist_type' : input_data["dist_type"],
    'dist_para1' : input_data["dist_para1"],
    'dist_para2' : input_data["dist_para2"],
    'dist_para3' : input_data["dist_para3"],
    'dist_para4' : input_data["dist_para4"],
    'dist_para5' : input_data["dist_para5"]
})

# use the first row's parameters as the values written back to the output
if len(mydf) > 0:
    serialNum = mydf.serialNum[0]
    dist_type = mydf.dist_type[0]
    dist_para1 = mydf.dist_para1[0]
    dist_para2 = mydf.dist_para2[0]
    dist_para3 = mydf.dist_para3[0]
    dist_para4 = mydf.dist_para4[0]
    dist_para5 = mydf.dist_para5[0]

# for each distribution, generate simulated data
simuData = []
for index, row in mydf.iterrows():
    if row['dist_type'] == 'normal':
        simuData = np.append(simuData, np.random.normal(row['dist_para1'], row['dist_para2'], 10000))

# Logic for the output table.
# The output table is python_output_01
output_python_01_table = proc_data.output_data[0]

# size the output table to hold every simulated value
row_total = len(simuData)
output_python_01_table.size = row_total

# map the output columns to lists
col1 = output_python_01_table["serialNum"]
col2 = output_python_01_table["dist_type"]
col3 = output_python_01_table["dist_para1"]
col4 = output_python_01_table["dist_para2"]
col5 = output_python_01_table["dist_para3"]
col6 = output_python_01_table["dist_para4"]
col7 = output_python_01_table["dist_para5"]
col8 = output_python_01_table["simu_result"]

# assign values to the output columns
for i in range(row_total):
    col1[i] = serialNum
    col2[i] = dist_type
    col3[i] = dist_para1
    col4[i] = dist_para2
    col5[i] = dist_para3
    col6[i] = dist_para4
    col7[i] = dist_para5
    col8[i] = simuData[i]

# Call API to complete the UDF. Output data will be stored in tables
proc_data.complete()
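Once the UDF is registered, executing it is a single call that names the input and output tables. Here is a sketch of what that might look like; the proc, table, and connection details are illustrative and match the names used in the comments above.

# sketch of executing the registered UDF (assumed names and connection details)
import gpudb

db = gpudb.GPUdb(encoding='BINARY', host='127.0.0.1', port='9191')

resp = db.execute_proc(proc_name='lci_proc',
                       params={},
                       bin_params={},
                       input_table_names=['lci_input'],
                       input_column_names={},
                       output_table_names=['python_output_01'],
                       options={})
print(resp['run_id'])  # handle for tracking the run's status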

Example using GPUs:

# The following is the Python source for a distributed UDF (user-defined function)
# This code runs in parallel across multiple threads
# and uses a GPU to accelerate the calculation

# import libraries.
from kinetica_proc import ProcData
import pandas as pd
import numpy as np
import os
import tensorflow as tf

# TensorFlow setup
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
config = tf.ConfigProto()
config.log_device_placement = True
config.gpu_options.allow_growth = True

# get references to the input tables and the output table
proc_data = ProcData()

# Assign one GPU per UDF rank (ranks are 1-based, GPU device IDs are 0-based)
config.gpu_options.visible_device_list = str(int(proc_data.request_info["rank_number"]) - 1)

# there is only one input table: lci_input, defined in lci_create_proc.py
input_data = proc_data.input_data[0]

# create a dataframe from the input table
mydf = pd.DataFrame({
    'serialNum' : input_data["serialNum"],
    'dist_type' : input_data["dist_type"],
    'dist_para1' : input_data["dist_para1"],
    'dist_para2' : input_data["dist_para2"],
    'dist_para3' : input_data["dist_para3"],
    'dist_para4' : input_data["dist_para4"],
    'dist_para5' : input_data["dist_para5"]
})

# use the first row's parameters for the simulation
if len(mydf) > 0:
    serialNum = mydf.serialNum[0]
    dist_type = mydf.dist_type[0]
    dist_para1 = mydf.dist_para1[0]
    dist_para2 = mydf.dist_para2[0]
    dist_para3 = mydf.dist_para3[0]
    dist_para4 = mydf.dist_para4[0]
    dist_para5 = mydf.dist_para5[0]

# for each distribution, generate simulated data on the GPU
num_simu = 10000  # number of samples to draw; matches the CPU example above
init = tf.global_variables_initializer()
with tf.Session(config=config) as sess:
    sess.run(init)
    simuData = sess.run(tf.random_normal([num_simu], mean=dist_para1,
                                         stddev=dist_para2, dtype=tf.float32))

cnt = len(simuData)

# repeat the scalar serial number once per simulated value
serialNum = [serialNum for _ in range(cnt)]
# Logic for the output table.
# The output table is python_output_01
output_python_01_table = proc_data.output_data[0]
output_python_01_table.size = cnt

# map the output columns to lists
col1 = output_python_01_table["serialNum"]
col8 = output_python_01_table["simu_result"]

col1.extend(serialNum)
col8.extend(simuData)

# Call API to complete the UDF. Output data will be stored in tables
proc_data.complete()
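After the run completes, you can sanity-check the output table directly from Python. A rough sketch, assuming the JSON record encoding and the illustrative table name used above:

# sketch: pull a few output rows back for a quick sanity check
import json
import gpudb

db = gpudb.GPUdb(encoding='BINARY', host='127.0.0.1', port='9191')

resp = db.get_records(table_name='python_output_01', offset=0, limit=10,
                      encoding='json')
for rec in resp['records_json']:
    print(json.loads(rec))  # each record as a dict of column -> value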

Summary

Organizations are infusing machine learning and AI into their applications, and GPUs have opened the door to new ML and AI use cases. With a GPU-accelerated database, you can explore data, formulate hypotheses, and use machine learning to find models that accomplish a wide variety of objectives, from personalized medicine to financial trading to detecting security intrusions, and much more.

Rommel Garcia, Director, Solution Engineering, Southeast, Kinetica

Rommel is Director of Solution Engineering, Southeast, based in Atlanta. He came to Kinetica from Hortonworks, where he worked for four years as a Sr. Solutions Engineer and Security SME, specializing in big data platform architecture, security and governance, OLAP/OLTP/search, and operations. Prior to Hortonworks, he worked at Liaison Technologies on B2B, C2B, B2C, B2G, and G2B data integration, master data management, tokenization security solutions, and data mapping/transformation. Java is his primary programming language, and he has also used Perl, JavaScript, and other scripting languages.

Rommel earned his master's degree in computer science and his bachelor's degree in electronics engineering.