Our past Technical Chair interviewed Navdeep Gill about his thoughts on various machine learning techniques, debugging models, and visualizing high-dimensional data.
What is the challenge in debugging a machine learning model?
NG) Currently, traditional software programming uses Boolean-based logic that can be tested to confirm that the software does what it was designed to do, using tools and methodologies established over the last few decades. In contrast, machine learning is essentially a black box programming method in which computers program themselves with data, producing probabilistic logic that diverges from the true-and-false tests used to verify systems programmed with traditional Boolean logic methods. The challenge here is that the methodology for scaling machine learning verification up to a whole industry is still in progress. We have some clues for how to make it work, but we don’t have the decades of experience that we have in developing and verifying “regular” software.
Instead of traditional test suite assertions that respond with true, false or equal, machine learning test assertions need to respond with assessments. For example: the results of today's experiment had an AUC of 0.95, which is consistent with tests run yesterday. Another challenge is that machine learning systems trained on data produced by humans with inherent human biases will duplicate those biases in their models. A method of measuring what these systems do compared to what they were designed to do is needed in order to identify and remove bias.
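An assertion of this "assessment" style might be sketched as follows. The helper name, baseline store, and tolerance are illustrative assumptions, not a specific H2O.ai API; the point is that the test checks consistency within a tolerance rather than strict equality:

```python
# Hypothetical sketch of a metric-consistency assertion for an ML test suite.
# Instead of a strict true/false check, today's metric is compared against a
# stored baseline run within a tolerance. All names here are illustrative.

def assert_metric_consistent(metric_name, today, baseline, tolerance=0.02):
    """Pass if today's metric is within `tolerance` of the baseline run."""
    drift = abs(today - baseline)
    if drift > tolerance:
        raise AssertionError(
            f"{metric_name} drifted by {drift:.3f} "
            f"(today={today:.3f}, baseline={baseline:.3f})"
        )
    return drift

# Today's AUC of 0.95 is consistent with yesterday's 0.94:
drift = assert_metric_consistent("AUC", today=0.95, baseline=0.94)
```

A real suite would also log the assessment rather than only pass/fail, so that gradual drift across many runs remains visible.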
Furthermore, traditional software is modular, which lends itself to isolating the inputs and outputs of each module to identify which one has the bug. However, in machine learning the system has been programmed with data, so any bug is replicated throughout the system: changing one thing changes everything. There are techniques for detecting that there is an error, and there are methods for retraining machine learning systems. However, there is no way to fix a single isolated problem, which is another huge challenge in itself.
To paraphrase, a better set of tools is needed. The entire tool set needs to be updated to move forward. People in industry, including at H2O.ai, are working on this exact problem to ensure the models that are built in production are as accurate as possible and produce results that are aligned with the real world.
What are the limitations of explaining a model with histograms and scatterplots?
NG) The limitations come in the dimensionality of datasets and the dimensionality built out in model building. Most machine learning models express interesting interactions among features in an n-dimensional space. Visualizing the effect of one or two variables on an outcome is simple. However, once you go into more than two dimensions we reach a problem with interpretability and approach the infamous curse of dimensionality. For example, tree models and deep learning involve a tremendous amount of interactions between features, and these interactions only become more complex as the feature space increases. This can pose a huge hurdle in terms of scaling these out to traditional visualizations, and thus increase the difficulty of explaining a machine learning model using traditional visualization methodology.
Is TensorBoard a step forward in visual debugging of a model?
NG) Yes, I believe so. Tensorboard allows you to visualize the graph itself, which is quite helpful for most users of TensorFlow. When building and debugging new models, it is easy to get lost in the weeds. For me, holding a mental context for a new framework and model I’m building to solve a hard problem is already pretty taxing, so it can be really helpful to inspect a totally different representation of a model; the TensorBoard graph visualization is great for this.
What is the next step for visualizing machine learning?
NG) The next step involves building visualizations that can be used to educate, propose business value and be used in an interactive manner for all machine learning purposes. A great example of this is R2D3, which is a product by Tony Chu from H2O.ai. This visual tutorial walks you through a machine learning use case, explaining each step along the way using insightful visuals that can pique the interest of any person in the field of machine learning. These kinds of visualizations can also be built out to show stakeholders business value for almost any business problem that requires machine learning. In addition, they can help a resident Data Scientist get insight into their data and machine learning models with ease.
What is the solution to visualizing big high-dimensional data?
NG) Two possible solutions to this problem are dimensionality reduction and feature selection. Dimensionality reduction involves taking your feature space and projecting it down to a lower-dimensional space, which can help visualize huge data sets. Some examples are principal component analysis (PCA), multidimensional scaling and t-SNE, to name a few.
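As a minimal sketch of the PCA approach described above, the following projects a 50-dimensional dataset down to two components that could then be scatter-plotted. The synthetic data is an illustrative assumption:

```python
# Minimal PCA sketch with NumPy: project high-dimensional data down to the
# two directions of greatest variance so it can be visualized in 2-D.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 rows, 50 features (synthetic)

# Center the data, then take the eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top2 = eigvecs[:, -2:][:, ::-1]         # the two largest components

projected = Xc @ top2                   # shape (200, 2), ready to plot
```

In practice a library implementation (e.g. scikit-learn's `PCA` or an implementation of t-SNE) would typically be used instead of the eigendecomposition written out by hand.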
Feature selection allows you to choose the "best" features from your data, which you can then visualize. "Best" could be defined based on your cost function or variable importance. For example, you could visualize the top five features that are the most important to your model, or you can use some type of criterion such as Weight of Evidence or Information Value to choose the "most important" variables. In addition, one can also use visuals such as parallel coordinates, scatterplot matrices, glyph plots, Andrews plots or arc diagrams.
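A simple filter-style version of this feature selection idea can be sketched as follows: rank features by the absolute correlation of each column with the target and keep the top five for plotting. The data and the scoring rule are illustrative assumptions; in practice one might use model-based variable importance, Weight of Evidence, or Information Value as the criterion instead:

```python
# Sketch of filter-style feature selection: score each feature by its
# absolute correlation with the target, then keep the top k to visualize.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
# Synthetic target driven mainly by features 4 and 7 (an assumption for
# illustration), plus noise.
y = 3.0 * X[:, 4] - 2.0 * X[:, 7] + rng.normal(scale=0.5, size=500)

scores = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
)
top5 = np.argsort(scores)[::-1][:5]  # indices of the five "best" features
```

The five selected columns are then few enough to show directly in a scatterplot matrix or parallel-coordinates plot.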

Navdeep Gill is a Data Scientist at H2O.ai. He graduated from California State University, East Bay with an M.S. in Computational Statistics, a B.S. in Statistics and a B.A. in Psychology (minor in Mathematics). During his education he discovered an interest in machine learning, time series analysis, statistical computing, data mining and data visualization.
Prior to H2O.ai, Navdeep worked at several startups and Cisco Systems, focusing on data science, software development and marketing research. Before that, he was a consultant at FICO working with small- to mid-size banks in the U.S. and South America, focusing on risk management across different bank portfolios (car loans, home mortgages and credit cards).