Our past Technical Chair interviewed Dr. Le Song, Assistant Professor in the College of Computing, Georgia Institute of Technology, regarding his upcoming presentation, Understanding Deep Learning for Big Data, scheduled for 09/23/16 at MLconf Atlanta.
You have done a lot of work on kernel methods. A year ago there were indications that kernel methods are not dead and that they could match or outperform deep nets. Is that the case? Is it time for them to retire?
LS) I think kernel methods and deep learning are cousins. The field needs to combine them rather than throwing either away.
In fact, they share many similarities: both try to learn nonlinear functions. One can design kernel functions to capture problem structure, much as one can choose a deep learning architecture according to the problem at hand. More interestingly, the feature maps of kernel functions have a one-to-one correspondence with the activation units in deep learning. For instance, the arc-cosine kernel is an infinite combination of rectified linear units. Hence researchers also call kernel methods infinite neural networks.
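As a concrete illustration of this correspondence (my sketch, not part of the interview): the first-order arc-cosine kernel of Cho and Saul has a closed form, and it equals the expected product of random ReLU features — i.e., an infinitely wide one-layer ReLU network. A Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])
y = np.array([0.3, 2.0])

# Closed form of the first-order arc-cosine kernel (Cho & Saul, 2009):
# k(x, y) = ||x|| ||y|| (sin t + (pi - t) cos t) / pi, with t the angle between x and y.
t = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
k_exact = (np.linalg.norm(x) * np.linalg.norm(y) / np.pi
           * (np.sin(t) + (np.pi - t) * np.cos(t)))

# Monte Carlo with random ReLU features, w ~ N(0, I):
# k(x, y) = 2 * E[relu(w.x) * relu(w.y)] -- an infinitely wide ReLU layer.
W = rng.standard_normal((200_000, 2))
k_mc = 2 * np.mean(np.maximum(W @ x, 0) * np.maximum(W @ y, 0))

print(k_exact, k_mc)  # the two estimates agree to within about 1%
```

Averaging over more random features tightens the match, which is exactly the sense in which the kernel is an "infinite" ReLU network.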
The main algorithmic difference between kernel methods and deep learning is that the parameters of the kernels are typically fixed and only the classifiers are learned, while the parameters of the activation units in deep learning are learned together with the classifiers. This difference allows deep learning models to be more flexible given the same number of parameters. In the face of big data, it also means that deep learning is more scalable given the same model flexibility. Very few works have seriously compared deep learning with kernel methods in the big data regime, simply because nobody knew how to scale up kernel methods without sacrificing model flexibility. We did some fundamental work on scaling up kernel methods, and we found matching performance between kernel methods and deep learning. Especially nowadays, when researchers talk about learning the kernel functions, the line between kernel methods and deep learning really blurs.
The main theoretical difference between kernel methods and deep learning is that there is an almost complete set of theory to explain the behavior of kernel methods, while there is almost no satisfactory theory to explain the behavior of deep learning. When researchers talk about kernel methods, they mean not just the set of algorithms, but also the set of unique tools and theories developed to analyze and provide guarantees for kernel methods. Researchers, including us, are working on explaining the generalization ability of deep learning models. In fact, we are now working on using tools from kernel methods to understand the generalization ability of deep learning models.
I think the next big thing will be kernel methods + deep learning + graphical models.
Deep nets have high information capacity compared to linear models. If the data is noisy, then all this capacity is filled with noise. Are there any practical methods for deciding when it is worth using a deep net versus a simpler model?
LS) Use cross-validation.
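A minimal sketch of that advice: score a linear model and a small neural net with cross-validation and keep whichever generalizes better. The dataset and hyperparameters here are illustrative, not a recommendation from the interview.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
deep = make_pipeline(StandardScaler(),
                     MLPClassifier(hidden_layer_sizes=(64, 64),
                                   max_iter=1000, random_state=0))

# 5-fold cross-validated AUC for each candidate; pick the winner.
for name, model in [("linear", linear), ("mlp", deep)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

On small, clean datasets like this one the simpler model often ties the net, which is the point: let held-out performance, not fashion, decide.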
Given the success of deep nets, data scientists and researchers always wonder whether it is time to give up traditional ML. Is there a reason why we should keep investing in other schools of ML, like Bayesian modeling, graphical models, etc.?
LS) Definitely, we should keep investing in other schools of ML, like Bayesian modeling and graphical models, and we should keep teaching students these subjects. Beyond the possibility of new and better methods coming out of these areas, they can simply inspire deep learning models. We recently invented a completely new deep learning architecture based on the belief propagation algorithm from graphical models. Now our method is the best method for learning representations of discrete structures, such as taxonomies, drug molecules and social networks.
Given all this research in deep learning, there seems to be a plethora of architectures. Is there a map or a guideline for choosing the right architecture? Do we just have to follow the rule “do whatever Google does”?
LS) A good architecture for a particular problem is heavily based on domain knowledge and previous successful models. For instance, convolutional neural networks are based on previous image feature extractors such as pyramid features, and recurrent neural networks are based on previous successful sequence models such as hidden Markov models. Google definitely has more computing resources to blindly search for a better architecture for a set of canonical problems, but your problem may be different from these canonical problems.
Neural nets have been around for decades. Why did it take so long for them to blast off? What do you think people missed in the 80s?
LS) I think neural networks are just more scalable than other nonlinear models out there. For big data, you need both flexibility and scalability. Neural networks have both.
Apart from the deep learning architecture labyrinth, scientists also have to navigate the platform labyrinth (TensorFlow, MXNet, Theano, Torch, Keras, etc.). Do you have any preference or advice?
LS) I prefer TensorFlow and MXNet. You can program in Python and easily interface with other programs. MXNet is very efficient and faster than TensorFlow. MXNet also has multi-GPU and multi-machine automatic parallelization at high speed.

Dr. Le Song, Assistant Professor in the College of Computing, Georgia Institute of Technology
Le Song is an assistant professor in the College of Computing, Georgia Institute of Technology. He received his Ph.D. in Machine Learning from University of Sydney and NICTA in 2008, and then conducted his post-doctoral research in the Department of Machine Learning, Carnegie Mellon University, between 2008 and 2011. Before he joined Georgia Institute of Technology, he was a research scientist at Google. His principal research direction is machine learning, especially nonlinear methods and probabilistic graphical models for large scale and complex problems, arising from artificial intelligence, social network analysis, healthcare analytics, and other interdisciplinary domains. He is the recipient of the NSF CAREER Award’14, AISTATS’16 Best Student Paper Award, IPDPS’15 Best Paper Award, NIPS’13 Outstanding Paper Award, and ICML’10 Best Paper Award. He has also served as the area chair for leading machine learning conferences such as ICML, NIPS and AISTATS, and action editor for JMLR.
Interview With Alex Korbonits, Data Scientist, Remitly
Our past Technical Chair interviewed Alex Korbonits, Data Scientist at Remitly, about his thoughts on artificial intelligence as it relates to the arts.
What kind of features do you need to use when you train a model that generates art?
AK) Art is perceived. Since AlexNet in 2012, deep neural networks have come to be synonymous with the state of the art in computer vision and a range of other perception tasks, pushing the boundaries of what machine learning models can achieve much further than ever before. Hand-crafted features are out and distributed feature representations are in. In other words, you don’t create specific features; instead, you create the model architecture. Stephen Merity made an excellent point about this recently in a blog post: “In deep learning, architecture engineering is the new feature engineering”.
In that sense, the question then becomes: what kinds of architectures do you need to use when you train a model that generates art? First of all, if you want to generate art you’d better use a generative model :). After that your choice of architecture should follow the data it’s modeling. For visual art you probably would want convolutions. For music you’d want recurrence. For film you’d want both.
Give us an overview of generative models that have been successful in generating art.
AK) There have been so many exciting and interesting projects in this space within the last year alone that I sadly have to limit myself to just the ones I know.
First of all, Google’s “Deep Dream” took the world by storm last summer. Google took their well-understood discriminative classifier, the Inception network built for the 2014 ImageNet competition (also known as GoogLeNet in an homage to Yann LeCun’s LeNet), and decided to go deeper, as it were, by building a generative visualization tool to give intuition for the kinds of features/concepts the classifier was learning at different layers of the network. They wrote up a blockbuster Google Research blog post about it, along with releasing a GitHub repo and an IPython Notebook demonstrating Deep Dream. Entire startups and homespun projects were created around Deep Dream to take any image and effectively create a version of it reminiscent of a scene from Fear and Loathing in Las Vegas. The horror, the horror.
Second, Andrej Karpathy released his own code in a GitHub repo for a character-level recurrent neural network implementation (called char-rnn) in Torch, whose creative properties he demonstrated in a very popular blog post. Among other things, he trained a character-level LSTM (long short-term memory network) on different corpora, including War and Peace, the complete works of Shakespeare, an open source set of LaTeX files of papers in algebraic geometry, and the source code for Linux. This kicked off a series of humorous web applications, such as an ironic clickbait headline generator reminiscent of BuzzFeed, as well as myriad Twitter bots generating sophisticated fake tweets for any well-known persona who is easy to mock ;). Traditionally, a lot of the generative models used for this kind of thing were simple Markov chains, but LSTMs learn long-term dependencies that are comparatively impressive. Even an LSTM trained on James Joyce’s Ulysses doesn’t look too far off from the real thing.
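The Markov-chain baseline that LSTMs improve on fits in a few lines. This toy character-level bigram sampler (my illustration, unrelated to Karpathy’s code) shows why such models only ever capture one step of context:

```python
import random
from collections import defaultdict

def train_bigram(text):
    """Character-level Markov chain: map each character to its observed successors."""
    chain = defaultdict(list)
    for a, b in zip(text, text[1:]):
        chain[a].append(b)
    return chain

def generate(chain, seed, length, rng):
    """Sample a string by repeatedly picking a random successor of the last character."""
    out = [seed]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return "".join(out)

corpus = "the quick brown fox jumps over the lazy dog. the dog sleeps."
rng = random.Random(0)
chain = train_bigram(corpus)
print(generate(chain, "t", 40, rng))
```

The output is locally plausible but globally incoherent; an LSTM’s hidden state is what buys the long-range structure Karpathy’s samples show.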
Third, an interesting area for generative models in art is music. There’s a lot of work to be done here in terms of richness and training on audio recordings, but some low-hanging fruit has already been found. Two implementations I know of thus far: (1) using an LSTM (even Karpathy’s char-rnn) to take music with a text encoding (such as MIDI files) and generate new music from the training data (which I had some fun playing with a year ago, training on MIDI files of Beethoven piano sonatas, but didn’t post); and (2) an awesome post on Daniel Johnson’s blog hexahedria, wherein he describes an implementation of what he calls a “Biaxial Recurrent Neural Network for Music Composition” that seems to generate results far superior to mine.
Fourth, I would be remiss not to discuss the well-known “A Neural Algorithm of Artistic Style”. This has now spawned an app called Prisma that, while not quite as white-hot popular as Pokemon Go, is nonetheless very much in vogue as you are reading this. A generative model is trained on a base image and a style image, whose output is a composition that resembles the base image in the “style” of the style image. Now you can even transfer style while keeping the original colors, which was not the case in the original implementation. Functionally this means that, for example, I can take a normal photograph of the Space Needle in Seattle and generate a version of it inspired by a favorite cubist painting by Braque.
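The core of the style-transfer method mentioned here is a style loss built from Gram matrices of CNN feature maps: two images share a “style” when their channel-correlation statistics match. A schematic NumPy version (real implementations use pretrained CNN activations; random arrays stand in for them here):

```python
import numpy as np

def gram_matrix(features):
    """features: (channels, height*width) activation map -> channel correlation matrix."""
    return features @ features.T / features.shape[1]

def style_loss(feats_generated, feats_style):
    """Mean squared difference between the two Gram matrices."""
    g1, g2 = gram_matrix(feats_generated), gram_matrix(feats_style)
    return np.mean((g1 - g2) ** 2)

rng = np.random.default_rng(0)
style = rng.standard_normal((16, 32 * 32))          # stand-in for CNN features
same_style = style + 0.01 * rng.standard_normal(style.shape)
other = rng.standard_normal((16, 32 * 32))

# A lightly perturbed copy has nearly identical statistics; an unrelated
# feature map does not, so its style loss is much larger.
print(style_loss(same_style, style) < style_loss(other, style))  # True
```

In the full algorithm this loss is summed over several CNN layers and minimized by gradient descent on the pixels of the generated image, alongside a content loss on the raw activations.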
Last, and perhaps most recently, we saw on Ars Technica that a group of artists released a short film called Sunspring whose script was entirely composed by “an AI” (which I have a hunch was trained on a derivative of Karpathy’s char-rnn), trained on a large corpus of science fiction scripts and presumably given different plot-related prompts/primers to generate short sequences that made up the entirety of the final script. We now see real artists using state-of-the-art machine learning models to assist in the creative process. How cool is that?
Can deep learning embeddings provide dimensions that are associated with art measures that humans pick up on?
AK) Yes.
A recent paper, “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” explores this idea. First, the authors determine semantically meaningful directions within the embeddings to locate and expose bias. They then exploit the existence of these directions to combat bias. Super cool.
In this sense, I think it is totally possible to create/discover dimensions/directions within an embedding that are associated with human-interpretable art measures. You could stay within the medium you’ve embedded to explore and interpret the space — e.g., paintings — or you could marry up your embedding of paintings with some word embeddings (perhaps via descriptions of your paintings or painting metadata) to help understand how you’re moving around it. Chris Moody from Stitch Fix gave a great talk at Data Day Seattle this summer combining word embeddings and images of clothing items to make recommendations to customers. There’s no reason you couldn’t do this with art. For example, you could explain the artistic/aesthetic differences in style between two similar paintings by different artists (such as cubist portraits by Picasso and Braque) by comparing color palette, brushstrokes, shading, or other aspects you want to inspect along the relevant directions in your embedding.
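The neutralize step in that debiasing paper amounts to removing a vector’s component along an identified bias direction. A toy sketch with made-up 3-d vectors (in the paper, the direction comes from differences like vec("he") − vec("she") in real word embeddings):

```python
import numpy as np

def remove_direction(v, direction):
    """Project v onto the subspace orthogonal to `direction` (the neutralize step)."""
    d = direction / np.linalg.norm(direction)
    return v - (v @ d) * d

# Toy "embeddings"; values are invented purely for illustration.
he = np.array([1.0, 0.2, 0.1])
she = np.array([-1.0, 0.2, 0.1])
bias_direction = he - she

programmer = np.array([0.6, 0.9, 0.3])
debiased = remove_direction(programmer, bias_direction)

print(debiased @ bias_direction)  # ~0: no component along the bias direction remains
```

The same projection trick works for any semantically meaningful direction you discover in an embedding, which is what makes these directions usable as interpretable “measures.”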
Is there a word2vec embedding for art so that you can transform a piece of music by just adding a vector to it?
AK) Mozart + Metallica – Beatles = Beethoven? As long as you can take a piece of music and properly model it, there’s no reason in principle why you couldn’t embed a specific piece of music into some kind of metric space (either as a single point or perhaps more intuitively as a sequence of points) that you could then manipulate within that space to transform along specific directions. E.g., you could modulate the key of a piece or a specific passage from major to minor, or perhaps along a direction that changes the instrument playing from clarinet to oboe.
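The arithmetic works the same way in any embedding: add and subtract vectors, then retrieve the nearest neighbor by cosine similarity. A toy sketch with invented vectors (word2vec-style mechanics, not a trained model):

```python
import numpy as np

def nearest(query, vocab, exclude):
    """Return the vocab word whose vector has the highest cosine similarity to `query`."""
    best, best_sim = None, -np.inf
    for word, vec in vocab.items():
        if word in exclude:
            continue
        sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Invented 3-d vectors laid out so that king - man + woman lands near queen.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.5, 0.0]),
}

query = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(query, vocab, exclude={"king", "man", "woman"}))  # queen
```

A hypothetical music2vec would do exactly this, only with points (or sequences of points) representing pieces or passages instead of words.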
I haven’t come across music2vec yet but I eagerly await its arrival. I imagine that this would come about via sequence-to-sequence models such as LSTMs since music, like text, is inherently sequential.
For visual art embeddings you could use convolutional neural networks.
I think autoencoders could be useful here too.
If we can computerize art by massively generating it automatically, don’t we create inflation? What is the value of artificial art? Isn’t art supposed to be rare and unique?
AK) This question was posed 80 years ago in a very famous essay titled “The Work of Art in the Age of Mechanical Reproduction” by Walter Benjamin, wherein the art historical properties of prints of famous works of art are examined. At issue was the value of art in an age where it was — all of a sudden — possible to print a poster of Picasso’s Les Demoiselles d’Avignon that you could sell to anyone trying to find something to put up on the walls of their bedrooms, dorm rooms, apartments, homes, offices, restaurants… you name it. The mechanical reproduction of art doesn’t devalue art itself simply by virtue of increasing access to and awareness of rare and unique art. However, it does shift the typical experience of art from active engagement to passive consumption.
Another famous take on this question is Clement Greenberg’s essay “Avant-Garde and Kitsch”, wherein two diametrically opposed categories of art — avant-garde art and kitsch art — are described and contrasted to highlight the purposes and properties of each.
Greenberg paraphrases Aristotle in suggesting that if all art and literature are imitation (of reality), then avant-garde art is the imitation of imitation: it is art concerned with the process of creating art for art itself, independent of external meaning. He then contrasts this with kitsch by saying that kitsch is not concerned with the process of art but with the effect of art (on a consumer). At first glance, it would seem as though generating art with artificial intelligence is sort of both avant-garde and kitsch. However, this process is very mechanistic and literally formulaic. We’re not at a point where ML models are all of a sudden generating art that is as original or as emblematic of artistic genius as, e.g., the first Pollock. Sure, generative machine learning models trained on art are mimetic w.r.t. the process of creating art (e.g., DRAW: A Recurrent Neural Network For Image Generation), but these models create art that has the effect of looking like art we already know about. I.e., it’s definitely kitschy to generate a Van Gogh styled photo of a landscape someone took on their cell phone. Or a Warhol or Lichtenstein selfie. At this point, it’s kitschy to generate a work of art end-to-end from a model.
Let me be clear: kitschy art generated by a model is not artificial art. It’s art. Just because it is possible to massively generate art in an automatic way does not mean that all other art is thereby devalued. If anything, I would argue that creating huge quantities of kitschy art helps highlight the uniqueness, rarity, and value of art that is not automatically generated.
We’re beginning to see the power of artists using artificial intelligence as part of the overall creative/artistic process. As we saw with Sunspring, it’s possible to use AI as a tool in this process, not to replace and automate the process itself. You can use AI as part of the process of creating art and still be avant-garde. AI is adding value to the creation of new art and is highlighting the importance of art in a society that otherwise seems increasingly and singularly obsessed with STEM.
This is an exciting time where new tools are being developed and artists are trying them out. Can’t wait to see what comes next.

Alex Korbonits is a Data Scientist at Remitly, Inc., where he works extensively on feature extraction and putting machine learning models into production. Outside of work, he loves Kaggle competitions, is diving deep into topological data analysis, and is exploring machine learning on GPUs. Alex is a graduate of the University of Chicago with degrees in Mathematics and Economics.
Interview with Beverly Wright, Executive Director, Business Analytics Center, Georgia Institute of Technology
Our past Technical Chair interviewed Beverly Wright, Executive Director, Business Analytics Center, Georgia Institute of Technology.
There is a lot of talk about using machine learning for different business applications. Why has this become such a popular topic?
BW) There are several reasons why I suspect we’re seeing such an increase in machine learning for business. One of the biggest drivers of ML’s growth in popularity is data:
We’ve seen an incredible increase in the amount of data we can capture and use. We can’t efficiently explore that data through manual processes, but we know that insights are buried within it. Machine learning can help guide us earlier in the analytics lifecycle, to figure out our business questions and start to give a sense of direction.
Many companies are talking about machine learning, but usage appears to happen at different rates, levels, and intensities. Why are the adoption rates different?
BW) On the surface, using machine learning to improve business decision-making seems like a no-brainer. You might wonder why all companies have not adopted this approach full force. Before any new technique, process, system, or even mindset is adopted, several challenges and boundaries need to be overcome. Adoption rates most likely vary for a number of reasons, including cultural acceptance, comfort with data-driven decision-making, technical expertise, and the ability to understand and interpret results in relevant ways, among many other factors. What might seem like an easy and logical decision from the outside could take a long time to implement for a variety of reasons.
Where is machine learning going for business? Is this a fad, or will we see more? What do you think the future holds for machine learning in business applications?
BW) I think we’ll see more machine learning applications in business, particularly as large innovative companies have improved and proven the value. The diversity of data structures, coupled with the increased volumes, plus the need for real-time answers, all beg for new ways of harnessing and using data for business decisions.

Dr. Beverly Wright leads the Business Analytics Center at Georgia Institute of Technology’s Scheller College of Business. Beverly brings over twenty years of marketing analytics and insights experience from corporate, consulting and academia. In her consultative roles for both nonprofits and for profit businesses, she has solved critical issues through the use of modeling and advanced analytics. Her academic experience spans over a decade with a strong emphasis toward community engagement and experiential learning. She’s also worked for companies within or leading Marketing Analysis departments.
Beverly earned a PhD in Marketing Analysis, a Master of Science degree in Analytical Methods, and a Bachelor of Business Administration degree in Decision Sciences from Georgia State University. She has also received a Professional Research Certification from the Marketing Research Association and CAP certification from INFORMS. Dr. Beverly Wright regularly presents at professional and academic conferences, as well as publishes articles in various business journals.
Interview with Navdeep Gill, Data Scientist, H2O.ai
Our past Technical Chair interviewed Navdeep Gill about his thoughts on various machine learning techniques, debugging models, and visualizing high-dimensional data.
What is the challenge in debugging a machine learning model?
NG) Currently, traditional software programming uses Boolean-based logic that can be tested to confirm that the software does what it was designed to do, using tools and methodologies established over the last few decades. In contrast, machine learning is essentially a black box programming method in which computers program themselves with data, producing probabilistic logic that diverges from the true-and-false tests used to verify systems programmed with traditional Boolean logic methods. The challenge here is that the methodology for scaling machine learning verification up to a whole industry is still in progress. We have some clues for how to make it work, but we don’t have the decades of experience that we have in developing and verifying “regular” software.
Instead of traditional test-suite assertions that respond with true, false or equal, machine learning test assertions need to respond with assessments — for example, that the results of today’s experiment had an AUC of 0.95 and are consistent with tests run yesterday. Another challenge is that machine learning systems trained on data produced by humans, with inherent human biases, will duplicate those biases in their models. A method of measuring what these systems do compared to what they were designed to do is needed in order to identify and remove bias.
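An assessment-style assertion of the kind described above might look like this in practice: compare today’s metric against a stored baseline with a tolerance, rather than testing exact equality. The baseline value and threshold here are illustrative assumptions, not H2O.ai’s methodology.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a production training job.
X, y = make_classification(n_samples=2000, class_sep=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

BASELINE_AUC = 0.95   # stored from yesterday's run (illustrative value)
TOLERANCE = 0.05      # acceptable drift before the test fails

# The "assessment": not true/false on exact output, but consistency within tolerance.
assert auc >= BASELINE_AUC - TOLERANCE, f"AUC regressed: {auc:.3f}"
print(f"AUC {auc:.3f} is consistent with baseline {BASELINE_AUC}")
```

The same pattern extends to any metric you track (log loss, calibration, per-segment error), turning probabilistic behavior into something a CI pipeline can gate on.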
Furthermore, traditional software is modular, which lends itself to isolating the inputs and outputs of each module to identify which one has the bug. In machine learning, however, the system has been programmed with data: any bug will be replicated throughout the system, and changing one thing changes everything. There are techniques for detecting that there is an error, and there are methods for retraining machine learning systems, but there is no way to fix a single isolated problem, which is another huge challenge in itself.
To paraphrase: a better set of tools is needed, and the entire tool set needs to be updated to move forward. People in industry, including at H2O.ai, are working on this exact problem to ensure that models built in production are as accurate as possible and produce results aligned with the real world.
What are the limitations of explaining a model with histograms and scatterplots?
NG) The limitations come from the dimensionality of datasets and the dimensionality built up during model building. Most machine learning models express interesting interactions among features in an n-dimensional space. Visualizing the effect of one or two variables on an outcome is simple, but once you go beyond two dimensions we run into problems with interpretability and approach the infamous curse of dimensionality. For example, tree models and deep learning involve a tremendous number of interactions between features, and these interactions only become more complex as the feature space grows. This can pose a huge hurdle in terms of scaling out to traditional visualizations, and thus increases the difficulty of explaining a machine learning model using traditional visualization methodology.
Is TensorBoard a step forward in visual debugging of a model?
NG) Yes, I believe so. TensorBoard allows you to visualize the graph itself, which is quite helpful for most users of TensorFlow. When building and debugging new models, it is easy to get lost in the weeds. For me, holding a mental context for a new framework and a model I’m building to solve a hard problem is already pretty taxing, so it can be really helpful to inspect a totally different representation of the model; the TensorBoard graph visualization is great for this.
What is the next step for visualizing machine learning?
NG) The next step involves building visualizations that can be used to educate, demonstrate business value, and be explored interactively for all machine learning purposes. A great example of this is R2D3, a project by Tony Chu from H2O.ai. This visual tutorial walks you through a machine learning use case, explaining each step along the way with insightful visuals that can pique the interest of anyone in the field of machine learning. These kinds of visualizations can also be built out to show stakeholders the business value of almost any business problem that requires machine learning. In addition, they can help a resident Data Scientist gain insight into their data and machine learning models with ease.
What is the solution to visualizing big, high-dimensional data?
NG) Two possible solutions to this problem are dimensionality reduction and feature selection. Dimensionality reduction involves taking your feature space and projecting it down to a lower-dimensional space, which can help visualize huge data sets. Some examples are principal component analysis (PCA), multidimensional scaling and t-SNE, to name a few.
Feature selection allows you to choose the “best” features from your data, which you can then visualize. “Best” could be defined based on your cost function or on variable importance. For example, you could visualize the top five features that are most important to your model, or you could use some criterion such as Weight of Evidence or Information Value to choose the “most important” variables. Beyond these two approaches, one can also use visualizations such as parallel coordinates, scatterplot matrices, glyph plots, Andrews plots or arc diagrams.
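A minimal sketch of the dimensionality-reduction route: project a 64-dimensional dataset down to 2-D with PCA before plotting (`sklearn.manifold.TSNE` would be a drop-in alternative for nonlinear structure):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)       # 1797 samples, 64 pixel features
coords = PCA(n_components=2).fit_transform(X)

print(X.shape, "->", coords.shape)        # (1797, 64) -> (1797, 2)
# `coords` can now go straight into a scatter plot, colored by the label `y`,
# to visualize cluster structure that no 64-dimensional plot could show.
```

The trade-off is that each 2-D axis is now a mixture of the original features, so interpretability shifts from individual variables to directions in the projected space.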

Navdeep Gill is a Data Scientist at H2O.ai. He graduated from California State University, East Bay with a M.S. degree in Computational Statistics, B.S. in Statistics and a B.A. in Psychology (minor in Mathematics). During his education he discovered an interest in machine learning, time series analysis, statistical computing, data mining and data visualization.
Prior to H2O.ai Navdeep worked at several startups and Cisco Systems, focusing on data science, software development and marketing research. Before that, he was a consultant at FICO working with small- to mid-size banks in the U.S. and South America focusing on risk management across different bank portfolios (car loans, home mortgages and credit cards).
Interview with Erin LeDell, Machine Learning Scientist, H2O.ai
Our past Technical Chair discussed Erin LeDell’s upcoming talk, Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches, at MLconf Seattle, scheduled for May 20th.
Xavier Amatriain responded to Pedro Domingos with a tweet that ensembling is the master algorithm. Would you agree?
EL) If there is one supervised learning algorithm that I’d consider a “master algorithm,” then yes, I would consider ensembling — particularly stacking, also known as “Super Learning” — to be that algorithm. The “best” supervised learning algorithm that exists now, or is invented in the future, can always be incorporated into a Super Learner to achieve better performance. That’s the power of stacking. I don’t believe that there’s a single algorithm that can consistently outperform all other algorithms on all types of data — we’ve all heard the saying that “there’s no free lunch”; that’s why ensembles are generally more useful than a single algorithm. The “No Free Lunch Theorem” is credited to David Wolpert, the inventor of stacking, incidentally.
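Stacking in this sense trains a metalearner on the cross-validated predictions of diverse base learners. A small scikit-learn sketch of the idea (H2O’s own implementation differs in scale and detail; the base learners here are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Diverse base learners; the metalearner (final_estimator) is fit on their
# out-of-fold predictions, which is what distinguishes stacking from averaging.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(f"held-out accuracy: {stack.score(X_te, y_te):.3f}")
```

Any future algorithm slots in as just another entry in `estimators`, which is the sense in which stacking can absorb whatever “best” learner comes along.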
The way Pedro Domingos describes the “master algorithm” is more than just a supervised learning algorithm — he talks about the importance of having a feedback loop, or a holistic learning system that includes machine learning algorithms at the center. Then there’s also Artificial General Intelligence (AGI), which is probably more of a master algorithm than anything else. For more information about the current state of AGI, look into the work of Ben Goertzel.
Does ensembling destroy model interpretability? A decision tree offers a nice path to explaining predictions, but a forest?
EL) Ensembles are inherently more complex and opaque than their singleton counterparts. However, anyone who cares about model performance will have a hard time justifying the use of a decision tree or linear model over an ensemble learner.
Some use-cases value model interpretability over model performance and that’s where I’ve seen people make this trade-off. However, there is a lot of work going into black-box model interpretability and I am optimistic about some of the recent advancements in that area. For example, Local Interpretable Model-Agnostic Explanations (LIME), is a method for explaining the predictions of any classifier.
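The idea behind LIME can be sketched without the library: perturb an input, weight the perturbed samples by proximity to it, and fit a weighted linear surrogate whose coefficients explain the local prediction. This is a schematic of the intuition, not the actual LIME algorithm’s full recipe (which also handles interpretable feature representations and sample selection):

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(predict_fn, x, n_samples=500, scale=0.5, seed=0):
    """Fit a proximity-weighted linear model around the point x."""
    rng = np.random.default_rng(seed)
    Xp = x + scale * rng.standard_normal((n_samples, x.size))  # perturbations
    yp = predict_fn(Xp)                                        # black-box outputs
    # Closer perturbations get exponentially more weight.
    weights = np.exp(-np.sum((Xp - x) ** 2, axis=1) / (2 * scale ** 2))
    surrogate = Ridge(alpha=1.0).fit(Xp, yp, sample_weight=weights)
    return surrogate.coef_

def black_box(X):
    """Toy opaque model: only the first feature matters."""
    return 3.0 * X[:, 0]

coefs = local_surrogate(black_box, np.array([1.0, 1.0]))
print(coefs)  # roughly [3, 0]: the surrogate recovers the local feature importance
```

The surrogate is faithful only near `x`, which is exactly the “local” in Local Interpretable Model-Agnostic Explanations: a different neighborhood of a nonlinear model can yield entirely different coefficients.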
What was the impact of your recent scalable implementation at H2O? What was the speedup versus the legacy implementation? What was the impact for your current customers?
EL) The H2O Ensemble project is a truly scalable ensemble learner because it is built on top of H2O, an open source, scalable, distributed machine learning platform. H2O Ensemble implements the stacking algorithm for big data.
I suppose the “legacy” stacking implementation is the SuperLearner R package. Other than Weka’s Stacking function, the SuperLearner and subsemble R packages contain the only proper stacking implementations that I’m aware of. The speed-up of H2O Ensemble over the SuperLearner package is essentially unlimited because there is no limit to the size of an H2O cluster. There are more details in my dissertation.
I think it’s fairly common for companies to deploy simple linear models into production because it’s the easiest algorithm to scale. Some of the more sophisticated teams deploy a GBM or Random Forest. This leaves a lot of model performance (and associated revenue) on the table. When model performance translates directly to revenue, the advantages of ensembling are very convincing.
Have you ever seen a case where ensembling was not a good choice?
EL) Yes, there are cases where I’ve been able to achieve equal or slightly better performance using a single Gradient Boosting Machine (GBM), for example, than I have with an ensemble containing that same GBM. This means that the GBM is able to approximate the true prediction function well, and I haven’t done a good enough job of creating a diverse set of base learners for my ensemble, or I’ve made a poor choice of metalearning algorithm for the Super Learner.
This situation is far less likely when using stacking (vs some more basic ensembling scheme), however it is important to acknowledge that it happens from time to time. In addition to the causes above, I’ve seen this happen when I’ve spent a lot of time fine-tuning one or more of the constituent algorithms, or when there is not enough training data.
How easy is it to jump from Biostatistics to Data Science? Is the reverse path harder?
EL) For me, there was no division between the two. As a Biostatistics PhD student, I used real-life, clinical datasets as my motivation for developing new algorithms and software, which is very close to what I do in industry now. Not all biostatisticians spend as much time developing software as I did, but I think there is a good overlap between the two. People who chose a Biostatistics PhD program over a Statistics PhD program may be more interested in developing methodology and software for messy real-life data and applications.
As for the reverse path, as long as you have a math/stats background, then I think the transition from Data Science to Biostatistics/Statistics would be easy.
Is the machine learning/data science hype stagnating the life sciences by attracting the best talent? Is there a way to steer scientists into curing cancer rather than increasing ad clicks?
EL) It’s true that a lot of data scientists come from a pure science background — there are many astronomers, physicists, engineers who have left their respective fields for a career in data science. I have many friends with PhDs (from very good schools!) who can’t find a decent academic or industry job in their area of research. That is a shame, no doubt. The issue is not that the data science hype is causing these folks to leave their fields and enter data science — it’s that our country/society does not prioritize basic research enough to support research-based career paths. Rather than going to work at a consulting firm or applying for their third post-doc, data science is offering these folks a career using many of their highly valuable skills, such as data analysis and programming (although often in a different subject area).
I think how you apply your data science / machine learning skillset matters a great deal. Some people say that “data is just data” and therefore it doesn’t matter how they apply their skills, but whether that’s a belief you subscribe to or not — everyone is responsible for what they create in this world.
We are at a point where Data Scientists and Computer Scientists hold quite a bit of power and we have an opportunity to decide how and where we’d like to wield this power. Perhaps this will upset some people who work in this industry (or maybe they will even agree with me), but I believe that ad-click related jobs are a horrible waste of our collective human potential.
The healthcare industry, at least in the United States, is one of the least nimble and technologically backwards industries that we have. It is hard to innovate in this area, which might be a reason why smart, talented people stay away. On the other hand, we are starting to see new companies applying machine learning to healthcare in highly innovative and exciting ways. For example, Jeremy Howard’s Enlitic is applying deep learning to medical imaging. The more successful examples that we have of this type of thing, the more that will inspire data scientists to leave their ad-click jobs for careers that lead to curing cancer.

Erin LeDell, Machine Learning Scientist, h2o.ai
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc.
Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from UC Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
Interview with Kristian Kersting, Associate Professor for Computer Science, TU Dortmund University
Our past Technical Chair, discussed Kristian Kersting’s upcoming talk: Declarative Programming for Statistical ML at MLconf Seattle, scheduled for May 20th.
Why do you think expressing machine learning in a relational way would democratize machine learning?
KK) Consider a typical machine learning user in action solving a problem for some data. She selects a model for the underlying phenomenon to be learned (choosing a learning bias), formats the raw data according to the chosen model, and then tunes the model parameters by minimizing some objective function induced by the data and the model assumptions. Often, the optimization problem solved in the last step falls within a class of mathematical programs for which efficient and robust solvers are available. Unfortunately, however, today’s solvers for mathematical programs typically require that the program be presented in some canonical algebraic form, or offer only a very restricted modeling environment. The process of turning the intuition that defines the model “on paper” into a canonical form can be quite cumbersome. Moreover, the reusability of such code is limited, as relatively minor changes to the model can require large modifications of the code. This is where declarative modeling languages such as RELOOP enter the stage. They free the machine learning user from thinking about the canonical algebraic form and instead help her focus on the model “on paper”. They allow the user to abstract over entities and, in turn, to formulate general objectives and constraints that hold across different situations. All this increases the ability to rapidly combine, deploy, and maintain existing algorithms. To be honest, however, relations are not everything. This is why we embedded the relational language into an imperative language featuring for-loops.
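The gap between the model “on paper” and its canonical algebraic form can be made concrete with a toy compiler. The entity names and constraints below are invented for illustration and are not RELOOP syntax: a rule stated once over entities is expanded into one matrix row per grounding of the canonical form min cᵀx s.t. Ax ≤ b.

```python
# Variables are named after entities; constraints are written once per rule,
# then compiled into the canonical form  min c^T x  s.t.  A x <= b.
entities = ["plant1", "plant2", "plant3"]
index = {e: i for i, e in enumerate(entities)}

# Declarative layer: "every entity's allocation is at most 1.0" and
# "allocations of related entities differ by at most 0.2".
related = [("plant1", "plant2"), ("plant2", "plant3")]

A, b = [], []
for e in entities:                      # ground rule 1: x_e <= 1.0
    row = [0.0] * len(entities)
    row[index[e]] = 1.0
    A.append(row)
    b.append(1.0)
for u, v in related:                    # ground rule 2: |x_u - x_v| <= 0.2
    for sign in (1.0, -1.0):            # two one-sided rows per pair
        row = [0.0] * len(entities)
        row[index[u]] = sign
        row[index[v]] = -sign
        A.append(row)
        b.append(0.2)

c = [-1.0] * len(entities)              # objective: maximize total allocation
```

Two rules stated over entities ground out to seven matrix rows here; in a real application the same rules might ground to millions of rows, which is exactly the bookkeeping a declarative language takes off the user's hands.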
The big data hype has led a lot of people to focus on improving the scalability of exotic algorithms. You have chosen two, Linear Programming and Quadratic Programming, to do machine learning, and focus more on feeding the data in. What are the pros and cons of this approach?
KK) We just started with LPs and QPs since they are the workhorses of classical machine learning. This way we build a basis for what might be called relational statistical learning. Afterwards, we will move on to other machine learning approaches, maybe even deep learning approaches.
Do you think it’s more important for a data scientist to easily customize the objective and encode constraints rather than use a fancy ML algorithm?
KK) Good question. I think this depends on the application. We are collaborating a lot with plant physiologists. They once asked me, ‘what is the biological meaning of an eigenvector?’ Quite difficult to answer. Or consider stochastic gradients. They argued, ‘so you want me to throw away 90% of my data? How do I explain this to my students, who spent 200 days in the field to gather the data?’ It is similar with deep learning. They want to trust the algorithm, at least at the beginning of a data science project, and to gain insights into their data. Here is where focusing on the constraints can help; they might be easier to understand, at least when encoded in a high-level language. Or consider collective classification, i.e., the classification of one entity may change the classification of a related entity. Typically one uses a kernel to realize this when using support vector machines. However, just placing some additional constraints encoding that related entities should be on the same side of the hyperplane can do the job, too, as this also captures the manifold structure in the high-dimensional feature space. Unfortunately, in contrast to the AI community, the ML community has not really developed a methodology for constraints yet.
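As a toy illustration of the last point — and not Kersting's actual formulation — the sketch below swaps the hard "same side of the hyperplane" constraints for a soft quadratic penalty that pulls the scores of related entities together. The data, edges, and hyperparameters are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two noisy clusters; edges connect entities known to be related
X = np.vstack([rng.normal(-1, 0.3, size=(10, 2)),
               rng.normal(1, 0.3, size=(10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)
edges = [(i, i + 1) for i in range(9)] + [(i, i + 1) for i in range(10, 19)]

w = np.zeros(2)
lam, lr = 0.5, 0.05                     # penalty weight and learning rate
for _ in range(300):
    scores = X @ w
    # Hinge-loss subgradient for the labeled objective
    margin = y * scores
    g = -(X * y[:, None])[margin < 1].sum(axis=0)
    # Soft penalty lam * (score_u - score_v)^2 per edge: related entities
    # are pushed toward the same score, hence the same side of the hyperplane
    for u, v in edges:
        diff = scores[u] - scores[v]
        g = g + 2 * lam * diff * (X[u] - X[v])
    w -= lr * g / len(X)

preds = np.sign(X @ w)
```

A hard-constrained LP/QP version would instead add one linear constraint per edge; the soft-penalty form is just the easiest to demonstrate with plain gradient descent.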
There is a big crowd from engineering, with expert skills in optimization, that has struggled to get into data science and earn the corresponding salary. Do you think you are opening a door for them?
KK) Hopefully we can at least help. Optimisation is definitely one of the foundations of statistical machine learning. High-level languages for optimization will hopefully make it easier to talk about the models and hence bridge the disciplines even further.
Do you see the solvers becoming scalable enough so that your approach can be applied to big data? Is there a different path?
KK) Hmm, scalability is always an issue. However, it is not just the solver but the way the solver interacts with the modeling language. Consider, e.g., relational models. They often have symmetries that can be used to reduce the model automatically. This is sometimes called lifted inference. And we have just started to exploit structure within statistical machine learning. Imagine a cutting plane solver that is not computing the most violated constraint but the most violated one that is also fastest to compute. As yet another example, one of my PhD students, Martin Mladenov, just found a nice way to combine matrix-free optimization methods with relational languages such as RELOOP. With this we can solve problems involving a billion non-zero variables faster than Gurobi. So at least there is strong evidence that a new generation of solvers can scale well, even better than existing ones. Moreover, instead of compiling to an intermediate structure, why not compile directly into a low-level C/C++ program that implements a problem-specific solver? In a sense, I envision a “-O” flag for machine learning, very much like we know it from C/C++ compilers.
How does your Relational Linear Programming play along with Relational Databases?
KK) Relational DBs have been the home of high-value, data-driven applications for over four decades. This may explain why you see a push in industry to marry statistical analytics frameworks like R and Python with almost every data processing engine. As a machine learner this is nice, as you do not have to worry about data management and retrieval anymore. However, it is tricky to just map the data from a relational DB into a single table, the traditional representation for machine learning. You are likely to change the statistics. We need a relational machine learning that can deal with entities and relations. We just started with LPs and QPs since they are the workhorses of classical machine learning. In the long run, we want to develop a tight integration of Relational Databases and Machine Learning, maybe even something like Deep Relational Machines.

Kristian Kersting, Associate Professor for Computer Science, TU Dortmund University
Kristian Kersting is an Associate Professor for Computer Science at the TU Dortmund University, Germany. He received his PhD from the University of Freiburg, Germany, in 2006. After a PostDoc at MIT, he moved to the Fraunhofer IAIS and the University of Bonn using a Fraunhofer ATTRACT Fellowship. His main research interests are data mining, machine learning, and statistical relational AI, with applications to medicine, plant phenotyping, traffic, and collective attention. Kristian has published over 130 technical papers, and his work has been recognized by several awards, including the ECCAI Dissertation Award for the best AI dissertation in Europe.
He gave several tutorials at top venues and serves regularly on the PC (often at the senior level) of the top machine learning, data mining, and AI venues. Kristian co-founded the international workshop series on Statistical Relational AI and co-chaired ECML PKDD 2013, the premier European venue for Machine Learning and Data Mining, as well as the Best Paper Award Committee of ACM KDD 2015. Currently, he is an action editor of DAMI, MLJ, AIJ, and JAIR as well as the editor of JAIR’s special track on Deep Learning, Knowledge Representation, and Reasoning.