Our past Technical Chair interviewed Dr. Le Song, Assistant Professor in the College of Computing at the Georgia Institute of Technology, about his upcoming presentation, Understanding Deep Learning for Big Data, scheduled for 09/23/16 at MLconf Atlanta.
You have done a lot of work on kernel methods. A year ago there were indications that kernel methods are not dead and could match or outperform deep nets. Is that the case, or is it time for them to retire?
LS) I think kernel methods and deep learning are cousins. The field needs to combine them rather than throwing either away.
In fact, they share many similarities; for example, both try to learn nonlinear functions. One can design kernel functions to capture problem structure, just as one can choose a deep learning architecture according to the problem at hand. More interestingly, the feature maps of kernel functions have a one-to-one correspondence with the activation units in deep learning. For instance, the arc-cosine kernel is an infinite combination of rectified linear units. Hence researchers also call kernel methods infinite neural networks.
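The arc-cosine/ReLU correspondence can be checked numerically. The sketch below is my own illustration, not code from the interview: the degree-1 arc-cosine kernel (Cho & Saul, 2009) has a closed form, and averaging the products of many random ReLU units recovers it up to Monte Carlo error.

```python
import numpy as np

def arccos_kernel(x, y):
    # Degree-1 arc-cosine kernel (Cho & Saul, 2009):
    # k(x, y) = (1/pi) * ||x|| * ||y|| * (sin t + (pi - t) * cos t),
    # where t is the angle between x and y.
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(np.dot(x, y) / (nx * ny), -1.0, 1.0)
    t = np.arccos(cos_t)
    return (nx * ny / np.pi) * (np.sin(t) + (np.pi - t) * cos_t)

def relu_feature_kernel(x, y, n_units=200_000, seed=0):
    # Monte Carlo counterpart: 2 * E_w[relu(w.x) * relu(w.y)] with
    # w ~ N(0, I) converges to the degree-1 arc-cosine kernel as the
    # number of random ReLU units grows -- the "infinite network" view.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_units, len(x)))
    fx = np.maximum(W @ x, 0.0)
    fy = np.maximum(W @ y, 0.0)
    return 2.0 * np.mean(fx * fy)

x = np.array([1.0, 0.5, -0.3])
y = np.array([0.2, -1.0, 0.7])
exact = arccos_kernel(x, y)
approx = relu_feature_kernel(x, y)
```

With a couple hundred thousand random units, the two values agree closely, which is exactly the sense in which a kernel method is an infinitely wide one-layer network.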
The main algorithmic difference between kernel methods and deep learning is that the parameters of the kernels are typically fixed and only the classifiers are learned, while the parameters of the activation units in deep learning are learned together with the classifiers. This difference allows deep learning models to be more flexible given the same number of parameters. In the face of big data, it also means that deep learning is more scalable given the same model flexibility. Very few works have seriously compared deep learning with kernel methods in the big data regime, simply because nobody knew how to scale up kernel methods without sacrificing model flexibility. We did some fundamental work on scaling up kernel methods, and we found matching performance between kernel methods and deep learning. And nowadays, when researchers talk about learning the kernel functions themselves, the line between kernel methods and deep learning really blurs.
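One standard route to scaling kernel methods, which the scalable-kernel work mentioned above builds on, is to replace the implicit kernel with an explicit randomized feature map (random Fourier features, Rahimi & Recht). A minimal numpy sketch of my own, assuming the RBF kernel exp(-gamma * ||x - y||^2), not Dr. Song's exact method:

```python
import numpy as np

def rff_features(X, n_features=500, gamma=1.0, seed=0):
    # Random Fourier features: build z(x) so that z(x) . z(y)
    # approximates the RBF kernel exp(-gamma * ||x - y||^2).
    # The RBF kernel's spectral density is Gaussian with variance 2*gamma.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Sanity check: feature dot products approximate the exact kernel matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
Z = rff_features(X, n_features=20_000, gamma=0.5, seed=2)
K_approx = Z @ Z.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * sq_dists)
max_err = np.abs(K_approx - K_exact).max()
```

After this transformation, any fast linear learner (e.g. SGD on the features Z) behaves like an approximate kernel machine, which is what makes the big-data comparison with deep nets possible at all.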
The main theoretical difference between kernel methods and deep learning is that there is an almost complete body of theory explaining the behavior of kernel methods, while there is almost no satisfactory theory explaining the behavior of deep learning. When researchers talk about kernel methods, they mean not just a set of algorithms but also the unique tools and theory developed to analyze those algorithms and provide guarantees for them. Researchers, including us, are working on explaining the generalization ability of deep learning models. In fact, we are now working on using tools from kernel methods to understand the generalization ability of deep learning models.
I think the next big thing will be kernel methods + deep learning + graphical models.
Deep nets have much higher information capacity than linear models, but if the data is noisy, all this capacity may simply be filled with noise. Are there any practical methods for deciding when a deep net is worth using versus a simpler model?
LS) Use cross-validation.
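That one-line answer can be unpacked as: hold out folds of the data, fit each candidate model on the rest, and pick the model with the lower held-out error. A toy numpy example of my own (a polynomial stands in for the "flexible" model; the function names are hypothetical) showing a high-capacity model losing to a simple one on noisy data:

```python
import numpy as np

def cv_mse(X, y, degree, k=5):
    # k-fold cross-validated mean squared error for a polynomial
    # least-squares fit; folds are contiguous blocks for simplicity.
    idx = np.arange(len(X))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(X[train], y[train], degree)
        pred = np.polyval(coef, X[fold])
        errs.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(errs))

# Noisy data from a genuinely simple (linear) function: cross-validation
# should prefer the low-capacity model, because the flexible one spends
# its extra capacity fitting the noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 60))
y = 2.0 * X + rng.normal(scale=0.5, size=60)
err_simple = cv_mse(X, y, degree=1)
err_flexible = cv_mse(X, y, degree=15)
```

If the data had real nonlinear structure exceeding the noise, the comparison would flip, and that is the whole point: cross-validation lets the data, not intuition, decide the capacity question.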
Given the success of deep nets, data scientists and researchers often wonder whether it is time to give up on traditional ML. Is there a reason why we should keep investing in other schools of ML, like Bayesian modeling, graphical models, etc.?
LS) Definitely, we should keep investing in other schools of ML, like Bayesian modeling and graphical models, and we should keep teaching students these subjects. Beyond the possibility that new and better methods will emerge from these areas, they can also inspire deep learning models. We recently invented a completely new deep learning architecture based on the belief propagation algorithm from graphical models. It is now the best method for learning representations of discrete structures, such as taxonomies, drug molecules and social networks.
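To make the belief-propagation inspiration concrete, here is a minimal message-passing sketch in the spirit of such architectures, not the published model itself; the weights, dimensions, and "atom type" features are all hypothetical. Each round, every node's embedding is updated from its own features and an aggregate of its neighbors' embeddings, echoing how messages flow in belief propagation.

```python
import numpy as np

def embed_graph(adj, node_feats, n_rounds=3, dim=8, seed=0):
    # Illustrative message-passing embedding (hypothetical weights):
    # mu_v <- tanh(W1 . feats_v + W2 . sum of neighbor embeddings),
    # iterated for a few rounds, like synchronous message updates.
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(node_feats.shape[1], dim))
    W2 = rng.normal(scale=0.1, size=(dim, dim))
    mu = np.zeros((adj.shape[0], dim))
    for _ in range(n_rounds):
        mu = np.tanh(node_feats @ W1 + adj @ mu @ W2)
    # Sum-pool node embeddings into one graph-level representation.
    return mu.sum(axis=0)

# Toy molecule-like graph: 4 nodes in a path, one-hot "atom type" features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = np.eye(4)[[0, 1, 1, 0]]
g = embed_graph(adj, feats)
```

Because the update only ever sees a node's own features and a sum over its neighbors, the graph-level embedding is invariant to how the nodes are numbered, which is exactly the property you want for discrete structures like molecules.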
Given all this research in deep learning, there seems to be a plethora of architectures. Is there a map or guideline for choosing the right architecture, or do we just have to follow the rule “do whatever Google does”?
LS) A good architecture for a particular problem is heavily based on domain knowledge and previously successful models. For instance, convolutional neural networks build on earlier image feature extractors such as pyramid features, and recurrent neural networks build on earlier successful sequence models such as hidden Markov models. Google definitely has more computing resources to search blindly for a better architecture on a set of canonical problems, but your problem may be different from those canonical problems.
Neural nets have been around for decades. Why did it take so long for them to take off? What do you think people missed in the 80s?
LS) I think neural networks are simply more scalable than the other nonlinear models out there. For big data, you need both flexibility and scalability, and neural networks have both.
Apart from the deep learning architecture labyrinth, scientists also have to navigate the platform labyrinth (TensorFlow, MXNet, Theano, Torch, Keras, etc.). Do you have any preference or advice?
LS) I prefer TensorFlow and MXNet. You can program in Python and easily interface with other programs. MXNet is very efficient and faster than TensorFlow; it also provides fast automatic multi-GPU and multi-machine parallelization.

Dr. Le Song, Assistant Professor in the College of Computing, Georgia Institute of Technology
Le Song is an assistant professor in the College of Computing, Georgia Institute of Technology. He received his Ph.D. in Machine Learning from the University of Sydney and NICTA in 2008, and then conducted post-doctoral research in the Department of Machine Learning, Carnegie Mellon University, between 2008 and 2011. Before joining the Georgia Institute of Technology, he was a research scientist at Google. His principal research direction is machine learning, especially nonlinear methods and probabilistic graphical models for large-scale and complex problems arising from artificial intelligence, social network analysis, healthcare analytics, and other interdisciplinary domains. He is the recipient of the NSF CAREER Award’14, AISTATS’16 Best Student Paper Award, IPDPS’15 Best Paper Award, NIPS’13 Outstanding Paper Award, and ICML’10 Best Paper Award. He has also served as an area chair for leading machine learning conferences such as ICML, NIPS and AISTATS, and as an action editor for JMLR.