One of our MLConf Program Committee members, Reshama Shaikh, recently interviewed Leslie N. Smith, PhD, Senior Research Scientist at the US Naval Research Laboratory.
RS) Tell us briefly about yourself and your work.
LS) I am very fortunate that I really like my work. Deep learning is a fascinating field and I enjoy discovering clever ideas. Also, I try hard to have my own clever ideas. The field is so hot that it is a challenge to go from clever idea to published paper before anyone else. It is like a competitive game and sometimes I win and other times not. I’ve lost count of the number of ideas I’ve had that others posted first. I am especially interested in discovering ideas that give us a deeper understanding of how and why deep networks work.
RS) What was your path into deep learning?
LS) I’d been working in the area of computer vision for several years when Krizhevsky, et al. (2012) showed a significant improvement in image classification by using a neural network. I started looking at deep learning in 2013 and soon my research interests focused on deep learning for computer vision.
RS) Your research introduced cyclical learning rates back in 2015. I learned of your research in Cyclical Learning Rates (CLR) in Fall 2017 from Jeremy Howard, who teaches fastai’s deep learning MOOC (http://course.fast.ai). For those who are unfamiliar, how would you explain what a CLR is?
LS) Learning rates are the most important hyper-parameter to set properly and can significantly affect the performance of your trained network. Common practice is to set the learning rate to a constant value and decrease it by an order of magnitude once the accuracy has plateaued. If one uses an initial constant learning rate that is larger or smaller than the optimal learning rate, then the network’s performance is degraded, but the only way to find an optimal learning rate was to perform a grid search over possible learning rates. This was a tedious and time-consuming process. My thought was to let the learning rate vary between a minimum and maximum learning rate value during the course of the training. It seemed easier to choose a range in the vicinity of the optimal learning rate than to find the optimal learning rate directly. Since the learning rate changes between these bounds, it spends part of the training close to the optimal value.
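The idea of letting the learning rate vary between a minimum and a maximum can be sketched as a simple triangular schedule. This is a minimal illustration, not code from the interview: the function name `triangular_lr` and the parameters `base_lr`, `max_lr`, and `step_size` (the number of iterations in half a cycle) are assumptions chosen for the example.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Return a learning rate that oscillates linearly between two bounds.

    iteration: index of the current training iteration, starting at 0.
    base_lr:   lower bound of the learning rate range (assumed name).
    max_lr:    upper bound of the learning rate range (assumed name).
    step_size: number of iterations in half a cycle (assumed name).
    """
    # Which cycle are we in? Cycle 1 covers iterations [0, 2 * step_size).
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x runs 1 -> 0 -> 1 over one cycle, so (1 - x) runs 0 -> 1 -> 0.
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    # Print the schedule at a few points of a 200-iteration cycle.
    for it in (0, 50, 100, 150, 200):
        print(it, triangular_lr(it, base_lr=0.001, max_lr=0.006, step_size=100))
```

With this shape, the rate starts at the lower bound, reaches the upper bound after `step_size` iterations, and returns to the lower bound after `2 * step_size` iterations, so any value inside the range is visited twice per cycle.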
Furthermore, I realized that a single run where the learning rate increased from a small value to a large value provides valuable insight as to the minimum, maximum, and optimal learning rate values. I called this the learning rate range test.
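A learning rate range test of this kind can be sketched on a toy problem. The sketch below is illustrative only: the function `lr_range_test`, its parameters, the geometric spacing of learning rates, and the stopping rule based on `diverge_factor` are all assumptions made for the example, and the one-dimensional quadratic loss stands in for a real network.

```python
def lr_range_test(loss_fn, grad_fn, w0, min_lr=1e-5, max_lr=10.0,
                  num_steps=100, diverge_factor=4.0):
    """Run one pass of gradient descent while the learning rate grows.

    Returns a list of (learning_rate, loss) pairs, where each loss is
    measured just before the update that uses that learning rate.
    All names and defaults here are hypothetical.
    """
    # Geometric growth from min_lr to max_lr (an assumed spacing choice).
    ratio = (max_lr / min_lr) ** (1.0 / (num_steps - 1))
    w, lr = w0, min_lr
    best_loss = float("inf")
    history = []
    for _ in range(num_steps):
        loss = loss_fn(w)
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        # Stop early once the loss has clearly blown up.
        if loss > diverge_factor * best_loss:
            break
        w = w - lr * grad_fn(w)  # one plain gradient descent update
        lr *= ratio
    return history


if __name__ == "__main__":
    # Toy stand-in for a network: loss(w) = (w - 3)^2.
    history = lr_range_test(lambda w: (w - 3.0) ** 2,
                            lambda w: 2.0 * (w - 3.0),
                            w0=0.0)
    for lr, loss in history[::10]:
        print(f"lr={lr:.5f} loss={loss:.6g}")
```

Plotting loss against learning rate from `history` gives the kind of curve the test relies on: the region where the loss starts to fall suggests a minimum learning rate, and the point where it begins to rise suggests a maximum.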
RS) The fastai community is thrilled with your research since it makes it easier to identify an optimal learning rate, a key hyperparameter in deep learning models. What are your thoughts on your significant contribution to the field?
Tweet: “With the addition of cyclical momentum (thanks to @GuggerSylvain) fastai is now the first library to fully integrate Leslie Smith’s 1cycle learning method, which means you can train your model 5x faster”
LS) I am very flattered that the fastai community is interested in my work. I believe that the learning rate is one of the most important hyperparameters to get right so a method to find an optimal value should be quite useful. Personally, I believe that the super-convergence method that grew out of the cyclical learning rates is even more significant. Using very large learning rates with the 1cycle learning rate policy leads to an order of magnitude speedup of training. That is worthwhile for everyone to notice.
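One possible shape for a 1cycle-style schedule is sketched below: a single linear ramp up to a large learning rate, a linear ramp back down, and a short tail that anneals well below the starting value. This is a sketch under stated assumptions, not the policy as implemented in any library: the function `one_cycle_lr` and the parameters `div_factor`, `final_div`, and `cycle_frac`, along with their default values, are hypothetical.

```python
def one_cycle_lr(step, total_steps, max_lr,
                 div_factor=10.0, final_div=100.0, cycle_frac=0.9):
    """Learning rate at `step` for a single-cycle schedule.

    div_factor: max_lr / div_factor is the starting rate (assumed default).
    final_div:  the final rate is the starting rate / final_div (assumed).
    cycle_frac: fraction of training spent in the up/down cycle (assumed).
    """
    initial_lr = max_lr / div_factor
    final_lr = initial_lr / final_div
    half = int(total_steps * cycle_frac) // 2  # length of each ramp
    if half < 1:
        raise ValueError("total_steps is too small for this schedule")
    if step < half:
        # Ramp up linearly from initial_lr to max_lr.
        return initial_lr + (max_lr - initial_lr) * step / half
    if step < 2 * half:
        # Ramp down linearly from max_lr back toward initial_lr.
        return max_lr - (max_lr - initial_lr) * (step - half) / half
    # Tail: anneal linearly from initial_lr down to final_lr.
    tail_steps = total_steps - 2 * half
    t = (step - 2 * half) / max(1, tail_steps - 1)
    return initial_lr + (final_lr - initial_lr) * t


if __name__ == "__main__":
    for s in (0, 22, 45, 67, 90, 99):
        print(s, one_cycle_lr(s, total_steps=100, max_lr=1.0))
```

In this sketch the peak learning rate is reached halfway through the cycle, and the final iterations run at a rate far below the starting value.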
RS) Your recent 20-page paper, “A Disciplined Approach to Neural Network Hyper-parameters: Part 1 – Learning Rate, Batch Size, Momentum and Weight Decay,” is filled with gold nuggets for the deep learning practitioner. How much time do you invest in producing one research paper?
LS) Last December I started on something totally different. I was testing an idea on how to best design skip connections in architectures. Looking at resnets, densenets, and now hyper-densenets, it is clear that skip connections are important but there doesn’t exist a rule of thumb guiding how to add skip connections in general. While running experiments testing my idea, I observed some phenomena I didn’t understand so I backtracked into gaining a better understanding of the hyper-parameters. Since January I’ve run well over a thousand experiments, where each experiment consisted of 4 runs. The process I use is: think of an experiment and what I expect to see, run the experiment, observe the results, and, when the results differ from my expectation, think until I gain an insight or understanding of what is going on. In answer to your question, this Technical Report took me about 3 months.
RS) There are aspiring data scientists and deep learning enthusiasts around the world, as I’ve discovered through fastai. What advice would you give to them? What are 3 things that would be helpful to anyone learning deep learning?
LS) Read, experiment, and think constantly. Read to be aware of what everyone else has already thought of, experiment to gain an intuition of how deep learning works, and think of new ideas or insights. Deep learning is considered a black box so your goal is to transform observation into a deep understanding of why deep learning works.
RS) Who are the researchers in algorithm development that you follow? What are your favorite resources in your field of work?
LS) There are lots of people in deep learning doing great work. Specifically, I will mention Yoshua Bengio, DeepMind, and Google AI research. They all do high quality work and I read almost all of the papers from these teams.
RS) What’s around the curve in your research? Does it get much better than CLR?
LS) I hope it gets better. Currently I am experimenting with an intriguing idea that allows training a deep network with only one image per class. I’ve just implemented the necessary code and my first experiment shows promise. I am planning to write a paper entitled “One shot, semi-supervised training of deep networks” and I hope to have it done in time to submit to the NIPS 2018 conference. The paper submission deadline is May 18 so I have a lot of work to do in the next few weeks.
RS) Algorithm accountability is a trending topic. What are your thoughts on this discussion?
LS) As an amateur on the subject, I personally see merit in accountability, but algorithm accountability is a legal question and I think the opinion of legal experts should carry the most weight. As a scientist, I strongly believe in reproducible research. I think github.com is an excellent resource, on par with arxiv.org.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
Smith, Leslie N., and Nicholay Topin. “Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates.” arXiv preprint arXiv:1708.07120 (2017).
Leslie N. Smith received combined BS and MS degrees in Physical Chemistry from the University of Connecticut in 1976 and a Ph.D. degree in chemical physics from the University of Illinois in 1979.
He is currently at the US Naval Research Laboratory in the Navy Center for Applied Research in Artificial Intelligence. His research focuses on deep learning, machine learning, computer vision, sparse modeling, and compressive sensing. Additionally, his interests include super resolution, non-linear dimensionality reduction, function approximation, and feature selection.
Reshama Shaikh is a data scientist/statistician and MBA with skills in Python, R and SAS. She worked for over 10 years as a biostatistician in the pharmaceutical industry. She is also an organizer of the meetup group NYU Women in Machine Learning and Data Science. She received her M.S. in statistics from Rutgers University and her M.B.A. from NYU Stern School of Business. Twitter: @reshamas