The winner of the 2016 MLconf Industry Impact Student Research Award, which is sponsored by Google, has been announced. Our committee reviewed several nominees and found Tianqi Chen’s research on XGBoost and MXNet to be the most impactful and the most interesting for future developments in industry.
Tianqi Chen is the winner of the 2016 MLconf Industry Impact Student Research Award! This announcement was made on Friday, November 11th, 2016 in San Francisco. Tianqi accepted via a video acceptance speech (available here).
In 2015, there were two winners of the award: Virginia Smith (UC Berkeley), who presented on November 11, 2016 at MLconf SF, and Furong Huang (UC Irvine), who presented at MLconf NYC in April 2016.
Tianqi has been invited to present his work on XGBoost in Seattle at MLconf in 2017. His advisor, Dr. Carlos Guestrin, has presented at MLconf numerous times as well.
Tianqi works at the intersection of learning and systems, and he has built many scalable learning systems. His research focuses on scalable boosted trees, most notably the XGBoost package, which is widely used in competitive ML and in industry for supervised learning problems (training on labeled data to predict a target variable) because it provides parallelized boosted trees that are both efficient and accurate. XGBoost runs in many distributed production environments, such as Hadoop, MPI, SGE, Flink, and Spark, and is available in many popular languages, including Python, R, Julia, Java, and Scala.
The framework constructs tree ensembles. Because it is hard to learn all of the trees at once, XGBoost uses an additive strategy: it trains one tree, then uses the information from that tree when adding the next. Beyond constructing the ensemble, the model also needs to be regularized. To do this, the model’s complexity is defined formally, which both regularizes the model and makes it easier to understand what is being learned. Regularization is the part most tree packages treat less carefully, or ignore altogether, because the traditional treatment of tree learning emphasized improving impurity while leaving complexity control to heuristics. By defining complexity formally, we understand it better, and it works well in practice: one can derive a structure score and a goodness-of-fit measure for the tree ensemble.
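As a rough illustration of how XGBoost is typically used (not taken from Tianqi’s own materials), the sketch below trains a small binary classifier with the XGBoost Python package. The data is synthetic and the parameter values are only examples, but the parameters shown are the ones that control the additive updates and the formal complexity penalty described above.

```python
import numpy as np
import xgboost as xgb

# Illustrative synthetic data; in practice this is any supervised dataset.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "max_depth": 4,    # caps how complex each individual tree can get
    "eta": 0.1,        # learning rate applied to each additive update
    "lambda": 1.0,     # L2 regularization on the leaf weights
    "gamma": 0.5,      # minimum loss reduction (structure-score gain) to split
}

# Each boosting round adds one more tree that corrects the current ensemble.
bst = xgb.train(params, dtrain, num_boost_round=50)
preds = bst.predict(dtrain)
```

Increasing gamma or lambda makes the learned trees simpler, which is exactly the formal complexity control described above.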
Tianqi is also well known for his contributions to MXNet. MX stands for mix and minimize: the framework includes a dynamic dependency scheduler that automatically parallelizes both declarative and imperative operations. At the heart of MXNet is NNVM, an intermediate layer much like LLVM; the abstraction NNVM provides enables several just-in-time code optimizations that significantly boost performance. MXNet is widely recognized as a competitor to TensorFlow, and it has been heavily invested in by Amazon.
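A minimal sketch of this “mixed” programming style, assuming the classic MXNet 1.x Python API (the tiny network and shapes are invented for illustration):

```python
import mxnet as mx

# Imperative style: NDArray operations run eagerly, much like NumPy.
a = mx.nd.ones((2, 3))
b = a * 2 + 1
print(b.asnumpy())

# Declarative style: describe a symbolic graph first, execute it later.
data = mx.sym.Variable("data")
fc = mx.sym.FullyConnected(data=data, num_hidden=16)
act = mx.sym.Activation(data=fc, act_type="relu")

# The graph can be analyzed (and optimized) before any computation happens.
arg_shapes, out_shapes, _ = act.infer_shape(data=(4, 10))
print(out_shapes)  # [(4, 16)]
```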
Bio:
Tianqi holds a bachelor’s degree in Computer Science from Shanghai Jiao Tong University, where he was a member of the ACM Class, now part of Zhiyuan College at SJTU. He completed his master’s degree at Shanghai Jiao Tong University in the Apex Data and Knowledge Management group before joining the University of Washington as a PhD student. He has held several prestigious internships and visiting positions, including at Google on the Brain team, at GraphLab authoring the boosted tree and neural net toolkits, at Microsoft Research Asia in the Machine Learning Group, and at the Digital Enterprise Research Institute in Galway, Ireland. What really excites Tianqi is what processes and goals can be enabled when we bring advanced learning techniques and systems together. He pushes the envelope on deep learning, knowledge transfer, and lifelong learning. His PhD is supported by a Google PhD Fellowship.
Books On Display At MLconf San Francisco
Morgan & Claypool:
- Active Learning
- Algorithms for Reinforcement Learning
- Analyzing Analytics
- Automatic Detection of Verbal Deception
- Bayesian Analysis in Natural Language Processing
- Essentials of Game Theory: A Concise Multidisciplinary Introduction
- General Game Playing
- Graph Mining: Laws, Tools, and Case Studies
- Graph-Based Semi-Supervised Learning
- Human Computation
- Introduction to Intelligent Systems in Traffic and Transportation
- Introduction to Semi-Supervised Learning
- Learning to Rank for Information Retrieval and Natural Language Processing, Second Edition
- Learning with Support Vector Machines
- Linguistic Structure Prediction
- Metric Learning
- Mining Latent Entity Structures
- Modeling and Data Mining in Blogosphere
- Natural Language Processing for Social Media
- Probabilistic Approaches to Recommendations
- Robot Learning from Human Teachers
- Semi-Supervised Learning and Domain Adaptation in Natural Language Processing
- Sentiment Analysis and Opinion Mining
- Statistical Language Models for Information Retrieval
- Statistical Relational Artificial Intelligence
- Syntax-based Statistical Machine Translation
- Trading Agents
Use the code “MLCON16” to save 20% through December 31, 2016: http://store.morganclaypool.
Cambridge University Press:
- Agarwal & Chen, Statistical Methods For Recommender Systems
- Bennett & Hugen, Financial Analytics with R
- Efron & Hastie, Computer Age Statistical Inference
- Flach, Machine Learning
- Fouss et al, Algorithms and Models for Network Data and Link Analysis
- Guenin et al, A Gentle Introduction to Optimization
- Leskovec et al, Mining of Massive Data Sets
- Warwick & Shah, Turing’s Imitation Game
Additional Machine Learning & Artificial Intelligence Books on Display
- How to Create a Mind: The Secret of Human Thought Revealed, Kurzweil, Ray
- The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World, Domingos, Pedro
- Overcomplicated: Technology at the Limits of Comprehension, Arbesman, Samuel
- Artificial Intelligence Simplified: Understanding Basic Concepts, George, Dr Binto
- Rise of the Robots: Technology and the Threat of a Jobless Future, Ford, Martin
- Superintelligence: Paths, Dangers, Strategies, Bostrom, Nick
- Our Final Invention: Artificial Intelligence and the End of the Human Era, Barrat, James
- The Age of Spiritual Machines: When Computers Exceed Human Intelligence, Kurzweil, Ray
- Artificial Intelligence: The Basics, Warwick, Kevin
- The Singularity Is Near: When Humans Transcend Biology, Kurzweil, Ray
- The Inevitable: Understanding the 12 Technological Forces That Will Shape Our Future, Kelly, Kevin
- Our Robots, Ourselves: Robotics and the Myths of Autonomy, Mindell, David A.
MLconf Atlanta Recommended Academic Papers
Hussein Mehanna, Director of Engineering, Facebook:
- https://arxiv.org/abs/1502.01710
Patrick Koch, Principal Data Scientist, and Funda Gunes, Sr. Research Statistician Developer, SAS Institute Inc:
1. Bottou, L., Curtis, F. E., Nocedal, J., Optimization Methods for Large-Scale Machine Learning, arXiv:1606.04838 [stat.ML], 2016.
2. Sutskever, I., Martens, J., Dahl, G. and Hinton, G. E., On the importance of initialization and momentum in deep learning, In Proceedings of the 30th International Conference on Machine Learning (ICML-13), Atlanta, GA, pp. 1139–1147, June 2013.
3. Bergstra, J. and Bengio, Y., Random Search for Hyper-Parameter Optimization, J. Machine Learning Research, 13: 281–305, 2012.
4. Sparks, E. R. , Talwalkar, A., Haas, D. , Franklin, M. J., Jordan, M. I., and Kraska, T., Automating Model Search for Large Scale Machine Learning, Proceedings of the Sixth ACM Symposium on Cloud Computing, August 27-29, 2015, Kohala Coast, Hawaii.
5. Local Search Optimization, SAS/OR®
6. SAS® Viya™ Distributed Analytics Platform
Dr. Le Song, Assistant Professor, College of Computing, Georgia Institute of Technology:
- H. Dai, Y. Wang, R. Trivedi and L. Song. Recurrent Coevolutionary Feature Embedding Processes for Recommendation, RecSys Workshop on Deep Learning for Recommendation Systems, 2016. (BEST PAPER) PDF: http://arxiv.org/pdf/1609.03675.pdf
- H. Dai, B. Dai and L. Song. Discriminative Embeddings of Latent Variable Models for Structured Data, International Conference on Machine Learning (ICML), 2016. PDF: https://arxiv.org/pdf/1603.05629.pdf
- Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M., and Song, L. Scalable Kernel Methods via Doubly Stochastic Gradients, Neural Information Processing Systems (NIPS 2014). PDF: https://arxiv.org/pdf/1407.5599.pdf
Great Machine Learning and Data Science Books to be on Display at MLconf Atlanta
We’re so grateful to the participating publishers who are sending books to be displayed and given away at MLconf Atlanta on Friday! We’re also displaying and giving away a collection of relevant machine learning books. Check them out!
To Win Books: participate in our event-day Twitter contest. The most interesting and unique tweets will be rewarded with free ML books! Make sure to mention @mlconf and #MLATL to win!
CRC Press:
*For more details, go to their virtual booth: https://www.crcpress.com/go/MLconf2016
- A First Course in Machine Learning, Second Edition
- Text Mining and Visualization: Case Studies Using Open-Source Tools
- Handbook of Big Data
- Accelerating Discovery: Mining Unstructured Information for Hypothesis Generation
- Statistical Learning with Sparsity: The Lasso and Generalizations
- Statistical Reinforcement Learning: Modern Machine Learning Approaches
- High Performance Parallel I/O
- Sparse Modeling: Theory, Algorithms, and Applications
- Computational Trust Models and Machine Learning
- Regularization, Optimization, Kernels, and Support Vector Machines
- Big Data and Social Science: A Practical Guide to Methods and Tools
Cambridge University Press:
- Agarwal/Chen, Statistical Methods for Recommender Systems
- Braun/Murdoch, A First Course in Statistical Programming with R
- Efron/Hastie, Computer Age Statistical Inference
- Flach, Machine Learning
- Fouss, Algorithms and Models for Network Data and Link Analysis
- Leskovec et al, Mining of Massive Data Sets
Springer Publishing:
Additional Machine Learning Books on Display:
- The Seven Pillars of Statistical Wisdom, Stigler, Stephen M.
- Algorithms to Live By: The Computer Science of Human Decisions, Christian, Brian
- Overcomplicated: Technology at the Limits of Comprehension, Arbesman, Samuel
- Naked Statistics: Stripping the Dread from the Data, Wheelan, Charles
- The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World, Domingos, Pedro
- Data Science from Scratch: First Principles with Python, Grus, Joel
- The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, McGrayne, Sharon Bertsch
- Think Bayes, Allen B. Downey
- How to Create a Mind: The Secret of Human Thought Revealed, Kurzweil, Ray
- Superforecasting: The Art and Science of Prediction, Tetlock, Philip E.
- The End of Average: How We Succeed in a World That Values Sameness, Rose, Todd
Interview with Hussein Mehanna, Engineering Director – Core ML, Facebook
Our past Technical Chair interviewed Hussein Mehanna, Engineering Director – Core ML, Facebook, regarding his upcoming presentation, Applying Deep Learning at Facebook Scale, scheduled for 09/23/16 at MLconf Atlanta.
One of the criticisms of deep learning models has been the complexity of inference. In your talk you will explain how you reduced inference time. Does this mean shallow models no longer have any advantage over deep models?
HM) No, I don’t think so. In fact, one of the tricks growing in popularity these days is using deep, expensive models to learn and then using those to teach shallow models (dark knowledge transfer). There seems to be a theory that during learning you need more capacity and complexity, but that can be reduced at inference. In fact, at times it even improves accuracy. So I think shallow models will stay.
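As a purely illustrative sketch of the teacher–student idea Hussein describes (this is the standard knowledge-distillation loss in plain NumPy, not Facebook’s implementation; the numbers are made up): soften the deep teacher’s outputs with a temperature and train the shallow student against a blend of those soft targets and the true labels.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical logits for a batch of 2 examples and 3 classes.
teacher_logits = np.array([[4.0, 1.0, 0.2], [0.1, 3.5, 0.4]])
student_logits = np.array([[2.0, 1.5, 0.3], [0.2, 2.0, 1.0]])
hard_labels = np.array([0, 1])

T = 4.0  # temperature softens the teacher's distribution
soft_targets = softmax(teacher_logits, T)
student_soft = softmax(student_logits, T)
student_hard = softmax(student_logits, 1.0)

# Cross-entropy against the softened teacher, blended with
# the usual cross-entropy against the true labels.
soft_loss = -(soft_targets * np.log(student_soft)).sum(axis=1).mean()
hard_loss = -np.log(student_hard[np.arange(2), hard_labels]).mean()
loss = 0.7 * soft_loss + 0.3 * hard_loss
```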
Do we need to compromise accuracy to make deep learning inference fast?
HM) Actually, not necessarily; at times it may even improve generalization, since complex models overfit. That said, figuring out how to reduce the computational load of a model is still non-trivial. Making this simpler and more automatic is something that will help the industry.
TensorFlow, Torch, Theano, MXNet, CNTK… Can you help us survive the Babel of deep learning platforms? Can you help the MLconf audience decide what to choose, or how to choose?
HM) Yes, I probably can. That said, the MLconf audience should feel happy, because diversity increases the chances that they get tools closer to their needs. We are in a creative-chaos phase in AI, but things are converging.
I have noticed that some deep learning platforms are good with dense data and others with sparse data. At Facebook you are dealing with both. How did you manage to unify both under one platform?
HM) Good question – I will need to check with our legal team before I answer that. All I can say now is that we treat both as first-class citizens, and we are investing in algorithms that operate in the intersection. This is majorly beneficial for sparse scenarios, since traditional deep learning has been dense-focused, as it’s easier to get hold of images than social data.
You implemented deep learning at scale. What is the gap between theory and practice? What are the tricks that make the difference that you don’t find written in a paper?
HM) That’s a fantastic question. Any ML algorithm is really dependent on the data; if you change the data, you change the problem completely. That’s the biggest difference, in my opinion, between academia and industry. It makes a lot of sense for academia to standardize its datasets, but most of those don’t represent what industry uses. Think about the intersection in data between the ImageNet dataset and a system that needs to recognize consumer products; the two are probably very different. The other difference is that industrial systems receive continuous improvements that accumulate over time, so baselines in industry are much more tuned.
What was the most surprising fact that you have discovered about deep learning? Can you share a paper with us that had a great influence on you?
HM) I am going to seem a bit biased towards Facebook AI Research, but I adore the character-level deep learning work for NLP. The fact that you can learn from raw textual input with no preprocessing, as you would with images, is just extremely powerful. In my early college days, I just could not bear all the special rules that riddled NLP, and I always believed there was a better solution. This paper provides a good basis for that; we now have more sophisticated work in the team, but that paper was a great start.
Hussein Mehanna, Engineering Director – Core ML, Facebook
I am the Director of the Core Machine Learning group at Facebook. Our team focuses on building state-of-the-art ML/AI platforms combined with applied research in event prediction and text understanding. We work closely with product teams in Ads, Feed, Search, Instagram and others to improve their user experiences.
In 2012, I joined Facebook as the original developer of the Ads ML platform. That quickly developed into a Facebook-wide platform serving more than 30 teams. Prior to Facebook, I worked at Microsoft on search query alterations and suggestions in Bing and on communication technologies in Lync. I hold a master’s degree in Speech Recognition from the University of Cambridge, UK, where I worked on noise robustness modeling.
Interview with Dr. Le Song, Assistant Professor in the College of Computing, Georgia Institute of Technology
Our past Technical Chair interviewed Dr. Le Song, Assistant Professor in the College of Computing, Georgia Institute of Technology, regarding his upcoming presentation, Understanding Deep Learning for Big Data, scheduled for 09/23/16 at MLconf Atlanta.
You have done a lot of work on kernel methods. A year ago there was an indication that kernel methods are not dead and could match or outperform deep nets. Is that the case? Is it time for them to retire?
LS) I think kernel methods and deep learning are cousins. The field needs to combine them rather than throwing either away.
In fact, they share lots of similarities; for example, both try to learn nonlinear functions. One can design kernel functions to capture problem structure, just as one can choose the architecture of a deep learning model according to the problem at hand. More interestingly, the feature maps of kernel functions have a one-to-one correspondence with the activation units in deep learning. For instance, the arc-cosine kernel is an infinite combination of rectified linear units. Hence researchers also call kernel methods infinite neural networks.
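For the curious reader, the “infinite ReLU layer” correspondence can be made concrete with the order-1 arc-cosine kernel of Cho and Saul; the formula below is the standard published form, added here for illustration rather than quoted from the interview.

```latex
% Order-1 arc-cosine kernel: the kernel of an infinitely wide layer of ReLU units,
% where \theta is the angle between the inputs x and y.
k_1(x, y) = \frac{1}{\pi}\,\lVert x\rVert\,\lVert y\rVert
            \bigl(\sin\theta + (\pi - \theta)\cos\theta\bigr),
\qquad
\theta = \arccos\!\left(\frac{x^\top y}{\lVert x\rVert\,\lVert y\rVert}\right)
```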
The main algorithmic difference between kernel methods and deep learning is that the parameters of the kernels are typically fixed and only the classifiers are learned, while the parameters of the activation units in deep learning are learned together with the classifiers. This difference allows deep learning models to be more flexible given the same number of parameters. In the face of big data, this also means that deep learning is more scalable given the same model flexibility. Very few works have compared deep learning seriously with kernel methods in the big data regime, simply because nobody knows how to scale up kernel methods without sacrificing model flexibility. We did some fundamental work on scaling up kernel methods, and we found matching performance between kernel methods and deep learning. Especially nowadays, when researchers talk about learning the kernel functions, the line between kernel methods and deep learning really blurs.
The main theoretical difference between kernel methods and deep learning is that there is an almost complete set of theory to explain the behavior of kernel methods, while there is almost no satisfactory theory to explain the behavior of deep learning. When researchers talk about kernel methods, they mean not just a set of algorithms, but also the set of unique tools and theories developed to analyze and provide guarantees for kernel methods. Researchers, including us, are working on explaining the generalization ability of deep learning models. In fact, we are now working on using tools from kernel methods to understand the generalization ability of deep learning models.
I think the next big thing will be kernel methods + deep learning + graphical models.
Deep nets have a high information capacity compared with linear models. If the data is noisy, then all this capacity gets filled with noise. Are there any practical methods for deciding when it is worth using a deep net versus a simpler model?
LS) Use cross-validation.
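For readers who want something concrete, here is one way to act on that advice: a minimal scikit-learn sketch (our own illustrative tooling choice, not something Dr. Song prescribed) that compares a linear model and a small neural network by cross-validated score on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Illustrative data; in practice use your own (possibly noisy) dataset.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)

simple = LogisticRegression(max_iter=1000)
deep = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)

# Pick whichever model cross-validates better on held-out folds.
for name, model in [("logistic regression", simple), ("small neural net", deep)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```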
Given the success of deep nets, data scientists and researchers always wonder if it is time to give up traditional ML. Is there a reason why we should keep investing in other schools of ML, like Bayesian modeling, graphical models, etc.?
LS) Definitely, we should keep investing in other schools of ML, like Bayesian modeling and graphical models, and we should keep teaching students Bayesian modeling and graphical models. Beyond the possibility of new and better methods coming out of these other areas, they can also inspire deep learning models. We recently invented a completely new deep learning architecture based on the belief propagation algorithm from graphical models. Our method is now the best method for learning representations for discrete structures, such as taxonomies, drug molecules and social networks.
Given all this research in deep learning, there seems to be a plethora of architectures. Is there a map or a guideline for choosing the right architecture? Do we just have to follow the rule “do whatever Google does”?
LS) A good architecture for a particular problem is heavily based on domain knowledge and previous successful models. For instance, convolutional neural networks are based on earlier image feature extractors such as pyramid features, and recurrent neural networks are based on earlier successful sequence models such as hidden Markov models. Google definitely has more computing resources to blindly search for a better architecture for a set of canonical problems, but your problem may be different from those canonical problems.
Neural nets have been there for decades. Why did it take so long to blast off? What do you think people missed in the 80s?
LS) I think neural networks are just more scalable than other nonlinear models out there. For big data, you need both flexibility and scalability. Neural networks have both.
Apart from the deep learning architecture labyrinth, scientists also have to navigate the platform labyrinth (TensorFlow, MXNet, Theano, Torch, Keras, etc.). Do you have any preference or advice?
LS) I prefer TensorFlow and MXNet. You can program in Python and easily interface with other programs. MXNet is very efficient and faster than TensorFlow. MXNet also has multi-GPU and multi-machine automatic parallelization at high speed.
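As a small illustration of the multi-device point (assuming the classic MXNet Module API; the toy network and data are invented for the example):

```python
import mxnet as mx
import numpy as np

# Toy data and a tiny symbolic network, just to make the sketch complete.
X = np.random.randn(128, 10).astype("float32")
y = np.random.randint(0, 2, size=(128,)).astype("float32")
train_iter = mx.io.NDArrayIter(X, y, batch_size=32)

data = mx.sym.Variable("data")
net = mx.sym.FullyConnected(data=data, num_hidden=2)
net = mx.sym.SoftmaxOutput(data=net, name="softmax")

# On a multi-GPU machine you would pass context=[mx.gpu(0), mx.gpu(1)];
# MXNet then splits each batch across the listed devices automatically.
module = mx.mod.Module(symbol=net, context=[mx.cpu()])
module.fit(train_iter, num_epoch=2, optimizer="sgd",
           optimizer_params={"learning_rate": 0.1})
```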
Dr. Le Song, Assistant Professor in the College of Computing, Georgia Institute of Technology
Le Song is an assistant professor in the College of Computing, Georgia Institute of Technology. He received his Ph.D. in Machine Learning from the University of Sydney and NICTA in 2008, and then conducted his post-doctoral research in the Department of Machine Learning, Carnegie Mellon University, between 2008 and 2011. Before joining Georgia Institute of Technology, he was a research scientist at Google. His principal research direction is machine learning, especially nonlinear methods and probabilistic graphical models for large-scale and complex problems arising from artificial intelligence, social network analysis, healthcare analytics, and other interdisciplinary domains. He is the recipient of the NSF CAREER Award (2014), the AISTATS 2016 Best Student Paper Award, the IPDPS 2015 Best Paper Award, the NIPS 2013 Outstanding Paper Award, and the ICML 2010 Best Paper Award. He has also served as an area chair for leading machine learning conferences such as ICML, NIPS and AISTATS, and as an action editor for JMLR.