Our past Technical Chair discussed Erin LeDell's upcoming talk, "Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches," at MLconf Seattle, scheduled for May 20th.
Xavier Amatriain responded to Pedro Domingos with a tweet claiming that ensembling is the master algorithm. Would you agree?
EL) If there is one supervised learning algorithm that I’d consider a “master algorithm,” then yes, I would consider ensembling — particularly stacking, also known as “Super Learning” — to be this algorithm. The “best” supervised learning algorithm that exists now or is invented in the future can always be incorporated into a Super Learner to achieve better performance. That’s the power of stacking. I don’t believe that there’s a single algorithm that can consistently outperform all other algorithms on all types of data — we’ve all heard the saying that “there’s no free lunch”; that’s why ensembles are generally more useful than a single algorithm. The “No Free Lunch Theorem” is credited to David Wolpert, the inventor of stacking, incidentally.
The way Pedro Domingos describes the “master algorithm” is more than just a supervised learning algorithm — he talks about the importance of having a feedback loop, or a holistic learning system that includes machine learning algorithms at the center. Then there’s also Artificial General Intelligence (AGI), which is probably more of a master algorithm than anything else. For more information about the current state of AGI, look into the work of Ben Goertzel.
Does ensembling destroy model interpretability? A decision tree offers a nice path to explaining predictions, but a forest?
EL) Ensembles are inherently more complex and opaque than their singleton counterparts. However, anyone who cares about model performance will have a hard time justifying the use of a decision tree or linear model over an ensemble learner.
Some use-cases value model interpretability over model performance, and that’s where I’ve seen people make this trade-off. However, there is a lot of work going into black-box model interpretability and I am optimistic about some of the recent advancements in that area. For example, Local Interpretable Model-Agnostic Explanations (LIME) is a method for explaining the predictions of any classifier.
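The core idea behind LIME can be sketched in a few lines: perturb the instance you want to explain, query the black-box model on the perturbed points, and fit a proximity-weighted linear surrogate whose coefficients serve as the local explanation. This is a minimal illustration of the idea, not the actual `lime` package; the function name and kernel choice here are assumptions for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# A black-box model we want to explain (synthetic data stands in for a real problem).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def explain_instance(x, model, X_ref, n_samples=1000, kernel_width=1.0, seed=0):
    """Fit a locally weighted linear surrogate around x (the core LIME idea).

    This is a hypothetical helper for illustration, not the lime library API.
    """
    rng = np.random.default_rng(seed)
    scale = X_ref.std(axis=0)
    # Perturb the instance with Gaussian noise scaled to each feature.
    Z = x + rng.normal(scale=scale, size=(n_samples, x.shape[0]))
    # Query the black box on the perturbed neighborhood.
    preds = model.predict_proba(Z)[:, 1]
    # Weight perturbed points by their proximity to x (exponential kernel).
    dists = np.linalg.norm((Z - x) / scale, axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)
    # The weighted linear surrogate's coefficients are the local explanation.
    surrogate = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
    return surrogate.coef_

coefs = explain_instance(X[0], black_box, X)
print("Local feature weights:", np.round(coefs, 3))
```

The sign and magnitude of each coefficient indicate how much each feature pushed this particular prediction up or down, locally — which is exactly the kind of per-prediction explanation that a bare ensemble cannot provide on its own.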
What was the impact of your recent scalable implementation at H2O? What was the speedup versus the legacy implementation? What was the impact for your current customers?
EL) The H2O Ensemble project is a truly scalable ensemble learner because it is built on top of H2O, an open source, scalable, distributed machine learning platform. H2O Ensemble implements the stacking algorithm for big data.
I suppose the “legacy” stacking implementation is the SuperLearner R package. Other than Weka’s Stacking function, the SuperLearner and subsemble R packages contain the only proper stacking implementations that I’m aware of. The speed-up of H2O Ensemble over the SuperLearner package is essentially unlimited because there is no limit to the size of an H2O cluster. There are more details in my dissertation.
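For readers who want to experiment with stacking without a cluster, the same idea is also available in scikit-learn's StackingClassifier (used here purely as an illustration; it is not H2O Ensemble or the SuperLearner R package). Base learners generate out-of-fold predictions via cross-validation, and a metalearner is trained on those predictions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-outcome data stands in for a real problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A diverse set of base learners; their cross-validated (out-of-fold)
# predictions become the training features for the metalearner.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("gbm", GradientBoostingClassifier(random_state=42)),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # the metalearner
    cv=5,  # out-of-fold predictions, as in the Super Learner algorithm
)
stack.fit(X_train, y_train)
auc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])
print(f"Stacked ensemble test AUC: {auc:.3f}")
```

The single-machine version scales poorly because every base learner must be cross-validated on the full dataset; distributing that work across an H2O cluster is what removes the size limit described above.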
I think it’s fairly common for companies to deploy simple linear models into production because they’re the easiest algorithms to scale. Some of the more sophisticated teams deploy a GBM or Random Forest. This leaves a lot of model performance (and associated revenue) on the table. When model performance translates directly to revenue, the advantages of ensembling are very convincing.
Have you ever seen a case where ensembling was not a good choice?
EL) Yes, there are cases where I’ve been able to achieve equal or slightly better performance using a single Gradient Boosting Machine (GBM), for example, than with an ensemble containing that same GBM. This means that the GBM is able to approximate the true prediction function well, and that I haven’t done a good enough job of creating a diverse set of base learners for my ensemble, or that I’ve made a poor choice of metalearning algorithm for the Super Learner.
This situation is far less likely when using stacking (vs. some more basic ensembling scheme); however, it is important to acknowledge that it happens from time to time. In addition to the causes above, I’ve seen this happen when I’ve spent a lot of time fine-tuning one or more of the constituent algorithms, or when there is not enough training data.
How easy is it to jump from Biostatistics to Data Science? Is the reverse path harder?
EL) For me, there was no division between the two. As a Biostatistics PhD student, I used real-life, clinical datasets as my motivation for developing new algorithms and software, which is very close to what I do in industry now. Not all biostatisticians spend as much time developing software as I did, but I think there is a good overlap between the two fields. People who choose a Biostatistics PhD program over a Statistics PhD program may be more interested in developing methodology and software for messy real-life data and applications.
As for the reverse path, as long as you have a math/stats background, then I think the transition from Data Science to Biostatistics/Statistics would be easy.
Is the machine learning/data science hype stagnating life sciences by attracting the best talent? Is there a way to steer scientists into curing cancer rather than increase ad clicking?
EL) It’s true that a lot of data scientists come from a pure science background — there are many astronomers, physicists, and engineers who have left their respective fields for a career in data science. I have many friends with PhDs (from very good schools!) who can’t find a decent academic or industry job in their area of research. That is a shame, no doubt. The issue is not that the data science hype is causing these folks to leave their fields and enter data science — it’s that our country/society does not prioritize basic research enough to support research-based career paths. Rather than going to work at a consulting firm or applying for their third post-doc, data science is offering these folks a career using many of their highly valuable skills, such as data analysis and programming (although often in a different subject area).
I think how you apply your data science / machine learning skillset matters a great deal. Some people say that “data is just data” and therefore it doesn’t matter how they apply their skills, but whether that’s a belief you subscribe to or not — everyone is responsible for what they create in this world.
We are at a point where Data Scientists and Computer Scientists hold quite a bit of power and we have an opportunity to decide how and where we’d like to wield this power. Perhaps this will upset some people who work in this industry (or maybe they will even agree with me), but I believe that ad-click related jobs are a horrible waste of our collective human potential.
The healthcare industry, at least in the United States, is one of the least nimble and most technologically backward industries that we have. It is hard to innovate in this area, which might be a reason why smart, talented people stay away. On the other hand, we are starting to see new companies applying machine learning to healthcare in highly innovative and exciting ways. For example, Jeremy Howard’s Enlitic is applying deep learning to medical imaging. The more successful examples we have of this kind, the more data scientists will be inspired to leave their ad-click jobs for careers that lead to curing cancer.

Erin LeDell, Machine Learning Scientist, h2o.ai
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc.
Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from UC Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.