Our past Technical Chair interviewed Beverly Wright, Executive Director of the Business Analytics Center at the Georgia Institute of Technology.
There is a lot of talk about using machine learning for different business applications. Why has this become such a popular topic?
BW) There are several reasons why I suspect we’re seeing such an increase in machine learning for business. Of the many reasons for ML’s growth in popularity, the one I’d point to first is data:
We’ve seen an incredible increase in the amount of data we can capture and use. We can’t efficiently explore that data through manual processes, but we know that insights are buried within it. Machine learning can help guide us earlier in the analytics lifecycle, helping us figure out our business questions and start to give us a sense of direction.
Many companies are talking about machine learning, but usage appears to happen at different rates, levels, and intensities. Why are the adoption rates different?
BW) On the surface, using machine learning to improve business decision-making seems like a no-brainer. You might wonder why all companies haven’t adopted this approach full force. But before any new technique, process, system, or even mindset is adopted, several challenges and boundaries need to be overcome. Adoption rates most likely vary for a number of reasons: cultural acceptance, willingness to use data to drive decisions, technical expertise, and the ability to understand and interpret results in relevant ways, among many other factors. What might seem like an easy and logical decision from the outside could take a long time to implement.
Where is machine learning going for business? Is this a fad, or will we see more of it? What do you think the future holds for machine learning in business applications?
BW) I think we’ll see more machine learning applications in business, particularly as large innovative companies have improved and proven the value. The diversity of data structures, coupled with the increased volumes, plus the need for real-time answers, all beg for new ways of harnessing and using data for business decisions.
Dr. Beverly Wright leads the Business Analytics Center at Georgia Institute of Technology’s Scheller College of Business. Beverly brings over twenty years of marketing analytics and insights experience from corporate, consulting, and academic roles. In her consultative roles for both nonprofits and for-profit businesses, she has solved critical issues through the use of modeling and advanced analytics. Her academic experience spans over a decade, with a strong emphasis on community engagement and experiential learning. She has also worked within, and led, marketing analysis departments at several companies.
Beverly earned a PhD in Marketing Analysis, a Master of Science degree in Analytical Methods, and a Bachelor of Business Administration degree in Decision Sciences from Georgia State University. She has also received a Professional Research Certification from the Marketing Research Association and CAP certification from INFORMS. Dr. Beverly Wright regularly presents at professional and academic conferences and publishes articles in various business journals.
Interview with Navdeep Gill, Data Scientist, H2O.ai
Our past Technical Chair interviewed Navdeep Gill about his thoughts on various machine learning techniques, debugging models, and visualizing big, high-dimensional data.
What is the challenge in debugging a machine learning model?
NG) Currently, traditional software programming uses Boolean-based logic that can be tested to confirm that the software does what it was designed to do, using tools and methodologies established over the last few decades. In contrast, machine learning is essentially a black box programming method in which computers program themselves with data, producing probabilistic logic that diverges from the true-and-false tests used to verify systems programmed with traditional Boolean logic methods. The challenge here is that the methodology for scaling machine learning verification up to a whole industry is still in progress. We have some clues for how to make it work, but we don’t have the decades of experience that we have in developing and verifying “regular” software.
Instead of traditional test-suite assertions that respond with true, false, or equal, machine learning test assertions need to respond with assessments. For example: the results of today’s experiment had an AUC of 0.95 and are consistent with tests run yesterday. Another challenge is that machine learning systems trained on data produced by humans, with their inherent human biases, will duplicate those biases in their models. A method of measuring what these systems do compared to what they were designed to do is needed in order to identify and remove bias.
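To make the idea concrete, here is a minimal sketch of an assessment-style test using scikit-learn; the dataset, model, baseline AUC, and tolerance are illustrative stand-ins, not any established H2O.ai methodology:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

BASELINE_AUC = 0.95   # hypothetical AUC recorded from yesterday's run
TOLERANCE = 0.02      # how much drift we tolerate before flagging

# Illustrative data and model, standing in for a real training pipeline.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# The "assertion" is an assessment against history, not a true/false check.
drift = abs(auc - BASELINE_AUC)
status = "consistent with yesterday" if drift <= TOLERANCE else "flagged for review"
print(f"AUC {auc:.3f}: {status} (drift {drift:.3f})")
```

The point of the sketch is the shape of the test: rather than asserting exact equality, it reports whether today’s metric stayed within an agreed band around its history.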
Furthermore, traditional software is modular, which lends itself to isolating the inputs and outputs of each module to identify which one has the bug. In machine learning, however, the system has been programmed with data, so any bug will be replicated throughout the system: changing one thing changes everything. There are techniques for detecting that there is an error, and there are methods for retraining machine learning systems, but there is no way to fix a single isolated problem, which is another huge challenge in itself.
To paraphrase, a better set of tools is needed. The entire tool set needs to be updated to move forward. People in industry, including at H2O.ai, are working on this exact problem to ensure the models that are built in production are as accurate as possible and produce results that are aligned with the real world.
What are the limitations of explaining a model with histograms and scatterplots?
NG) The limitations come from the dimensionality of datasets and the dimensionality built up during model building. Most machine learning models express interesting interactions among features in an n-dimensional space. Visualizing the effect of one or two variables on an outcome is simple; once we go beyond two dimensions, however, we hit a problem with interpretability and approach the infamous curse of dimensionality. For example, tree models and deep learning involve a tremendous number of interactions between features, and these interactions only become more complex as the feature space grows. This poses a huge hurdle for scaling out to traditional visualizations, and thus increases the difficulty of explaining a machine learning model using traditional visualization methodology.
Is TensorBoard a step forward in visual debugging of a model?
NG) Yes, I believe so. TensorBoard allows you to visualize the graph itself, which is quite helpful for most users of TensorFlow. When building and debugging new models, it is easy to get lost in the weeds. For me, holding a mental context for a new framework and a model I’m building to solve a hard problem is already pretty taxing, so it can be really helpful to inspect a totally different representation of the model; the TensorBoard graph visualization is great for this.
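As a rough illustration, here is a minimal TensorFlow 1.x-style sketch that writes a toy graph to disk so TensorBoard can render it; the graph itself is a made-up example:

```python
import tensorflow as tf

# Build a tiny example graph: a linear transform of a 10-feature input.
x = tf.placeholder(tf.float32, shape=[None, 10], name="x")
w = tf.Variable(tf.random_normal([10, 1]), name="w")
y = tf.matmul(x, w, name="y")

# Write the graph definition to a log directory for TensorBoard to read.
with tf.Session() as sess:
    writer = tf.summary.FileWriter("./logs", sess.graph)
    writer.close()

# Then launch the UI with: tensorboard --logdir ./logs
```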
What is the next step for visualizing machine learning?
NG) The next step involves building visualizations that can be used to educate, demonstrate business value, and be used interactively for all machine learning purposes. A great example of this is R2D3, a product by Tony Chu from H2O.ai. This visual tutorial walks you through a machine learning use case, explaining each step along the way with insightful visuals that can pique the interest of anyone in the field of machine learning. These kinds of visualizations can also be built out to show stakeholders the business value of almost any business problem that requires machine learning. In addition, they can help a resident data scientist get insight into their data and machine learning models with ease.
What is the solution to visualizing big, high-dimensional data?
NG) Two possible solutions to this problem are dimensionality reduction and feature selection. Dimensionality reduction involves taking your feature space and projecting it down to a lower-dimensional space, which can help visualize huge datasets. Some examples are principal component analysis (PCA), multidimensional scaling, and t-SNE, to name a few.
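As a quick sketch of the idea, here is how one might project a 64-dimensional dataset down to two dimensions for plotting with scikit-learn; the digits dataset and hyperparameters are illustrative choices:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# The digits dataset lives in a 64-dimensional feature space.
X, y = load_digits(return_X_y=True)

# Project to 2-D with a linear method (PCA) and a nonlinear one (t-SNE).
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=8)
ax1.set_title("PCA projection")
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, s=8)
ax2.set_title("t-SNE embedding")
plt.show()
```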
Feature selection allows you to choose the “best” features from your data, which you can then visualize. “Best” could be defined based on your cost function or on variable importance. For example, you could visualize the five features that are most important to your model, or you could use some criterion such as Weight of Evidence or Information Value to choose the “most important” variables. Beyond that, one can also use visual techniques such as parallel coordinates, scatterplot matrices, glyph plots, Andrews plots, or arc diagrams.
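Here is a minimal sketch of the “top five features” idea, using a random forest’s variable importances as the ranking criterion; the dataset and model are illustrative stand-ins:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# Fit a model whose variable importances we can rank features by.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, data.target)

# Keep the five most important features; these are the ones to visualize.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.nlargest(5))
```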
Navdeep Gill is a Data Scientist at H2O.ai. He graduated from California State University, East Bay with an M.S. in Computational Statistics, a B.S. in Statistics, and a B.A. in Psychology (with a minor in Mathematics). During his education he discovered an interest in machine learning, time series analysis, statistical computing, data mining, and data visualization.
Prior to H2O.ai, Navdeep worked at several startups and at Cisco Systems, focusing on data science, software development, and marketing research. Before that, he was a consultant at FICO, working with small- to mid-size banks in the U.S. and South America on risk management across different bank portfolios (car loans, home mortgages, and credit cards).
This Week in Data by Braxton McKee (CEO, Ufora) & MLconf Alumni Speaker, Issue #8
Kick Off:
A major enterprise software company CEO said this in 2008: “The interesting thing about cloud computing is that we’ve redefined cloud computing to include everything that we already do. I can’t think of anything that isn’t cloud computing with all of these announcements. The computer industry is the only industry that is more fashion-driven than women’s fashion. Maybe I’m an idiot, but I have no idea what anyone is talking about. What is it? It’s complete gibberish. It’s insane. When is this idiocy going to stop?”
That was Larry Ellison, CEO of Oracle, in 2008. Fast-forward eight years: this week Oracle bought NetSuite, a company with cloud computing throughout its DNA. It’s a signal of how cloud computing is remaking enterprises from top to bottom.
Of course, these days it’s not just about putting data in the cloud; it’s about using it in a smart way once it’s there. Even sophisticated companies can take a while to really understand how to use the cloud (apparently it took Oracle years!). I see a lot of companies letting their in-house infrastructure team lead the effort to move to the cloud. This doesn’t produce the best results, however, because the cloud is fundamentally different from that super-expensive datacenter you’ve been paying for: it’s agile and self-service. The whole point is to let your data scientists and analysts get access to computing power as soon as they need it. So if you’re moving to the cloud, I strongly suggest empowering your data scientists to manage their own access to it. I’ve seen several companies through this and am happy to talk you through it if your firm is making the leap.
In the News:
You hear a lot about people jurisdiction-hunting for tax havens and places that will let them set up secretive shell companies. Well, the same sort of thing is now happening with data storage. Companies like Microsoft have been in court arguing that information they store is out of reach of the U.S. government based on where it is stored. Microsoft recently convinced a U.S. court that data it holds in Ireland is outside U.S. jurisdiction. Whether this ruling is good or bad for privacy remains a hot debate in the industry, with a prominent privacy advocate saying this week that the ruling creates data borders that could be harmful.
—
The data integration company Talend IPO’d this week, joining other publicly traded data companies like Teradata. It’s a sign that this part of the tech industry is maturing beyond strictly start-up form. Teradata also made news with an acquisition of a UK data company.
In Industry:
For years, people in finance have been buzzing about the possibility of using data from around the web for a sort of sentiment analysis that will lead to good stock picks. One company, Kavout, is now merging that sort of information with traditional fundamental datapoints in an algorithm it says leads to better outputs.
Interestingly, Kavout says it’s starting to use “deep learning” to find new trading strategies. I hear this idea coming up more and more, and I’m curious to see how it plays out. I’ll admit I’m a little skeptical: deep learning has made huge strides in voice and image recognition, but the technique requires a huge amount of data to work. As one researcher argues here, if we looked at how much data is used in other successful applications of deep learning and translated that into stocks, it would be the equivalent of hundreds of thousands of years of data (which we obviously don’t have).
—
Amazon keeps knocking it out of the park on its earnings, and a lot of the reason is its cloud-computing platform, where scores of companies like mine store their data. But Amazon’s Prime program is also adding substantially to its earnings, and the FT said this week that its use of customer data in Prime is a big part of its success.
Quirky Corner:
Most people try to be fairly open-minded in tech, but there are still some taboo questions. Like sex robots. MIT Technology Review has a full list.
—
Apple announced on Wednesday that it has sold 1 billion iPhones. How does that compare to the number of PlayStations, Harry Potter books, and “Thriller” albums sold? There’s an interesting compilation of best sellers across a variety of categories here.
—
Ray Kurzweil, my favorite fount of crazy futurist predictions, gave a keynote speech at a Seattle mobile technology conference where he argued that we are living in by far the best time in human history. His point: we have more access to data about what’s going on in the world than we ever have before, and as our access to information about bad events increases, our perception is that things are getting worse, even though they are actually improving. Personally, I think it’s the job of the data industry to help us move beyond sensationalism and really understand what the mass of data is telling us.
Braxton McKee is the technical lead and founder of Ufora, a software company that has built an adaptively distributed, implicitly parallel runtime. Before founding Ufora with backing from Two Sigma Ventures and others, Braxton led the ten-person MBS/ABS Credit Modeling team at Ellington Management Group, a multi-billion dollar mortgage hedge fund. He holds a BS (Mathematics), MS (Mathematics), and M.B.A. from Yale University.
This Week in Data by Braxton McKee (CEO, Ufora) & MLconf Alumni Speaker, Issue #7
Kick Off:
“Overcomplicated.” Do you ever feel that way about our increasingly tech-driven world? That’s the name of a new book I read this week by a complexity scientist named Samuel Arbesman. It’s a good read, accessible to anyone. Beyond giving good explanations of basic programming concepts like recursion (when code refers to itself), it walks through scores of past examples of systems breaking down without anyone understanding why.
The reason many of our systems have become too complicated for any human to understand is that their creators fiddle with them until they work, without quite understanding why they work in the end. I can’t think of a better example of this than the new algorithms underpinning the AI resurgence: deep learning. These new systems are grown and trained, not designed, so they can have really unexpected behaviors outside of the ones they were intended for. I don’t think we should be overly alarmist about all this, but it’s important to understand as we plow forward with advancements in this area.
In the News:
Workday acquired the big data company Platfora, which built a data analytics platform on top of Hadoop. These sorts of acquisitions show that an entire mini-economy has been built around Hadoop, an open-source storage and processing platform for large data sets. I would caution people against getting more wrapped up in Hadoop, though, because more advanced open-source systems, like Spark, have come along. This tends to happen with industry and computing: as soon as industry figures out a big change and adapts, the developers are onto something much better.
There is so much data out there now that researchers are constantly trying to combine data sets to look for causal effects. But that can’t always be done accurately. Some computer scientists at UCLA and Purdue have developed a mathematical tool called a structural causal model, which figures out how information from one source can be combined with data from another. As they explained, it’s “like putting together a jigsaw puzzle using pieces that were produced by different manufacturers.” Technical write-up here; layperson’s write-up here. Cool stuff.
More new hardware out from Nvidia. The chip maker continues to invest in new GPUs. It’s well known that GPUs, or graphics processors, have been a primary catalyst of the resurgence in AI. But I was surprised to hear that Nvidia announced this GPU at an AI meetup, and that the new chip has specific instructions designed for deep learning. It’s a sign that AI may come to dominate the future of graphics processing units more than their early driver, video gaming. At what point do we start calling it an AIPU?
Autopilot saved a pedestrian’s life in DC this week. People can worry all they want about self-driving cars, but the push toward them is producing technology that will benefit society even if we keep drivers behind the wheel.
In Industry:
It is super prestigious to become a doctor, but apparently many doctors now spend several hours a day entering data about their patients. It’s interesting how bringing technology into a field can reshape jobs, and may sometimes mean that new data-clerk roles need to be created.
So Spotify is now offering advertisers targeted ads based on data about your listening patterns. But I’m not sure how someone buying a 15-second audio ad spot would figure out whether a listener of smooth jazz is more desirable than, say, a listener of heavy metal.
Data centers are known to be big energy wasters. So it’s cool to see Google testing AI systems to improve the energy efficiency of its data centers. There’s something very meta here: the data centers are where AI gets trained and now AI can help them be more energy efficient, all the while creating more data to store.
Quirky Corner:
Not sure if any of you are playing Pokemon, but beyond its goofy fun is some seriously advanced use of mapping technology. Here’s a good Bloomberg story on the start-up behind that.
Love this video of Facebook’s solar-powered, Internet-beaming plane.
Last, if you’re spending time at the beach this summer and you want to know what’s across the ocean from you, take a look at these maps. It’s not always what you think.
Braxton McKee is the technical lead and founder of Ufora, a software company that has built an adaptively distributed, implicitly parallel runtime. Before founding Ufora with backing from Two Sigma Ventures and others, Braxton led the ten-person MBS/ABS Credit Modeling team at Ellington Management Group, a multi-billion dollar mortgage hedge fund. He holds a BS (Mathematics), MS (Mathematics), and M.B.A. from Yale University.
MLconf Industry Impact Student Research Award, Sponsored by Google
Last year, we started a new award program called the MLconf Industry Impact Student Research Award, sponsored by Google. Our committee of distinguished ML professionals reviewed several nominations sent in by members of the MLconf community. Several great researchers were nominated, and the committee decided to award two students whose work, it believes, has the potential to disrupt the industry in the future. The two winners announced at MLconf SF 2015 were UC Irvine student Furong Huang and UC Berkeley student Virginia Smith.
We’ve partnered with Google again to host the award for 2016. This year, we’re offering more direction to members of the community regarding the areas of research we’re interested in. Topics include, but certainly are not limited to:
- Natural Language Processing
- Deep Learning
- Sketching and Randomized Algorithms
- Game Theory
- Community Detection
- Large-Scale Clustering
- Time Series
- Image Analysis
- Bayesian Non-Parametrics
- Topic Models
- Probabilistic Programming
We’re interested in other areas of research as well; please don’t let this short list keep you from submitting interesting work. If you have any questions, please email submissions@mlconf.com.
Submissions will be accepted until Friday, October 28th, 2016. The winner of the award will be announced at MLconf SF on 11/11/2016. Recommend someone using this form!
This Week in Data by Braxton McKee (CEO, Ufora) & MLconf Alumni Speaker, Issue #6
Kick Off:
One of the most important ingredients in building a new consumer app is data about your customers. Data about users’ preferences and past behaviors allows the app to be not just useful but personalized. For a long time, developers could get the data they needed from other web properties. But as the social media landscape matures and consolidates, the big properties are becoming increasingly tight-fisted with their data. As Ben Schippers writes in an interesting TechCrunch story, “Sure, you can have the data at desired speed, but once you hit the API threshold, the feed will go from Niagara Falls to a leaky sink faucet.” The big players need developers to build apps on their platforms to maintain their monopoly. So they promise developers access to rich data to lure them in, only to throw the walls up once those developers are finally successful. I’ve rarely seen as clear an exposition of how the data you and I produce as we use the internet has become a currency, and, more importantly, how that currency is controlled by an oligopoly. Ben’s piece is worth reading, and as you do, think about who should own your data, because it’s not an easy question.
This Week:
The battle between robots and humans is coming, and apparently we (collectively) have buried our heads in the sand. Here’s an interesting new chart from a Pew Research study finding that two-thirds of Americans believe robots will soon do most of the work humans currently do. The catch? Eighty percent of Americans think that won’t eliminate their own jobs. These are interesting questions to put to ourselves: will robots replace lots of work, and why would your job be protected?
——
MIT Technology Review has a good inside look at the servers Facebook uses to solve A.I. problems. Facebook, like Google and lots of other companies, is using huge farms of GPU chips to train its neural networks (the key component in all the new AI solutions). One thing that’s cool is that Facebook has been open sourcing its server designs so others can use them (and improve them). The story notes that Facebook is looking into making its own chips, too, something Google already does. Making a new chip is incredibly expensive, so this is a sign of how committed Facebook is to AI.
In Industry:
Putting the right story in front of the right readers is at the heart of the business model of many media companies today. A new tool called Multiworld Testing, created by Microsoft researchers, could help many of them. It implements a technique called reinforcement learning, where a system tries to learn which actions to take to get a reward. A great example of this problem is deciding which article a website like WSJ.com or NYTimes.com should show you next. It can play it safe and show you stories that past data indicate you will definitely like, but then the site never gets to test stories that differ much from those, and it misses out on discovering your other interests. Many media companies use traditional A/B testing, but Microsoft’s approach has already provided a 25% lift in clicks on suggested stories in Microsoft News. (A fairly technical write-up is posted here by John Langford, an extremely smart researcher at Microsoft.)
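To illustrate the explore/exploit trade-off at the heart of this, here is a minimal epsilon-greedy bandit sketch, a much simpler cousin of the contextual-bandit approach behind Multiworld Testing rather than Microsoft’s actual implementation; the article list and click simulation are made up:

```python
import random

ARTICLES = ["markets", "politics", "sports", "science"]
EPSILON = 0.1                         # fraction of traffic spent exploring
clicks = {a: 1.0 for a in ARTICLES}   # smoothed click counts per article
shows = {a: 2.0 for a in ARTICLES}    # smoothed impression counts

def choose_article():
    """Mostly show the historical best article, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ARTICLES)                        # explore
    return max(ARTICLES, key=lambda a: clicks[a] / shows[a])  # exploit

def record_feedback(article, clicked):
    shows[article] += 1
    if clicked:
        clicks[article] += 1

# Simulate a reader who secretly prefers science stories.
for _ in range(10000):
    a = choose_article()
    record_feedback(a, clicked=random.random() < (0.3 if a == "science" else 0.05))

print({a: round(clicks[a] / shows[a], 3) for a in ARTICLES})
```

Even this toy version converges on the reader’s hidden preference while still sampling the other categories, which is exactly the data a pure “play it safe” policy never collects.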
——
Keep your driver’s license handy. This week, Consumer Reports warned that society is not ready for autopilot. Yet the (extremely limited) data suggest an accident rate under autopilot of around 1 accident per 130 million miles, versus around 1 per 60 million miles for human drivers. That makes it seem like the Tesla autopilot is actually better than a person; certainly it’s not obviously much worse. So is Consumer Reports being overly cautious? I think so. And I suspect that a lot of future innovations from AI will face this same problem: we will naturally hold automated systems to a much higher standard than humans. Mostly, this is because preventing accidents requires us to restrict people’s behavior, which they don’t like. Restricting an AI’s behavior is much easier, and AIs don’t get a vote. At least not yet.
——
Computer vision is undergoing a renaissance. What used to be used for mundane tasks, like reading the amount written on a check at an ATM, is now fueling new robotics, facial recognition in social media, and driverless cars. I’ve always been interested in the subject, and I just ran across this slide deck by deep learning pioneer Yann LeCun (who now works at Facebook). It’s from a talk he gave at the 2015 Computer Vision Conference, and it’s a great survey of both the history of computer vision and current research, still worth reading through today!
Quirky Corner:
It’s fun to see people talk about how data can be used in sports, whether to perfect a golf swing or to track all the riders in the Tour de France.
Google’s DeepMind, the team behind AlphaGo, which beat a human Go champion for the first time earlier this year, is working on building an AI agent as smart as a rat. It’s a short video, and it’s interesting to hear how they think about testing their AI rat. Hint: pick a rat-behavior study from the ’60s and replace “cheese” with something more palatable to an AI.
What’s Happening at Ufora:
I spoke this past week on a panel called “Datafication in Asset Management,” moderated by my friend Peter Knez. It was a lively discussion, and the audience had some great questions about collecting and using alternative datasets as part of an asset management strategy. In particular, someone asked about the legality of building bots to pull data out of websites that require you to log in. Answer: probably OK, but the law here is largely untested.
I’ve been giving a lot of presentations lately, and I think this is a good write-up on how to present data effectively.
Braxton McKee is the technical lead and founder of Ufora, a software company that has built an adaptively distributed, implicitly parallel runtime. Before founding Ufora with backing from Two Sigma Ventures and others, Braxton led the ten-person MBS/ABS Credit Modeling team at Ellington Management Group, a multi-billion dollar mortgage hedge fund. He holds a BS (Mathematics), MS (Mathematics), and M.B.A. from Yale University.