Recently, Himani Agrawal interviewed Edo Liberty, Principal Scientist at AWS and Head of Amazon AI Labs, about his work in machine learning in industry and academia, as well as his upcoming presentation at MLconf on November 14th.
At AWS, we continue to strive to enable builders to build cutting-edge technologies faster in a secure, reliable, and scalable fashion. Machine learning is one such transformational technology that is top of mind not only for CIOs and CEOs, but also for developers and data scientists. In November 2017, we launched Amazon SageMaker to make authoring, training, and hosting ML models easier, faster, and more reliable. Now, thousands of customers are using Amazon SageMaker and building ML models on top of their data lakes in AWS.
While building Amazon SageMaker and applying it to large-scale machine learning problems, we realized that scalability is one of the key aspects that we need to focus on. So, when designing Amazon SageMaker we took on a challenge: to build machine learning algorithms that can handle an infinite amount of data. What does that even mean though? Clearly, no customer has an infinite amount of data. Nevertheless, for many customers, the amount of data that they have is indistinguishable from infinite.
Bill Simmons, CTO of Dataxu, states, “We process 3 million ad requests a second – 100,000 features per request. That’s 250 trillion ad requests per day. Not your run-of-the-mill data science problem!” For these customers and many more, the notion of “the data” does not exist. It’s not static. Data always keeps being accrued. Their answer to the question “how much data do you have?” is “how much can you handle?”

To make things even more challenging, a system that can handle a single large training job is not nearly good enough if training jobs are slow or expensive. Machine learning models are usually trained tens or hundreds of times. During development, many different versions of the eventual training job are run. Then, to choose the best hyperparameters, many training jobs are run simultaneously with slightly different configurations. Finally, re-training is performed every few minutes, hours, or days to keep the models updated with new data. In fraud or abuse prevention applications, models often need to react to new patterns in minutes or even seconds!

To that end, Amazon SageMaker offers machine learning algorithms that train on indistinguishable-from-infinite amounts of data both quickly and cheaply. This sounds like a pipe dream. Nevertheless, this is exactly what we set out to do. This talk lifts the veil on some of the scientific, system design, and engineering decisions we made along the way.
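The "indistinguishable-from-infinite" setting described above can be made concrete with a toy sketch: a model trained in a single pass over an unbounded stream of mini-batches, using memory that stays constant no matter how long the stream runs. This is a generic illustration of streaming/online training, not SageMaker's actual implementation; the function names and the synthetic data stream are invented for the example.

```python
import random

def data_stream(n_batches, batch_size, dim=3, seed=0):
    """Simulate an unbounded stream of (features, label) mini-batches.

    In a real system the stream would never end; here we cap it so the
    example terminates. Labels follow a fixed linear rule so we can
    check that training recovers it.
    """
    rng = random.Random(seed)
    true_w = [0.5, -2.0, 1.0]  # hidden weights the learner should recover
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = [rng.uniform(-1, 1) for _ in range(dim)]
            y = sum(wi * xi for wi, xi in zip(true_w, x))
            batch.append((x, y))
        yield batch

def train_streaming(stream, dim=3, lr=0.1):
    """One pass of mini-batch SGD on squared error.

    Memory use is O(dim): only the current weights and one batch are
    ever held, regardless of how much data flows through.
    """
    w = [0.0] * dim
    for batch in stream:
        # Average gradient of (w.x - y)^2 over the mini-batch.
        grad = [0.0] * dim
        for x, y in batch:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(dim):
                grad[j] += 2 * err * x[j] / len(batch)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

w = train_streaming(data_stream(n_batches=2000, batch_size=32))
print([round(wi, 2) for wi in w])  # recovers roughly [0.5, -2.0, 1.0]
```

The point of the sketch is the shape of the computation, not the model: because each batch is consumed and discarded, the same loop handles a terabyte or a petabyte, and re-training on fresh data is just running the loop again from the current weights.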
HA) Tell us something about yourself and your work at Amazon.
EL) As a manager, I make sure our lab hires only the absolute best people and positions them to deliver great value for our customers and conduct high quality research. As a scientist, I still conduct some basic research and help shape future products and capabilities for AWS products and services.
HA) You have been a research pioneer in both academia and industry. What are some of the pros and cons of working in each, pertaining to machine learning research?
EL) I don’t think about it as a pros/cons thing. Academic and industry jobs are different. People enjoy and thrive in different settings. At the core, I think the research itself is similar. But we also spend time doing other things. In academia, that is running a school. In industry, that is running a software company. If you enjoy developing software, running projects, and working with customers, industry is a great place for you. If you would rather teach, apply for grants, and host students, academia might be a better fit for you. I am personally of the former type (although I also really like teaching…).
HA) You have presented at MLconf NYC twice, while in your position at Yahoo. How will your presentation this year differ from those presentations?
EL) Since I joined AWS two years ago, we have launched one of the most widely used ML platforms in the world. Working with AWS customers taught us what they really need and value. I look forward to sharing those lessons at MLconf.
HA) From your article (https://aws.amazon.com/blogs/machine-learning/in-the-research-spotlight-edo-liberty/), I learned that it is your passion to be a pioneer and conquer new challenges. You are currently working on the SageMaker platform, which runs hundreds of different algorithms to support companies of all sizes. I would like to know some of the other open machine learning challenges that you envision tackling in the future.
EL) As a whole, I think we are only starting to understand how to connect applications to data funnels to machine learning. There are hundreds of products and companies trying to make some headway on this problem. I think we are still very far from having a solution.
HA) I observed that your educational training is rooted in applied mathematics. Furthermore, from the above article, I read that: “Getting as close as possible to what is provably optimal — this is how I measure success for my team. We are operating at the edge of what is theoretically possible.” What is the role of applied mathematics in your day-to-day work developing scalable machine learning at AWS? Is your research group also simultaneously working on pushing the limits of what is theoretically possible?
EL) Definitely. Our work relies on very recent results, and much of it isn’t even published yet. Regarding the role of math: in machine learning, algorithms behave very differently depending on the data they are presented with. In some sense, you can think of the data as being part of the code/program itself. When building a system, you don’t have the data yet (the customer brings it later). This creates a problem: how can you verify that your code is correct if a part of it (the data) is still missing? That is where math comes in. You should be able to argue and prove mathematically that your algorithm behaves correctly for any data it might receive. While this is often hard to do, it’s critical when building algorithms for tens of thousands of customers.
HA) “If you ask me to write you a song, a melody, the end result isn’t right or wrong. It’s subjective.” I am curious to know some of the challenges in your day-to-day job whose evaluation is rather subjective. What approach do you take as a mathematician to evaluate such a challenge?
EL) There are many different answers to this question, and they depend heavily on the context. Usually, I would try to find a measurable metric to act as a surrogate. That might sound simple, but it is harder than it seems.
HA) Which journal papers and industry conferences do you follow?
EL) Unfortunately, I don’t have time to “follow” conferences, especially now that they have become so large. What I do instead is research specific topics when trying to solve a given problem. This leads me to papers on theory and application in machine learning. Venues usually include SODA, KDD, VLDB, COLT, FOCS, NIPS, ICALP, ESA, and many others. I’m an eclectic reader.
HA) Who are your heroes/role models in the field of machine learning?
EL) I’m actually super impressed with many of the young researchers and engineers out there. They are fearless in choosing big ambiguous problems and ferocious in executing on them. I’m very lucky to have many such heroes working on my team, all around me at AWS, and as my academic colleagues.
See Edo present at MLconf 2018 this November. Use this link to save 20% on tickets.
About: Edo Liberty
Edo Liberty is a Principal Scientist at AWS and the head of Amazon AI Labs. Edo received his B.Sc. in Physics and Computer Science from Tel Aviv University and his PhD in Computer Science from Yale University, where he was also a postdoctoral fellow in Applied Mathematics. Edo then co-founded and ran a New York based startup. Later, Edo joined Yahoo Research in Israel and taught Data Mining at Tel Aviv University for three years. Before joining Amazon, Edo led Yahoo’s independent Research in New York and Yahoo’s Scalable Machine Learning group. His research interests include data mining, optimization, streaming and online algorithms, machine learning, and numerical linear algebra. He is the author of more than thirty academic papers on these topics, including award-winning works on streaming matrix approximation and fast random projections. Edo is a frequent keynote speaker, tutorial presenter, and committee member at international conferences.