MLconf NYC 2014 was a full-day conference on machine learning held on Friday, April 11th at The Helen Mills Event Center in New York City. We had presentations from Google, Yahoo, Netflix, Cloudera, Spotify, 0xdata, and many more. Machine learning and big data processing are now ubiquitous in the operations of modern companies.
New machine learning algorithms, platforms, and applications have emerged in industry and academia, and the conference was a chance to learn about the most fascinating advances in ML from the experts. Corinna Cortes, the head of Google Research NY, presented the largest-scale ML deployment, while Claudia Perlich, a serial KDD Cup champion, discussed her experience in practical machine learning. We hosted a series of talks on bleeding-edge ML platforms such as H2O, LogicBlox, Cloudera, and Ayasdi. Samantha Kleinberg presented her new results on inference from uncertain data in diabetes using continuous sensor data, with a tie-in to Google Glass. Animashree Anandkumar introduced the world of tensors, showing how to apply them to machine learning problems.
Corinna Cortes, Head of Research, Google
Corinna Cortes is the Head of Google Research, NY, where she is working on a broad range of theoretical and applied large-scale machine learning problems. Prior to Google, Corinna spent more than ten years at AT&T Labs – Research, formerly AT&T Bell Labs, where she held a distinguished research position. Corinna’s research is particularly well known for her contributions to the theoretical foundations of support vector machines (SVMs), for which she and Vladimir Vapnik jointly received the 2008 Paris Kanellakis Theory and Practice Award, and for her work on data mining in very large data sets, for which she was awarded the AT&T Science and Technology Medal in 2000. Corinna received her MS degree in Physics from the University of Copenhagen and joined AT&T Bell Labs as a researcher in 1989. She received her Ph.D. in computer science from the University of Rochester in 1993. Corinna is also a competitive runner and a mother of two.
Google’s diverse and ever-growing search domains present new challenges to machine learning. They require designing effective similarity measures that are efficient to compute. In this talk, I will discuss several algorithmic advances and implementation solutions we have designed at Google to handle some of these problems for metric spaces as well as graph-based representations. The talk will discuss machine learning examples from a number of Google applications including Image, YouTube, and Structured Data Search.
Ted Willke, Senior Principal Engineer and Director, Datacenter Software Division, Intel
Ted Willke is the Director of Architecture & Technology for the Datacenter Software Division (DSD) at Intel Corporation. As the division’s “chief technology officer,” Ted is responsible for driving innovation into its big data, cloud computing, and HPC software products. Ted joined DSD this year after transforming his Intel Labs research on large-scale machine learning and data mining systems into a commercial operation. Prior to joining Intel Labs in 2010, Ted spent 12 years developing server I/O technologies and standards within Intel’s other product organizations. He holds a Doctorate in electrical engineering from Columbia University, where he graduated with Distinction. He has authored over 25 papers in book chapters, journals, and conferences, and he holds 10 patents. He won the MASCOTS Best Paper Award in 2013 for his work on Hadoop MapReduce performance modeling and an Intel Achievement Award this year for his work on graph processing systems.
Intel is working with strategic partners to build datacenter software from the silicon up that provides for a wide range of advanced analytics on Apache Hadoop, Apache Spark, and other innovative open source projects. These projects can help data scientists deftly process a wide range of data models, from simple tables to multi-property graphs, using sophisticated machine learning algorithms and data mining techniques. However, technology adoption remains hindered by scarce expertise and laborious workflows, especially when it comes to constructing and analyzing sophisticated models, like large-scale graphs. The Datacenter Software Division is developing the Intel Analytics Toolkit to bring together the best capabilities of various analytics engines, data stores, and ETL processes for data analysts who are not particularly interested in software engineering or cluster architecture. In this talk, I will describe the Intel Analytics Toolkit, present large-scale graph analytics case studies, and demonstrate how easy it can be to program a commercial data workflow.
Yael Elmatad, Data Scientist at Tapad
Yael Elmatad is a Data Scientist at Tapad. Prior to Tapad, Dr. Elmatad was a Faculty Fellow and Assistant Professor in the NYU Physics Department, specializing in the use of high-performance computing to study model space parameter optimization. Dr. Elmatad holds a PhD in Physical Chemistry from University of California, and a BS in Mathematics, Computer Science and Hebrew Language from New York University.
With so many tools available to the modern data scientist, the distance we put between ourselves and our data can have unintended consequences that often make our tasks more difficult. In my talk, I will discuss how Tapad uses sources of data to build our Device Graph to better connect with consumers in a unified way across their multiple screens. I will discuss how we moved away from using third-party data providers to directly examining the source data to remove hidden biases and unnatural correlations. I will also discuss how, despite sparsity in our source data, we can use it to better model patterns of use.
Edo Liberty, Research Scientist, Yahoo
Before joining Yahoo!, Edo was a postdoctoral fellow in the Program in Applied Mathematics at Yale University, where he also received his PhD in Computer Science. Prior to that, he received his B.Sc. in Physics and Computer Science from Tel Aviv University.
In this talk I will present the streaming computational model and argue that it is especially suited for large-scale machine learning and data mining. I will present, as an example, the well-known Misra-Gries algorithm (a.k.a. frequent items) and an example usage of it in the context of Yahoo Mail. This talk will encourage practitioners to adopt known streaming algorithms and scientists to develop new ones.
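For readers unfamiliar with the frequent-items idea named in the abstract, the following is a minimal Python sketch of a Misra-Gries summary. It is an illustration only and not the implementation used at Yahoo Mail; the function name `misra_gries` and the parameter `k` are assumptions introduced here. The summary keeps at most k - 1 counters, so memory does not grow with the stream, and any item that occurs more than n/k times in a stream of length n is guaranteed to survive in the result (its stored count underestimates the true count by at most n/k).

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k - 1 counters.

    `stream` is any iterable of hashable items; `k` (an illustrative
    parameter name) controls the memory/accuracy trade-off.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: count this occurrence.
            counters[item] += 1
        elif len(counters) < k - 1:
            # A counter slot is free: start tracking the item.
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream of length 12 in which "a" occurs 6 times (more than 12 / 3).
example = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
print(misra_gries(example, k=3))  # {'a': 3}
```

The surviving counts are lower bounds, so a second pass over the data (when one is possible) is the usual way to confirm which candidates really exceed the n/k threshold.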
Shan Shan Huang, VP Product Management, LogicBlox, Inc.
As the VP of Product Management at LogicBlox, Shan Shan works on bringing to market a database technology that enables “smart applications”: applications that perform complex and often predictive analysis over a combination of large historical data sets and real-time feeds.
LogicBlox’s mission is to simplify the way we store, process, analyze, and present data in order to build interactive and data-intensive applications. The typical application infrastructure today involves a variety of specialized data storage systems, integrated together with ETL tools. The increasing interest in applying machine learning and prescriptive methods to data is adding even more tools, and more integration layers, into this infrastructure. Integration drives up fragility. Integration drives down developer productivity and application performance. LogicBlox tackles this problem from the database angle, by building a unifying database that is capable of handling a wide variety of query workloads: graph, multi-dimensional analytics, and transactional updates. For the right type of applications, LogicBlox can replace a number of specialized databases and the need to integrate them together. In this talk, we discuss some key insights and principles behind the LogicBlox database, as well as its relevance to the data science community.
Pek Lum, Chief Data Scientist, Ayasdi
Pek Lum leads Ayasdi’s products and solutions team. After more than a decade in the life sciences industry, Pek has the passion to help find a cure for cancer and other diseases in her lifetime, and believes that Ayasdi’s technology may help solve these difficult problems. Pek was trained in molecular genetics and cell biology for her Ph.D. at the University of Washington. Her work has been widely published in scientific and medical journals, and her research has contributed to discoveries in drug development and the understanding of complex diseases. Pek spent 10 years at Rosetta Inpharmatics and was part of Merck’s $620M acquisition of Rosetta in 2001. Rosetta is known in many circles as one of the companies that started the genetics revolution, with its bioinformatics solution for deciphering enormous amounts of gene-expression data to derive insights into numerous potential therapeutic targets for drug discovery. Pek also has a Master of Science in Biochemistry from Hokkaido University in Japan, speaks more than five languages, and lives in Palo Alto with her husband and two children.
There is more data than ever in the world today. The sheer amount of data being collected is tremendous and, coupled with its complexity, has made it almost impossible for current tools and approaches to manage easily. Query-based approaches will reach their limits rather quickly because it is impossible to think of every possible query needed to extract information from such complex data. We cannot query what we don’t know!
Chang Wang, Research Staff Member, IBM Watson
Chang Wang is a research staff member in the Watson Technologies Department at IBM. He is a core member of the team that developed Watson, the machine that beat the best human players on the game show Jeopardy!. Prior to joining IBM, Mr. Wang received his PhD at the University of Massachusetts at Amherst and did his dissertation on the topic of manifold alignment based transfer learning.
There exists a vast number of knowledge sources and ontologies in the medical domain. Such information is also growing and changing extremely quickly, making it difficult for people to read, process, and remember. The combination of recent developments in information extraction and the availability of unparalleled medical resources thus offers us the unique opportunity to develop new techniques to help healthcare professionals overcome the cognitive challenges they face in differential diagnosis. In this talk, I will present a manifold learning system that the IBM Watson team developed for medical relation extraction. Our model is built upon a medical corpus containing 11 gigabytes of text and is designed to accurately and efficiently detect the key medical relations that can facilitate clinical decision making. To address the big data challenges, our approach integrates domain-specific parsing and typing systems, and can utilize labeled as well as unlabeled examples.
Todd Merrill, CTO, HireIQ
Todd Merrill brings operating excellence and a deep technical background to HireIQ. Todd combines executive leadership experience with deep technical knowledge in advanced speech and security technologies. As HireIQ’s CTO, Todd guides the research and development effort responsible for the creation of commercial software applications that use advanced speech technologies to extract emotional characteristics from recorded candidate interviews and agent interactions to determine job fit and predict performance and tenure. Prior to joining HireIQ, Todd served as CEO of Global Crypto Systems, having been a part of the founding team from 2007. Todd’s background includes notable security companies including ISS, nCircle, and AirDefense, which gives him a great perspective on the value that innovative uses of advanced technologies can bring to enterprise organizations. Todd holds a B.S. and M.Eng. from the University of Florida and holds several US and international patents relating to audio analytics, fundamental cable modem design, as well as steganography and cryptography.
HireIQ helps companies to screen applicants for low-qualification, high-volume positions in call centers. Our customers frequently experience 10% turnover _a month_ and spend $5000/person to replace each employee that turns over. We are developing a novel approach to selecting candidates by extracting emotional content from recorded phone screens and matching outcome data for those candidates who were hired to feed a continuous Machine Learning loop. I will discuss our production environment, early results, and the business impacts that we are able to achieve by switching from a traditional statistical modeling shop to a Machine Learning approach.
Josh Wills, Director of Data Science, Cloudera
Josh Wills is Cloudera’s Senior Director of Data Science, working with customers and engineers to develop Hadoop-based solutions across a wide range of industries. He is the founder and VP of the Apache Crunch project for creating optimized MapReduce pipelines in Java and lead developer of Cloudera ML, a set of open-source libraries and command-line tools for building machine learning models on Hadoop. Prior to joining Cloudera, Josh worked at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+.
Cloudera’s Data Science Team has a simple mission: build an analytics infrastructure so awesome that it makes Google’s Ads Quality Team seethe with jealousy. To that end, I’ll give an overview of Cloudera’s current data science tools and our future roadmap, including Oryx for building and serving machine learning models, Gertrude for multivariate testing, and Impala for ludicrously high-performance SQL queries against HDFS.
Claudia Perlich, Chief Scientist, Dstillery
Claudia Perlich currently acts as chief scientist at Dstillery (previously m6d) and in this role designs, develops, analyzes, and optimizes the machine learning that drives digital advertising. She has published more than 50 scientific articles and holds multiple patents in machine learning. She has won many data mining competitions and best paper awards at KDD and is acting as General Chair for KDD 2014. Before joining m6d in February 2010, Perlich worked in the Predictive Modeling Group at IBM’s T. J. Watson Research Center, concentrating on data analytics and machine learning for complex real-world domains and applications. She holds a PhD in information systems from NYU and an MA in computer science from Colorado University and teaches in the Stern MBA program at NYU.
Display advertising is a great playground for data enthusiasts who want to explore the opportunities and pitfalls of the ‘Big Data’ promise. At Dstillery we collect about 10 billion user events daily, representing the digital and geo-physical journeys of millions of people on desktops, tablets, and mobile phones. This talk will touch on a number of challenges including privacy-preserving representations, robust high-dimensional modeling, large-scale automated learning systems, transfer learning, and fraud detection. Along the way I will touch on a few higher-level lessons and pose the paradox of big data and predictive modeling: You never have the data you need.
Samantha Kleinberg, Computer Science department, Stevens Institute of Technology
Samantha Kleinberg is an Assistant Professor in the Computer Science department at Stevens Institute of Technology. After completing her PhD in Computer Science in 2010 at NYU, she spent two years as a postdoctoral Computing Innovation Fellow at Columbia University, in the Department of Biomedical Informatics. Before that, she was an undergraduate at NYU in Computer Science and Physics. Her book, Causality, Probability, and Time, is now available in print and electronically.
One of the key problems we face with the accumulation of massive datasets (such as electronic health records and stock market data) is the transformation of data to actionable knowledge. In order to use the information gained from analyzing these data to intervene to, say, treat patients or create new fiscal policies, we need to know that the relationships we have inferred are causal. Further, we need to know the time over which the relationship takes place in order to know when to intervene. However, these observational data are often noisy and prone to missing values. In this talk I discuss methods for directly incorporating data uncertainty into the inference process and show how this can be applied to mobile sensor data from people with diabetes.
Justin Basilico, Senior Researcher/Engineer in Recommendation Systems at Netflix
Justin Basilico is a Lead Researcher/Engineer in the Algorithms Engineering group at Netflix where he does applied research at the intersection of machine learning, ranking, recommendation, and large-scale software engineering. Prior to Netflix, he worked on machine learning in the Cognitive Systems group at Sandia National Laboratories. He is also the co-creator of the Cognitive Foundry, an open-source software library for building machine learning algorithms and applications.
Building a high-quality recommendation system requires a careful balancing act of handling large volumes of data, choosing and adapting good algorithms, keeping recommendations fresh and accurate, remaining responsive to user actions, and also being flexible to accommodate research and experimentation. In this talk, I will discuss how we use machine learning to drive our recommendation approach. I will describe some of the data, models, metrics, and experimental methodology we use to effectively apply machine learning in our product.
Erik Bernhardsson, Engineering Manager, Spotify
Erik Bernhardsson is an Engineering Manager at Spotify, focusing on music discovery and machine learning. He has a master’s degree in Physics from KTH in Stockholm.
Machine learning powers a range of features at Spotify, including the Discover page, the radio feature, and much more. We will go through how Spotify uses collaborative filtering for music recommendations, and in particular how we rely heavily on latent factor models. We will also discuss many real-world challenges around how to optimize for user engagement, how to learn from user feedback, and how to scale algorithms to hundreds of billions of data points.
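As a rough illustration of what a latent factor model for collaborative filtering looks like, here is a small, self-contained Python sketch. It is not Spotify's system: it fits a toy matrix factorization to explicit ratings with stochastic gradient descent, whereas a production music recommender would more likely work with implicit feedback at a very different scale. All names (`factorize`, `predict`, `mse`) and hyperparameters are assumptions chosen for the example.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.02,
              epochs=300, seed=0):
    """Fit user and item factor vectors to (user, item, rating) triples."""
    rng = random.Random(seed)
    # Small random initial factors for every user and item.
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for f in range(rank):
                uf, vf = U[u][f], V[i][f]
                # Gradient step on squared error with L2 regularization.
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V


def predict(U, V, u, i):
    """Predicted rating is the dot product of user and item factors."""
    return sum(a * b for a, b in zip(U[u], V[i]))


def mse(ratings, U, V):
    """Mean squared error over the observed ratings."""
    return sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings) / len(ratings)


# Toy data: users 0-1 prefer items 0-1, users 2-3 prefer items 2-3.
RATINGS = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]

U, V = factorize(RATINGS, n_users=4, n_items=4)
print("training MSE:", mse(RATINGS, U, V))
print("predicted rating for user 0, item 3:", predict(U, V, 0, 3))
```

The pair (user 0, item 3) is not in the training data, so the last line shows how the learned factors fill in an unobserved entry, which is the core of the recommendation step.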
Irene Lang, Math Hacker, 0xdata
Irene is the paper and pencil Math Hacker at H2O, providing a bridge between rigorous analytics and analysis at scale. She is responsible for ensuring the mathematical accuracy of the implemented algorithms. In addition to her role as an in-house data scientist, Irene volunteers with local organizations and schools to support traditionally underrepresented students pursuing education in STEM. She holds a B.A. in Economics from Mills College, Oakland.
Anqi Fu, Data Scientist, 0xdata
Anqi Fu is a math hacker at 0xdata, currently working on our R interface for H2O. Ms. Fu graduated with a BS in Electrical Engineering and a BA in Economics from the University of Maryland, College Park. At Stanford, she graduated with an MA in Business Research from the GSB, and is currently finishing up an MS in statistics.
One of the most frequent questions the math team at 0xdata faces is “Will this technology replace my data scientists?” The short answer is no. The long answer is the topic of the presentation, where Anqi and I will discuss how two different approaches to the simple question – what movie should we go see together – produce very different answers, and demonstrate how working with big data at scale won’t replace data scientists, but it will make them more productive.
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine
Anima Anandkumar has been a faculty member in the EECS Dept. at U.C. Irvine since August 2010. Her research interests are in the area of large-scale machine learning and high-dimensional statistics. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a visiting faculty member at Microsoft Research New England in 2012 and a postdoctoral researcher in the Stochastic Systems Group at MIT from 2009 to 2010. She is the recipient of the Microsoft Faculty Fellowship, ARO Young Investigator Award, NSF CAREER Award, and IBM Fran Allen PhD fellowship.
Incorporating latent or hidden variables is a crucial aspect of statistical modeling. I will describe the wide applicability of tensor decomposition techniques for learning many popular models including topic models, hidden Markov models, network community models, and Gaussian mixtures. In addition to possessing strong theoretical guarantees, the tensor methods are fast, accurate, and highly parallelizable in practice.
Xiangrui Meng, Software Engineer, Databricks
Xiangrui Meng is a software engineer at Databricks. His main interests center around developing and implementing scalable algorithms for scientific applications. He has been actively involved in the development of Spark MLlib since he joined. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His thesis work at Stanford is on randomized algorithms for large-scale linear regression.
Spark is a new cluster computing engine that is rapidly gaining popularity. It is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. Spark was designed to both make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning. MLlib is a Spark subproject providing machine learning primitives. In this talk, we’ll demonstrate how to use Spark’s high-level API to implement scalable machine learning algorithms, and how MLlib integrates with other components (Streaming, SQL, and GraphX) of the Spark distribution to create practical machine learning pipelines. We’ll also show new features in the upcoming v1.0 release.