Guest Blog

Machine Learning with a Twist: Detecting Structural Differences in Complex Data

Supervised learning, the most developed form of machine learning, starts with the preparation of labeled training data that are rich and representative (e.g., a large number of typical “spam” or “normal” emails). The machine teaches itself with these examples, such that it can label new data (i.e., new emails) that are unseen but similar to the training data. If the new data turn out to be different (say, spam with other patterns), the typical advice is to enrich the training set so that it reflects the real process that generates the data we would like to label.

This approach works that way because its ultimate goal is to automate the labeling of new data as accurately as possible. But accurate automation on new data is not always the goal. What if, instead, we are interested in the discrepancy between the training data and new data?

In a new research paper, “Reading China: Predicting Policy Change with Machine Learning,” we show that machine learning, with a twist, can uncover structural changes underlying complex data. We train a neural network algorithm that “reads” the People’s Daily, the mouthpiece of the Communist Party of China, and classifies whether each article appears on the front page — part of the People’s Daily editor’s job. It turns out that such an algorithm can be used to detect changes in the newspaper’s issue priorities, which, in turn, have profound implications for China’s government policies.

The algorithm tries to mimic the mind of an avid People’s Daily reader who reads its articles and tries to figure out how its editor prioritizes them. If the reader had digested, say, five years’ worth of articles, they would be able to say, with a certain degree of confidence, what’s on the editor’s mind and what kind of articles do or do not “deserve” the front-page status.

But what we are really interested in is whether such an editorial paradigm has shifted over time. To answer this question, suppose the reader applies the paradigm acquired from the five years of observations (the training data) to the articles published in the next quarter (the new data). The reader may be surprised by the new articles: their educated guess about the front page may turn out exceptionally good or particularly poor, and either outcome is a signal of change. While a small surprise may well be taken as noise, a strong signal would convince the reader that their understanding of the editor’s mind is no longer valid and that the priorities of the People’s Daily have fundamentally changed.

We construct a quarterly indicator, which we call the Policy Change Index (PCI) of China, that captures the amount of surprise the algorithm has in each quarter, compared to the previous five years. The key to the design of the PCI is the realization that changes in underlying patterns come from the discrepancy — not the consistency — between what the algorithm has learned and what it will encounter.
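The idea of measuring surprise can be illustrated with a minimal sketch. It assumes plain accuracy as the performance measure and an absolute difference as the surprise; the metric actually used in the paper may differ, and the classifier and articles below are toy stand-ins invented for illustration.

```python
from typing import Callable, Sequence, Tuple

Article = Tuple[str, bool]  # (text, appeared_on_front_page)


def accuracy(predict: Callable[[str], bool], articles: Sequence[Article]) -> float:
    """Share of articles whose front-page status the classifier gets right."""
    hits = sum(predict(text) == label for text, label in articles)
    return hits / len(articles)


def surprise(predict, training_window, next_quarter) -> float:
    """Absolute gap between performance on the training window and on the next quarter.

    ASSUMPTION: accuracy and an absolute difference stand in for whatever
    performance metric the paper actually uses.
    """
    return abs(accuracy(predict, training_window) - accuracy(predict, next_quarter))


def toy_predict(text: str) -> bool:
    """Hypothetical 'trained' classifier: front page if the article mentions reform."""
    return "reform" in text


# Toy data: the classifier fits the training window perfectly.
training_window = [("reform speech", True), ("local weather", False),
                   ("reform plan", True), ("sports results", False)]
# A quarter that follows the old editorial pattern...
stable_quarter = [("reform update", True), ("harvest report", False)]
# ...and one in which the editor's priorities have flipped.
shifted_quarter = [("reform critique", False), ("harvest campaign", True)]

print(surprise(toy_predict, training_window, stable_quarter))   # 0.0
print(surprise(toy_predict, training_window, shifted_quarter))  # 1.0
```

A quarter that matches the learned pattern produces no surprise, while a quarter in which the labels flip produces the maximum. The signal comes from the discrepancy, not from how well the classifier fits its own training window.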

The indicator takes its name from the fact that detecting changes in the newspaper’s priorities allows us to predict changes in the Chinese government’s policies. This is because the People’s Daily is at the nerve center of China’s propaganda system, an essential function of which is to mobilize resources to attain the government’s policy goals. Moreover, before major policy changes are made, the government often finds it necessary to justify those changes to the public and convince it that they are the right moves for the country. Hence, while our algorithm is detecting propaganda change in real time, it’s really predicting policy changes for the future.

When put to the test against the ground truth — policy changes in China that did occur in the past — the PCI could have correctly predicted many of the most critical junctures in the history of China’s economy and reforms. These major changes include the beginning of the Great Leap Forward in 1958, that of the economic reform program in 1978, and, more recently, a reform speed-up in 1993 and a reform slow-down in 2005, among others.

Once we allow for the discrepancy between the training data and new data, the machine learning tools open the door to a variety of applications where detecting structural changes in complex data is the central question.

The interested reader can find more details about China’s policy changes, our methodology, and its potential applications in our research paper or the website of our project. We have also released the source code of the project on GitHub, so that the academic, business, and policy communities can not only replicate our findings but also apply our method in other contexts.

Julian TszKin Chan

Let’s Talk Bayesian Optimization

For me as a machine learning practitioner, “Bayesian optimization” has always been equivalent to a “magical unicorn” that would transform my models into super-models. So off I went to understand the magic that is Bayesian optimization and, through the process, connect the dots between hyperparameters and performance.

Through hyperparameter optimization, a practitioner identifies free parameters in the model that can be tuned to achieve better model performance. There are a few commonly used methods: hand-tuning, grid search, random search, evolutionary algorithms, and Bayesian optimization. Hand-tuning is a manual guess-test-revise process that relies on a practitioner’s previous experience and knowledge. Grid search exhaustively evaluates every combination of values on a predefined grid. Random search evaluates configurations sampled at random from the space of possible hyperparameter combinations. And evolutionary algorithms iterate on the model parameters by comparing “mutation-like” configurations with the best-performing configurations. The goal of each of these methods is to find the globally optimal hyperparameter values.
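To make the difference between grid search and random search concrete, here is a minimal sketch. The `score` function is a hypothetical stand-in for training a model and measuring validation performance, and the hyperparameter names and ranges are assumptions chosen for illustration.

```python
import itertools
import random


def score(lr: float, dropout: float) -> float:
    """Hypothetical validation score; peaks at lr=0.01, dropout=0.5."""
    return -((lr - 0.01) * 100) ** 2 - (dropout - 0.5) ** 2


def grid_search(lrs, dropouts):
    """Evaluate every combination on the grid and keep the best one."""
    return max(itertools.product(lrs, dropouts), key=lambda cfg: score(*cfg))


def random_search(n_trials: int, seed: int = 0):
    """Evaluate n_trials configurations drawn at random from the ranges."""
    rng = random.Random(seed)
    trials = [(10 ** rng.uniform(-3, -1),  # learning rate, log-uniform
               rng.uniform(0.0, 1.0))      # dropout, uniform
              for _ in range(n_trials)]
    return max(trials, key=lambda cfg: score(*cfg))


# Both searches get the same budget of nine evaluations.
grid_best = grid_search([0.001, 0.01, 0.1], [0.2, 0.5, 0.8])
random_best = random_search(n_trials=9)
print("grid best:  ", grid_best)
print("random best:", random_best)
```

With three values per hyperparameter the grid already costs nine evaluations, and each additional hyperparameter multiplies that number by three. The random-search budget, by contrast, is set independently of the number of hyperparameters.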

The problem, however, is that each of these methods has a fatal flaw. Hand-tuning results may not be reproducible from practitioner to practitioner, and the process is difficult to scale. Grid search cannot efficiently optimize models with more than 4 dimensions, due to the curse of dimensionality [5]. Although random search is more capable than grid search, its naive approach is still both time-consuming and expensive and is more likely to settle in a local optimum. Furthermore, it does not scale beyond 10 parameters [5]. Evolutionary algorithms are great if you have access to nearly unlimited compute that can be run in parallel, but are often difficult to implement if you do not.

Bayesian optimization democratizes access to these algorithmic superpowers by relaxing each of these constraints. Originally popularized as a way to break free from the grid, Bayesian optimization efficiently uncovers the global maximum of a black-box function in the defined parameter space. In the context of hyperparameter optimization, this unknown function can be the objective function, the accuracy or loss value for a training or validation set, entropy gained or lost, AUC for ROC curves, A/B test results, computation cost per epoch, model size, or the reward amount for reinforcement learning, among others. Here is a quick summary of how Bayesian optimization works:

Step 1:

Initialize the process by sampling the hyperparameter space, either randomly or with a low-discrepancy sequence, and evaluating the function at those points to obtain the initial observations [3].

Step 2:

Build a probabilistic model (surrogate model) to approximate the true function based on given hyperparameter values and their associated output values (observations). In this case, fit a Gaussian process to the observed data from step 1. Use the mean from the Gaussian process as the function most likely to model the black box function.

Step 3:

Use the location that maximizes the acquisition function to decide where to sample next in the hyperparameter space. Acquisition functions manage the trade-off between exploiting a known maximum and exploring uncertain locations in the hyperparameter space. Different acquisition functions take different approaches to defining exploration and exploitation.

Step 4:

Get an observation of the black box function given the newly sampled hyperparameter points. Add observations to the set of observed data.

This process (Steps 2-4) repeats until a maximum number of iterations is reached. By iterating through the method described above, Bayesian optimization effectively searches the hyperparameter space while homing in on the global optimum [1, 2, 6].
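The four steps above can be sketched in a few dozen lines of plain Python. This is a toy illustration under stated assumptions, not SigOpt’s implementation: the objective is a made-up one-dimensional function, the surrogate is a zero-mean Gaussian process with a fixed squared-exponential kernel, the acquisition function is an upper confidence bound (one of several possible choices), and an evenly spread initial design stands in for a low-discrepancy sequence.

```python
import math


def rbf(a, b, length=0.15):
    """Squared-exponential kernel with unit variance (fixed, assumed length scale)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            # max(...) guards the diagonal against round-off.
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i, row in enumerate(L):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y


def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(L)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def fit_gp(xs, ys, noise=1e-4):
    """Step 2: fit the surrogate; returns predict(x) -> (mean, std)."""
    xs = list(xs)
    y_mean = sum(ys) / len(ys)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, [y - y_mean for y in ys]))

    def predict(x):
        k_star = [rbf(a, x) for a in xs]
        mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
        v = solve_lower(L, k_star)
        var = max(1.0 - sum(t * t for t in v), 0.0)  # rbf(x, x) == 1
        return mean, math.sqrt(var)

    return predict


def objective(x):
    """Hypothetical black-box score of one hyperparameter; maximum at x = 0.7."""
    return -(x - 0.7) ** 2


candidates = [i / 200 for i in range(201)]  # search space: a grid on [0, 1]

# Step 1: initial observations from an evenly spread design.
xs = [0.125, 0.375, 0.625, 0.875]
ys = [objective(x) for x in xs]

for _ in range(10):  # repeat Steps 2-4 for a fixed budget
    predict = fit_gp(xs, ys)  # Step 2: surrogate model

    def ucb(x, kappa=2.0):
        """Step 3: acquisition = predicted mean plus an exploration bonus."""
        mean, std = predict(x)
        return mean + kappa * std

    x_next = max((c for c in candidates if c not in xs), key=ucb)
    xs.append(x_next)             # Step 4: observe the black box
    ys.append(objective(x_next))  # and add the result to the data

best_x = xs[ys.index(max(ys))]
print("best x found:", best_x, "score:", max(ys))
```

The `kappa` parameter controls the trade-off described in Step 3: larger values favor uncertain regions (exploration), smaller values favor regions where the surrogate already predicts a high score (exploitation).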

There are a variety of attributes of Bayesian optimization that distinguish it from other methods. In particular, Bayesian optimization is the only method that

Adaptively and intelligently explores a hyperparameter space with optimal learning.
Explores and exploits to find the global optimum [6, 2].
Robustly handles noisy data.
Optimally searches non-continuous and irregular spaces.
Efficiently scales with hyperparameter domain.
Maximizes utilization of computing resources.

How does this work in practice? Consider a CNN architecture for sentiment analysis [4]. This experiment tunes hyperparameters for each component of the modeling process from preprocessing (embedding dimension) to architecture (filter sizes and feature maps) to training (learning rate and dropout), and compares the results of random search, grid search, and SigOpt’s proprietary set of Bayesian optimization algorithms. The results are as follows:

Experiment Type	Accuracy	Trials	Epochs	CPU Time	CPU Cost	GPU Time	GPU Cost
Default (No tuning)	75.70	1	50	2 hours	$1.82	0 hours	$0.04
Grid Search (SGD Only)	79.30	729	38394	64 days	$1401.38	32 hours	$27.58
Random Search (SGD Only)	79.94	2400	127092	214 days	$4638.86	106 hours	$91.29
Bayesian Optimization Search (SGD Only)	80.40	240	15803	27 days	$576.81	13 hours	$11.35
Grid Search (SGD + Architecture)	Not Feasible	59049	3109914	5255 days	$113511.86	107 days	$2233.95
Random Search (SGD + Architecture)	80.12	4000	208729	353 days	$7618.61	174 hours	$149.94
Bayesian Optimization Search (SGD + Architecture)	81.00	400	30060	51 days	$1097.19	25 hours	$21.59

From the above results, we see that tuning beats no tuning, with a maximum gain of 5.3 percentage points in accuracy (from 75.70 to 81.00). It is also clear that Bayesian optimization beats grid and random search in accuracy, cost, and computation time on both CPU and GPU. Bayesian optimization achieves a boost of roughly 0.5 to 1.1 percentage points in accuracy over grid and random search, at 12% to 14% of the cost of random search on CPU and GPU. Furthermore, Bayesian optimization arrives at its best configuration in a fraction of the time, allowing you to test out more models and architectures. This is only one data point, but we see this pattern of Bayesian optimization being more skillful and powerful replicated in scenarios such as regression models, reinforcement learning, unsupervised learning, and deep learning. Here is more information on common problems with hyperparameter optimization and how to evaluate hyperparameter optimization strategies.
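The headline comparisons can be recomputed directly from the table. The short script below only restates the table’s own numbers; the dictionary layout and variable names are ours.

```python
# (accuracy, CPU cost in $, GPU cost in $), copied from the results table.
default_accuracy = 75.70
results = {
    "SGD only": {
        "grid": (79.30, 1401.38, 27.58),
        "random": (79.94, 4638.86, 91.29),
        "bayes": (80.40, 576.81, 11.35),
    },
    "SGD + architecture": {
        "random": (80.12, 7618.61, 149.94),
        "bayes": (81.00, 1097.19, 21.59),
    },
}

best_accuracy = max(acc for runs in results.values() for acc, _, _ in runs.values())
max_gain = best_accuracy - default_accuracy
print(f"max gain over default: {max_gain:.2f} percentage points")

cost_ratios = {}
for setting, runs in results.items():
    b_acc, b_cpu, b_gpu = runs["bayes"]
    r_acc, r_cpu, r_gpu = runs["random"]
    cost_ratios[setting] = (b_cpu / r_cpu, b_gpu / r_gpu)
    print(f"{setting}: Bayesian vs. random search: "
          f"{b_acc - r_acc:+.2f} points accuracy, "
          f"{b_cpu / r_cpu:.1%} of the CPU cost, {b_gpu / r_gpu:.1%} of the GPU cost")
```

In both settings, Bayesian optimization reaches a higher accuracy than random search with one tenth of the trials (240 vs. 2400 and 400 vs. 4000), which is where most of the cost difference comes from.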

Now that I’ve convinced you to use Bayesian optimization for your hyperparameter optimization, here are some tips to make Bayesian optimization easy.

Choose the right metric or metrics to optimize
Engineer around parameterization for easy tuning
Consider tuning as an integral part of your ML workflow instead of a final step of your modeling

At SigOpt, we’ve created an easy-to-integrate, easy-to-use API that also supports advanced Bayesian optimization features such as constraints, conditional parameters, multi-metric optimization, and parallelism. Happy modeling!

References

[1] E. Brochu, V.M. Cora, N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.
[2] P. Frazier. Bayesian Optimization. Recent Advances in Optimization and Modeling of Contemporary Problems, October 2018.
[3] M. W. Hoffman, B. Shahriari. Modular mechanisms for Bayesian optimization. In NIPS Workshop on Bayesian Optimization, 2014.
[4] Y. Kim. Convolutional Neural Networks for Sentence Classification. In Proceedings of ACL 2014. June 2014.
[5] I. Dewancker, M. McCourt, S. Clark, P. Hayes, A. Johnson, G. Ke. A Stratified Analysis of Bayesian Optimization Methods. arXiv:1612.04451. December 2016.
[6] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, Jan 2016.

Thank you to Harvey Cheng, Alexandra Johnson, Mike McCourt, and Kevin Tee for their essential contributions.

Guest Post by Dr. Joonyoung Kim: Orchestrating the Next Generation of Machine Learning

Throughout the history of computing, software engineers have been kept from innovating to their full potential by the performance limits of compute hardware, despite remarkable growth in that performance. Bottlenecks include the slowing growth of CPU performance, memory bandwidth and capacity, I/O speed, and network throughput and latency, just to name a few.

Figure 1

From 1970 to 2008, computing performance improved 10,000-fold, the kind of sustained exponential growth widely associated with Moore’s Law (see Figure 1). From 2008 onward, however, the industry has seen just an 8x to 10x performance increase. This slowdown came just as rapidly developing machine learning and deep learning applications were coming on board with extensive data sets and computationally intensive models. Throughout the history of computing, hardware has never been fast enough for software engineers. Now, machine learning engineers are facing the same challenge. There have been times when some aspect of hardware performance (whether CPU clock frequency, DRAM bandwidth, or HDD and flash capacity) seemed sufficient for existing and known workloads, but the industry continues to innovate by coming up with different ways to solve problems that used to be considered intractable. And innovation means that the next greatest thing tends to overwhelm current hardware capacity.

Deep learning models, for example, require terabytes or even petabytes of data to train. Once deployed by hyper-scale data center operators, who develop and deploy more and more machine learning models every day, they must serve hundreds of thousands of requests per second from billions of users, with response times below tens of milliseconds.

There are other issues as well. With so many different (heterogeneous) processing units in our era of deep learning, the ubiquitous CPU is no longer dominant due to the emergence of other silicon-based processors and accelerators. The era of heterogeneous computing has been discussed for a long time in academia, but with deep learning going mainstream and growing rapidly, we are finally seeing these various chips (FPGA, ASIC, GPU, to name a few) fulfill the requirements of different workloads.

While integrated circuits have been in development since the 1950s, the Intel 4004, a 4-bit central processing unit (CPU), was released by Intel Corporation in 1971. It was the first commercially available microprocessor by Intel, and the first in a long line of Intel CPUs. The initial ASICs used gate array technology and were introduced in 1981 and 1982. FPGAs, led by Xilinx, followed in 1985. GPUs, specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images, were popularized by Nvidia in 1999 although the term had been in use since at least the 1980s.

All of these solutions have been carving out niches and driving innovation in various markets. That’s the good news. The bad news is that machine learning engineers, who are tasked with developing their models on top of frameworks such as Caffe, TensorFlow, and PyTorch, have neither the expertise nor the time to decide which hardware platform their workloads should run on.

This is why machine learning engineers need a layer of abstraction for different platforms and this is where the concept of orchestration comes into play. ML engineers need to be able to launch their jobs, with an orchestration layer handling the rest by picking the best hardware resource for the right job. Overall efficiency and utilization would be maximized and overall TCO (total cost of ownership) would be reduced.

With respect to efficiency, many applications have had to rely on traditional scaling which is structured simply and designed around maximum demand. This method is designed to have the application running all the time and at maximum capacity, which, as previously noted, is inefficient. Time-based scaling represents an improvement over traditional scaling because it is designed for maximum use during peak times of the day, but it still has a fairly high TCO.

Real-time scaling, however, is designed for minimum demand. Using this approach, different algorithms can be dynamically loaded and provisioned so that the system is off doing other tasks until needed. This keeps the resources tied up by any one application low and offers much higher efficiency and a lower TCO, because “spare” compute power can be used by other applications when it is not needed. Please see Figure 2.
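The difference between the three scaling approaches can be made concrete with a small sketch. The hourly demand profile, the peak window, and the idea of measuring cost in idle capacity-hours are all assumptions made for illustration, not figures from a real deployment.

```python
# Hypothetical demand over one day, in accelerator units per hour.
demand = [2] * 8 + [6, 8, 10, 10, 9, 8, 10, 9, 7, 5] + [3] * 6
assert len(demand) == 24

peak_hours = range(8, 18)  # assumed peak window for time-based scaling

# Traditional scaling: provision for maximum demand around the clock.
traditional = [max(demand)] * 24

# Time-based scaling: full capacity in the peak window, and only the
# highest off-peak demand outside of it.
off_peak_max = max(d for h, d in enumerate(demand) if h not in peak_hours)
time_based = [max(demand) if h in peak_hours else off_peak_max for h in range(24)]

# Real-time scaling: provision exactly what is needed each hour.
real_time = list(demand)


def idle_capacity_hours(capacity):
    """Provisioned capacity that sits unused over the day."""
    return sum(c - d for c, d in zip(capacity, demand))


idle = {name: idle_capacity_hours(capacity)
        for name, capacity in [("traditional", traditional),
                               ("time-based", time_based),
                               ("real-time", real_time)]}
for name, hours in idle.items():
    print(f"{name:12s} idle capacity-hours: {hours}")
```

For this profile, traditional scaling leaves 124 capacity-hours idle, time-based scaling leaves 26, and real-time scaling leaves none. Those idle hours are the “spare” compute that an orchestration layer could hand to other applications.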

Figure 2

Now I want to discuss the benefits of FPGA-based acceleration for machine learning. First, there are many problems that can benefit from FPGA acceleration as opposed to running on a CPU. These are generally data-crunching algorithms and include, but are not limited to, compression, encryption, convolution, and de-duplication. Many machine learning accelerators have been developed on FPGAs, and the innovation is just starting. Second, FPGAs offer a high degree of customization. For instance, unlike CPUs and GPUs, where the data-path width is fixed to a power of 2 (32 bits, 16 bits, 8 bits), a custom precision such as FP11 (Floating Point 11 bits) can easily be supported on an FPGA while consuming just enough resources to support that precision.

Third, with relatively new technology like partial reconfiguration, it is now possible to switch the contents of an FPGA from a host application in about a second. This allows the orchestration layer to dynamically provision the FPGA with different images for different problem domains depending on workload requirements. Contrast this with building a custom ASIC: even though an ASIC can deliver better performance for a specific problem, FPGA-based acceleration offers ultimate flexibility and does not require a multi-year investment in designing and manufacturing the ASIC, let alone a development cost that easily runs into the tens of millions.

Also, recent advances in design methodology such as OpenCL and HLS (high-level synthesis), coupled with traditional RTL design methodology, make it possible for us to quickly design and deploy the right acceleration solution on any given FPGA platform based on customer needs. This new methodology lets us explore the design space and arrive at an optimal architecture for the problem we need to solve, and it requires far fewer verification steps than an ASIC design flow. In a modern ASIC design flow, the majority of the time and effort is spent not on architecting and designing the chip but on verifying it, because the cost of a silicon escape and the resulting silicon re-spin is so onerous.

In closing, I hope this blog helps you understand the massive infrastructure challenge for machine learning acceleration. A well-designed machine learning acceleration system, including an orchestration layer and the underlying FPGA hardware and contents, can significantly increase the overall performance and capability available to machine learning engineers. This, in turn, will help them design and deploy next-generation machine learning algorithms to advance the state of machine learning capability.

Interview with Cassie Kozyrkov, Chief Decision Scientist at Google

One of our MLConf Program Committee members, Reshama Shaikh, recently interviewed Cassie Kozyrkov, Chief Decision Scientist at Google.

Questions & Answers

RS/ Q1) Tell us briefly about yourself and your work.

CK) I serve as Chief Decision Scientist for Google Cloud. For most of my career, I had job titles related to the data sciences: analyst, statistician, statistics lecturer, and data scientist. Data is beautiful, but I also believe it needs a reason. For me, that reason is decision-making. It’s through our actions that we affect the world around us. Although I studied statistics both in college and grad school, I worked to augment my training by earning degrees in the decision disciplines: economics, psychology, and neuroscience (and I’m always learning as much as I can about other perspectives on decision-making).
Early in my career, I started taking a decision-oriented approach to my work wherever possible and saw the incredible boost that incorporating wisdom from the social and managerial sciences can bring to applied data science. It turns out that there’s a lot of synergy we miss out on if we put up arbitrary walls around disciplines.

At Google, I had the benefit of an extremely collaborative environment that encouraged me to share what I knew so others could build on it. We didn’t have a name for it when we started, but eventually this approach to applied data science that incorporates the social and managerial sciences would become known to us at Google as Decision Intelligence. Now that I’m with Google Cloud, my mission is to help the whole world make their data even more useful and to share these ideas so everyone can benefit.

RS/ Q2) What are a couple of your favorite projects you have worked on at Google?

CK) Different projects are favorites in different ways. I loved doing statistical consulting within Google because it gave me a broad view of what the company is up to and it’s great fun to be involved in many different things at once. Of course, as a consultant you don’t get the same deep exposure as you do if you are the main specialist working for several years on a single project.

Another favorite was working to measure and eliminate duplicate business listings on Google Maps. This is a fun example because it’s easy to understand – if you use Google Maps, you’d probably not be a fan of seeing your search for “coffee near me” show you 20 slightly different results that all point to the same physical cafe, right? – but it’s pretty hard to define a good metric for scoring the efficacy of the machine learning system we built to remove potential duplicates. If you’d like to experience some applied mind-bending, take a moment to think about how you’d define a score for duplication in a database if the number of duplicates could be any integer. How would you design a sampling scheme to get at that metric?

I also have a special love for data splitting, power analyses, and experiment sizing. Always great fun, as are the data-mining, feature engineering, and applied machine learning projects. Wow, it’s so hard to choose! Data scientists are lucky to have so many fabulous avenues to explore!

But probably best of all was figuring out ways to make data science more useful. From designing data science training for thousands of Googlers to figuring out best practices in process and team composition (i.e. how to pass the baton between team members with diverse skills to make data as useful as possible), the challenge is extremely rewarding.

RS/ Q3) Do you primarily use R, Python or another coding language?

CK) My guess is that people tend to love what they grew up with in college/grad school, and for me that was R. It’s hard not to compare a new language with your comfy one; when I started learning Python, I remember how my jaw dropped at the relative fuss it takes to get the equivalent of R’s one-sided prop.test(). But Pythonistas have their legitimate reasons for singing Python’s praises, one being the scope of what you can do with the language and how efficiently you can do it, especially if your interests span beyond data science. I actually tried to force myself to go cold turkey with R for a year to give myself time to get really good at Python… and broke several months shy of the goal because R was a little too tempting for one of my projects. My verdict? Both languages are great and it’s impolite to bully one another about which is The Chosen Tongue. Let’s not do that! I just happen to be more comfy with one than the other as a result of my statistical upbringing.

RS/ Q4) We met a little over 3 years ago. It is only recently I have seen your blogs popping up in my LinkedIn feed, and I have enjoyed reading them. What motivated you to start blogging?

CK) A colleague of mine suggested I try my hand at it. People had been urging me to write a book for a while now and I figured a blog was a bit like a small version of a book. Testing at small scale before diving into the entire whopper of a thing seems like a good idea in general, especially to the statistically-minded. I had no idea there would be such an incredible response to it. I’m thrilled by the delight that the community has found in it. It’s great motivation to write more!

RS/ Q5) You are passionate about statistics and statistical education, having designed the most popular statistical training course within Google. As a statistician myself, I share your enthusiasm, and I appreciate especially your blogs on statistics. If you could give educators one piece of advice, what would it be?

CK) Be unboring! A principle that guides my teaching is this: once you have earned your students’ attention, then, and only then, have you earned the right to dive into detail. Otherwise the student is just going to forget them and it’s a waste of everybody’s time. So it’s up to educators to make it unboring. If students are only learning something because you told them to learn it, it’s a miserable experience all around. Life is short, why not have more fun with it? Your students will learn more and you might just find you’re enjoying the experience much more too!

RS/ Q6) Is the experience of teaching and learning statistics as daunting as is commonly believed?

CK) The experience of learning statistics shouldn’t be daunting, but perhaps the experience of teaching it should be… because our standards ought to be higher. Understanding the material is not enough for being a great teacher. To really connect with an audience, you’ve got to embody three roles in one: you’ve got to be an expert (to master the material that you are going to share), a teacher (to design the lesson), and an entertainer (to delight students with a performance that makes them eager for more).

RS/ Q7) Your explanations of statistical terms in your blog, Statistics for People in a Hurry, are outstanding. Is there a plan to promote this approach to mainstream statistics education?

CK) First off, I wouldn’t want to assume that my approach is the right approach. What it does embody, though, is a desire to modernize. I’ve never been a fan of teaching “the way it has always been done” – that seems like a flimsy reason for doing anything. But my approach is not the only one that tries to do better and be a little more brain-friendly than the old fashioned textbook-copyout approach. There are many others and I applaud them all for their efforts.
In order for good teaching to be mainstream, society has to value good teachers. If universities prefer to value research abilities above teaching abilities when hiring faculty, I’m sure you can guess the punchline without my saying it…

RS/ Q8) How do you think the American Statistical Association (ASA) could be more involved with the data science / Python / R communities?

CK) I’ve met quite a few statisticians who don’t think they belong in the machine learning discipline. I’m not sure why this happens, but perhaps the ASA might be able to make a difference by highlighting bridges between applied statistics and applied machine learning. After all, if you want to try to make a working solution, apply machine learning, but if you want to be sure your working solution actually works, you’re going to need statistics too. In the AI realm, machine learning and statistics are lost without one another. (Ah, and I see the analysts waving frantically at us. Yes, analysts, you’re vital as well, especially if we need to figure out how to zero in on a great solution quickly. Without you, we might end up chasing our tails for several lifetimes without getting anywhere…)
As for involvement in data science / R / Python communities, well, you know my views are even stronger – I’m all for lowering the arbitrary walls we’ve made around not only communities but entire disciplines so everyone can collaborate more. I’m all for more involvement and collaboration all around.

RS/ Q9) What are 3 pieces of advice to offer to aspiring data scientists? What are your favorite resources that you would recommend to them?

CK)

Useful is worth more than complicated.
Data quality is worth more than method quality.
Communication skills are worth more than yet another programming language.

Book and resource recommendations are always hard, since it depends on the style of data guru you want to be.
If you’re interested in analytics and you haven’t heard of Tufte’s books… happy belated birthday to you!
If you’re interested in statistics, please read books on the history of your discipline (The Lady Tasting Tea is a nice start) as well as on epistemology. I also suggest working through intro texts in both Frequentist and Bayesian statistics so you see both sides of the philosophical approach. The graduate schools I attended start their budding statisticians on Hoff’s A First Course in Bayesian Statistical Methods and Casella & Berger’s Statistical Inference. I can definitely get behind these great choices!

Now that I’ve been waxing lyrical about textbooks, let’s try a submission in the videos category! Machine learning engineers might enjoy the Google Cloud Platform channel on YouTube for great practical resources. If you’re looking for something non-Google, I’m a fan of Siraj Raval’s videos because he goes to great lengths to entertain while he enlightens. I’m also perennially impressed by DataCamp’s pedagogical approach to teaching data science.
If you’re inclined towards the more researchy side of AI, start with Goodfellow, Bengio, & Courville’s Deep Learning textbook. If you’re a statistician looking for an easy stats-to-ML bridge book, start with Introduction to Statistical Learning by Witten, James, Tibshirani, & Hastie.
Since data science is all three areas (machine learning, statistics, and analytics), if you’re aiming at becoming the real deal, consider browsing them all. Watching and reading is not enough, though. Make sure you get hands-on practice! If you don’t already have a dataset you want to dive into, I recommend trying out Kaggle.

Oh, and don’t forget to learn how the cognitive biases you carry around as a member of the human species trip up your ability to reason about data. Books like How We Know What Isn’t So, Predictably Irrational, and Thinking Fast & Slow will keep you from taking yourself more seriously than you ought to. If you’re keen to go deeper, Glimcher’s Neuroeconomics: Decision Making and the Brain or Camerer’s Behavioral Game Theory might just feel like the best thing since sliced bread.

RS/ aside) Casella & Berger’s Statistical Inference is one of my favorite books from grad school, and I often reference it.

RS/ Q10) There is a dire need for both experienced data scientists and data science leaders. What can the community do to help realize this outcome?

CK) Leadership in data science takes a double black belt. If you want this job, you need to understand the business (from impact to strategy to organizational politics to resource allocation to people management to all the rest of it) and understand data science as well. Then you have to put these two together. That’s harder than it sounds! The skills are very different and they are rarely developed in a single individual. We need people who are eager to do both and we need to appreciate that part of that is encouraging young minds to consider that joint educational path. As it stands, the emphasis in data science training is skewed towards the technical side. I’m saddened when I hear students pronouncing “soft skills” like it implies “soft-minded.” As a community, we should start consciously discouraging that messaging if we want to make progress. To invest in a new batch of data science leaders, we need to attract them to the idea of earning both black belts. This means we need an alternative definition of what it means to be impressive in data science. Let’s convince the new crop that it’s not just about how deep their measure theory goes.

RS/ Q11) Here are a few of the “hot topics” in data science. What are your high-level thoughts, in 1-2 sentences, on each topic?

CK)

Algorithm reliability

“The world represented by your training data is the only world you can expect to succeed in.”
“Data are not magic. To learn from examples, the examples have to be good.”

Reproducibility

“If you keep putting innocent hypotheses on trial, eventually one’s going to look guilty. That’s not science, it’s the search for a slice of toast that reminds you of the Mona Lisa. If you only publish the slide with the face and stay silent about the rest of your attempts, don’t be surprised when others fail to reproduce your findings.”

Diversity

“In applied machine learning, it’s a terrible idea to get multiple carbon copies of the same worker. You need a diversity of skills and perspectives to successfully carry out an applied machine learning project. You need different kinds of people with different skills.”
“It is really hard to think of everything alone, we’re not that creative. The more identical your teammate is to you, the less ground your ideas will cover together. Diversity helps creativity flourish.”

Open source

“The community as a whole is so much stronger and more creative than any one of us on our own.”
“The way to get the best possible tool/algorithm is to share your prototype with the community and see how they improve on it.”

Opportunities for entry level data scientists

“Focus less on the title and more on the value that you can add. The world is generating data like never before, so it’s time for all of us to work to make it more useful.”
“Outside research/academic settings, don’t be too focused on complex methods. If you can make your company several million dollars with one histogram, you’re a winner! Do it and be proud of yourself. Save complex solutions for when simple doesn’t work.”

RS/ Q12) When I googled you in preparation for this interview, I saw that you have been speaking and traveling extensively as part of your role as a thought leader in the field. The schedule seems grueling. What advice can you share in effectively managing this schedule?

CK) My secret is that I love traveling. Whenever I board a plane, I feel like I’m getting away with something. I’d better not tell anyone – when the Fun Police find out how much fun I’m having, they might try to stop me! I get a lot of energy from being in a new place where the colors are different, where the air smells different, and, for bonus points, where I can’t understand a thing anyone is saying. That energy helps me keep up an intense schedule.

Being on the road constantly is still exhausting, even if you love it as much as I do. The real secret is knowing your limits and having clear boundaries and priorities. It’s very easy to want to say yes to everything (like answering every single email and message), but it’s important to save your strength for what matters. Focus on the highest priority tasks and forgive yourself for dropping some of the others.

Schedule aside, here are two life-changing words of advice for frequent travelers: compression socks!

RS) Thank you for participating in this interview.

About Cassie Kozyrkov

As Chief Decision Scientist at Google Cloud, Cassie Kozyrkov advises leadership teams on decision process, AI strategy, and building data-driven organizations. She works to democratize statistical thinking and machine learning so that everyone – Google, its customers, the world! – can harness the beauty and power of data. She is the force behind bringing the practice of Decision Intelligence to Google and she has personally trained over 15,000 Googlers in machine learning, statistics, and data-driven decision-making. Before her current role, she served in Google’s Office of the CTO as Chief Data Scientist. Prior to joining Google, Cassie worked as a data scientist and consultant. She holds degrees in mathematical statistics, economics, psychology, and cognitive neuroscience. When she is not working, you are most likely to find Cassie at the theater, in an art museum, exploring the world, or curled up with a good novel.

LinkedIn: Cassie Kozyrkov
Twitter: @quaesita
Blog: @kozyrkov
GitHub: @kozyrkov

Guest Blog by Rommel Garcia, Director, Solution Engineering, Southeast, Kinetica

Why GPU-Optimized Databases Are Such a Game Changer for Machine Learning and AI

GPUs are well-suited for the types of vector and matrix operations found in machine learning and deep learning, and they can dramatically reduce the amount of time it takes to train a model.
A distributed, in-memory database accelerated by GPUs leverages the parallel processing power of GPUs to converge ML, Deep Learning, AI, and classic OLAP-type workloads on one powerful platform. In addition, databases like Kinetica with its open framework make it possible for machine learning and artificial intelligence libraries such as TensorFlow, BIDMach, Caffe, Torch and others to work directly on data.

Figure 1. GPU Database For OLAP, Geospatial and Data Science

GPUs typically accelerate simulations by about 100x while using 1/10th of the hardware of traditional CPU-based databases.
Here are several reasons why you should choose a GPU-accelerated database for your ML and AI workloads:
1) One single platform. With a GPU-accelerated database, you can use both simple and complex machine learning and deep learning algorithms on one platform. It simplifies your data pipelines by fusing the speed and batch processing layers. Parallel ingest and reduced reliance on indexes mean data is available for query the moment it arrives. Additionally, with UDFs (user-defined functions), custom code can be run directly on the data within the database, so data exploration and analytics can all be performed on a single compute-heavy platform. This means that you can perform complex queries on billions of rows of data in well under a second without moving data between systems; it can all be done on the same platform that you’re using for traditional analytics and data exploration. Why custom code? Canned algorithms from existing tools don’t cut it anymore, and organizations are investing heavily in teams of data scientists to improve their success rate in the business.
2) Say goodbye to traditional ML modeling. By using a GPU-accelerated database, you don’t have to manage flat files in your file system, and keep in mind that these flat files offer no security. Multiple replicas of data eliminate single points of failure and provide an eventually consistent data store with recovery capability. You also don’t need to write the intermediate file formats that are frequently seen with Monte Carlo simulations; a GPU database can automate the process.
3) Built-in geospatial capability. If your machine learning/deep learning project requires location-based analytics, a GPU-accelerated database can provide that capability. GPU databases come with a native geospatial and visualization pipeline for rendering large volumes of data over maps. The built-in geometry engine combines native geospatial functionality, filtering, and GPU acceleration to deliver a database that can work with massive geospatial datasets, all within a single system. Typical ML use cases for spatial data include predicting optimal hotel sites, predicting where landslides will occur, assessing flood susceptibility, and understanding disease outbreaks.
4) Open framework = multiple options. With a UDF open framework, you can easily incorporate other machine learning platforms. Some GPU databases include a native REST API and associated connectors that provide data scientists with robust tools for interacting with Spark ML, Python, TensorFlow, Node, Java, C++, and more.

Figure 2. User Defined Function (UDF) Framework – Bring Your Own Code

5) Significant cost savings. Since a GPU-accelerated database can provide everything on one platform, you will enjoy significant cost savings. There’s no need to purchase a separate database or server as a machine learning/deep learning/geospatial platform—you simply pay for one license to get access to the entire platform.
6) Increased performance/collaboration. Since all data have been cleansed and highly structured in a table, data scientists and analysts can collaborate effectively, simplifying the process of data discovery, data processing, and predictive modeling.
7) Simulations are screaming fast. GPU databases are highly distributed and very different from the old legacy architectures. They are built from the ground up to use the thousands of cores in GPUs as their compute framework.
8) Focus on data value, not ETL. With a GPU database, data scientists don’t need to perform ETL on the data once it’s in the database. This frees up precious time so that data scientists can focus on finding value and insights in the data. In-database AI processing can help data scientists discover patterns and uncover hidden insights in sub-seconds. You’ll be able to run customized, GPU-accelerated algorithms to achieve your objective, whether you’re detecting fraud, providing online recommendation offers, or analyzing data streams.
Code Examples:
Here’s an example of how to construct UDF (user-defined function) code, including a TensorFlow example. After registration, the following code will be running in parallel using multiple threads. It is very easy to get a database table into a Python dataframe or write a Python dataframe to tables.
Example using CPUs:
# The following code is a distributed UDF (user-defined function) Python source code.
# This code will be running in parallel with multiple threads.

# Import libraries.
from kinetica_proc import ProcData
import pandas as pd
import numpy as np

# Get a reference to the input and output tables.
proc_data = ProcData()
# There is only one input table, lci_input, defined in lci_create_proc.py.
input_data = proc_data.input_data[0]

# Create a dataframe from the table.
mydf = pd.DataFrame({
    'serialNum': input_data["serialNum"],
    'dist_type': input_data["dist_type"],
    'dist_para1': input_data["dist_para1"],
    'dist_para2': input_data["dist_para2"],
    'dist_para3': input_data["dist_para3"],
    'dist_para4': input_data["dist_para4"],
    'dist_para5': input_data["dist_para5"]
})

# Take the identifying values from the first row, if there is one.
if len(mydf) > 0:
    serialNum = mydf.serialNum[0]
    dist_type = mydf.dist_type[0]
    dist_para1 = mydf.dist_para1[0]
    dist_para2 = mydf.dist_para2[0]
    dist_para3 = mydf.dist_para3[0]
    dist_para4 = mydf.dist_para4[0]
    dist_para5 = mydf.dist_para5[0]

# For each distribution, generate simulated data (10000 draws per 'normal' row).
simuData = []
for index, row in mydf.iterrows():
    if row['dist_type'] == 'normal':
        simuData = np.append(simuData, np.random.normal(row['dist_para1'], row['dist_para2'], 10000))

# Logic for the output table, python_output_01.
output_python_01_table = proc_data.output_data[0]
row_total = len(simuData)
output_python_01_table.size = row_total

# Map the output columns to lists.
col1 = output_python_01_table["serialNum"]
col2 = output_python_01_table["dist_type"]
col3 = output_python_01_table["dist_para1"]
col4 = output_python_01_table["dist_para2"]
col5 = output_python_01_table["dist_para3"]
col6 = output_python_01_table["dist_para4"]
col7 = output_python_01_table["dist_para5"]
col8 = output_python_01_table["simu_result"]

# Give values to the lists, one entry per simulated value.
for i in range(row_total):
    col1[i] = serialNum
    col2.append(dist_type)
    col3[i] = dist_para1
    col4[i] = dist_para2
    col5[i] = dist_para3
    col6[i] = dist_para4
    col7[i] = dist_para5
    col8[i] = simuData[i]

# Call the API to complete the UDF. Output data will be stored in tables.
proc_data.complete()

Example using GPUs:
# The following code is a distributed UDF (user-defined function) Python source code.
# This code will be running in parallel with multiple threads.
# This code will use the GPU to accelerate the calculation (TensorFlow 1.x API).

# Import libraries.
from kinetica_proc import ProcData
import pandas as pd
import numpy as np
import os
import tensorflow as tf

# TensorFlow setup.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
config = tf.ConfigProto()
config.log_device_placement = True
config.gpu_options.allow_growth = True

# Get a reference to the input and output tables.
proc_data = ProcData()
# Assign which GPU will be used for each thread.
config.gpu_options.visible_device_list = str(int(proc_data.request_info["rank_number"]) - 1)
# There is only one input table, lci_input, defined in lci_create_proc.py.
input_data = proc_data.input_data[0]

# Create a dataframe from the input table.
mydf = pd.DataFrame({
    'serialNum': input_data["serialNum"],
    'dist_type': input_data["dist_type"],
    'dist_para1': input_data["dist_para1"],
    'dist_para2': input_data["dist_para2"],
    'dist_para3': input_data["dist_para3"],
    'dist_para4': input_data["dist_para4"],
    'dist_para5': input_data["dist_para5"]
})

# Defaults, so that a thread with no input rows still writes an empty output.
simuData = []
serialNum = []
# Number of draws; not defined in the original listing, assumed here to
# match the 10000 draws used in the CPU example.
num_simu = 10000

if len(mydf) > 0:
    serialNum = mydf.serialNum[0]
    dist_type = mydf.dist_type[0]
    dist_para1 = mydf.dist_para1[0]
    dist_para2 = mydf.dist_para2[0]
    dist_para3 = mydf.dist_para3[0]
    dist_para4 = mydf.dist_para4[0]
    dist_para5 = mydf.dist_para5[0]

    # Generate simulated data on the GPU from the first row's parameters.
    init = tf.global_variables_initializer()
    with tf.Session(config=config) as sess:
        sess.run(init)
        # with tf.device('/gpu:7'):
        # Flatten the [num_simu, 1] result so each output row receives one value.
        simuData = sess.run(tf.random_normal([num_simu, 1], mean=dist_para1, stddev=dist_para2, dtype=tf.float32)).ravel()

    # Repeat the serial number once per simulated value.
    serialNum = [serialNum for _ in range(len(simuData))]

cnt = len(simuData)

# Logic for the output table, python_output_01.
output_python_01_table = proc_data.output_data[0]
output_python_01_table.size = cnt

# Map the output columns to lists.
col1 = output_python_01_table["serialNum"]
col8 = output_python_01_table["simu_result"]
col1.extend(serialNum)
col8.extend(simuData)

# Call the API to complete the UDF. Output data will be stored in tables.
proc_data.complete()
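
To sanity-check the simulation logic before registering a UDF, the per-row loop of the CPU example can be reproduced on plain Python data, with no database involved. The sketch below is only an illustration under stated assumptions: simulate_rows and the dictionary rows are hypothetical names, and the standard library's random.gauss stands in for np.random.normal.

```python
import random


def simulate_rows(rows, draws_per_row=10000, seed=None):
    """Draw samples for every row whose dist_type is 'normal'.

    rows is a list of dicts with the same column names as the input
    table in the listings above (a hypothetical stand-in for the table).
    """
    rng = random.Random(seed)
    results = []
    for row in rows:
        if row["dist_type"] == "normal":
            mean, stddev = row["dist_para1"], row["dist_para2"]
            results.extend(rng.gauss(mean, stddev) for _ in range(draws_per_row))
    return results


rows = [
    {"serialNum": 1, "dist_type": "normal", "dist_para1": 5.0, "dist_para2": 2.0},
    # Rows with any other distribution type are skipped, as in the CPU example.
    {"serialNum": 2, "dist_type": "uniform", "dist_para1": 0.0, "dist_para2": 1.0},
]
samples = simulate_rows(rows, seed=42)
sample_mean = sum(samples) / len(samples)
print(len(samples), round(sample_mean, 2))
```

Only the first row contributes draws here, so the sample mean should land close to 5.0; checking that kind of property locally is cheaper than debugging it inside a distributed run.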

Summary
Organizations are infusing machine learning and AI into their applications, and GPUs have opened the door to new ML and AI use cases. With a GPU-accelerated database, you can explore data, formulate hypotheses, and use machine learning to find models that can be used to accomplish a wide variety of objectives, from providing personalized medicine, to financial trading, to detecting security intrusions, and much more.

Rommel Garcia, Director, Solution Engineering, Southeast, Kinetica
Rommel is Director of Solution Engineering, Southeast, based in Atlanta. Rommel came from Hortonworks, where he worked for four years as a Sr. Solutions Engineer and Security SME specializing in Big Data platform architecture, security & governance, OLAP/OLTP/search and operations. Prior to Hortonworks, he worked at Liaison Technologies doing B2B, C2B, B2C, B2G, G2B data integration, master data management, tokenization security solutions, and data mapping/transformation. Java is his primary programming language, and he has also used Perl, JavaScript, and other scripting languages.
Rommel earned his master’s degree in Computer Science and his bachelor’s degree in Electronics Engineering.

Guest Blog by Keshav Dhandhania, Co-Founder, Compose Labs

What Data Scientists Should Know About the State of Location Data

Introduction

The increasing availability of continuous, precise location data will be a game changer in data science, giving rise to new analysis tools and bringing about a wave of collaboration between data scientists and product teams. It will give data scientists yet another way to influence product decisions and lead to better experiences for users. Along with that, it will add to their responsibility of helping users without hurting their privacy.
You may remember being prompted by the Uber app, last year, to allow it to track your location in the background. It promised to only collect data for a small window before your ride, but you couldn’t be totally sure. The only reason cited for the request was a marginal improvement in the Uber pickup process. It sounded like a raw deal for users, but apparently, Uber had done its research and knew that many would tap accept.
In being over-eager for background location data, Uber is not alone. The Facebook app requests it to show you nearby friends. Google Maps promises the forgetful to remember where they parked their car. Even dozens of retailers, such as Urban Outfitters and Bloomingdale’s, ask for the permission in exchange for pushing coupons for nearby stores. And there’s evidence that people are willing to trade privacy for convenience. Perhaps users aren’t afraid to let apps see where they go because of so-called security fatigue–it’s just too much to think about; maybe they trust certain brands to use their data judiciously; or they think they have nothing to hide.
Many of the apps asking for location data have no malintent, simply using it to enable features that can’t otherwise exist. Starbucks notifies you when you’re near one of their stores; credit card companies offer fraud protection by making sure you’re present where your credit card is being used. Perhaps most of the time, your data is used for nothing but the advertised features and never leaves your phone. Or the collection is transparently communicated and you’re given a chance to delete it, as is the case with the timeline data Google collects. There are instances, however, where historical collection is thinly veiled under the guise of real-time, functional features.
As it becomes easier to put this data to use for better customer engagement, personalization, customer segmentation, ad-targeting, per-user dynamic pricing, and more, we suspect this practice will become more and more common. The trend will create an explosion of this new type of data, give data scientists more powerful tools, and along with it, more responsibility.

Google allows users to view their timeline in detail and delete data retroactively

iOS stores frequently visited locations on device and uses them for ad-targeting and other purposes.

It’s important to think about the opportunities and dangers this presents. If companies, large and small, are asking for location data, we should consider whether it can reveal sensitive information about us that we otherwise wouldn’t consent to share. What are we actually sharing and what are we getting out of it? Is it a stretch to worry about mere coordinates data being used to compromise our privacy? On the other hand, as a company, how can we use location data to build good enough features to make it a fair exchange? Finally, what are big changes coming to the space in the near future? These are questions we’d like to answer here.

Signals Hidden in the Noise

Location data in its simplest form is a continuous stream of latitude and longitude pairs, sent from users’ devices at regular intervals. It’s determined by a phone’s GPS sensor under open sky, or otherwise collected by triangulation, through fingerprinting a set of nearby cell towers, wifi access points, or Bluetooth-emitting devices that are known to correspond to fixed coordinates. It has properties that make it amenable to analysis: It’s structured, relatively homogeneous across users, and usually reliable and not misrepresented like volunteered data can be. Of course, that doesn’t mean that extracting meaningful, human-interpretable signals from the data is easy work. Years ago, we had the opportunity of working in data science teams with access to geotagged data from millions of users, but this data was virtually unusable for many of the use-cases we dreamt up. The cost of making sense of the data was simply too high. Raw data had to be massaged to remove outliers–there can be lots of these, especially when indoors–and the data had to then be accurately clustered, and subsequently turned into actual addresses using a process called reverse geocoding. That way, a series of 50 coordinates can be translated to, say, the specific path between a coffee shop and a restaurant and the duration of the dwell time at each location. On top of the difficulties in analysis, sampling the data in the first place is hard: using the GPS chip can drain a phone’s battery–a pet peeve that can lead to your app being deleted.
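
To make the cleaning steps described above concrete, here is a minimal Python sketch of two of them: dropping fixes that imply an impossible speed, and grouping the remaining fixes into stays with a dwell time. It is not the pipeline we worked with; the function names, the 70 m/s speed cap, the 50 m radius and the 5-minute minimum dwell are all assumptions chosen for illustration, and reverse geocoding is left out.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def drop_speed_outliers(points, max_speed_mps=70.0):
    """Drop fixes that imply an implausible speed from the last kept fix.

    points is a time-sorted list of (timestamp_seconds, lat, lon).
    The first fix is assumed to be trustworthy.
    """
    kept = []
    for t, lat, lon in points:
        if kept:
            t0, lat0, lon0 = kept[-1]
            dt = t - t0
            if dt <= 0 or haversine_m(lat0, lon0, lat, lon) / dt > max_speed_mps:
                continue
        kept.append((t, lat, lon))
    return kept


def stay_points(points, radius_m=50.0, min_dwell_s=300.0):
    """Group consecutive fixes that remain within radius_m of an anchor fix.

    Returns a list of (mean_lat, mean_lon, arrival_timestamp, dwell_seconds).
    """
    stays = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        while j < n and haversine_m(points[i][1], points[i][2],
                                    points[j][1], points[j][2]) <= radius_m:
            j += 1
        dwell = points[j - 1][0] - points[i][0]
        if dwell >= min_dwell_s:
            cluster = points[i:j]
            lat = sum(p[1] for p in cluster) / len(cluster)
            lon = sum(p[2] for p in cluster) / len(cluster)
            stays.append((lat, lon, points[i][0], dwell))
            i = j
        else:
            i += 1
    return stays


# A toy track: a few minutes at one spot, one wild reading, then a second spot.
track = [
    (0, 33.7490, -84.3880),
    (120, 33.7491, -84.3881),
    (240, 33.7490, -84.3879),
    (300, 40.0000, -70.0000),  # hundreds of km away one minute later
    (360, 33.7491, -84.3880),
    (900, 33.7600, -84.3880),
    (1020, 33.7601, -84.3881),
    (1320, 33.7600, -84.3880),
]
clean = drop_speed_outliers(track)
stays = stay_points(clean)
for lat, lon, arrived, dwell in stays:
    print(f"stay near ({lat:.4f}, {lon:.4f}) from t={arrived}s for {dwell}s")
```

On this toy track the wild reading is discarded and two stays remain. A real system would more likely use a density-based clustering method and tune the thresholds per device and per sampling rate.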

APIs to the Rescue

Given the daunting challenges of making sense of location data, it may seem as though we shouldn’t be worried or excited about its power in the right or wrong hands. However, new tools are beginning to change that. Throughout the past decade, talented teams have built solutions that power their own location products and are also being made available to other companies large and small.
Let’s go back in time for a minute. You may remember Google’s Latitude app, one of the first location tracking apps with a large audience. When Latitude first launched in 2009, it used a significant amount of battery power for its data collection, and suffered from terribly inaccurate readings, which it failed to correct. It meant the app would, at times, see the user jump to the middle of a body of water and back in a matter of seconds. Through the years, however, these issues were ironed out by improvements in hardware, mobile operating systems, and algorithms to better analyze the data. GPS readings and wifi triangulation became more accurate. To save on power, operating systems started caching location readings and sharing them amongst apps, and apps learned to sample location data only when necessary, for instance remaining idle if the user were stationary.
These improvements have allowed companies to utilize location data to build reliable products and roll them out to a vast number of users. For instance, Google Maps now uses this data to offer real-time and historical foot traffic data for millions of businesses. Foursquare notifies you to automatically check in after detecting the venue you’re visiting. Now, luckily for data scientists like us, these improvements are being rolled out as APIs for outside developers, as well, taking the burden of reinventing the wheel off our shoulders.

Google Maps, for instance, has for a while provided access to its Geocoding and Places APIs, which allow developers to easily turn coordinates into addresses and business names, and vice versa. Its Roads API enables developers to turn a series of coordinates into a path by snapping them onto existing roads. The APIs are used in many popular apps.
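
As a rough illustration of how little code the coordinates-to-address step now takes, the sketch below only builds a reverse-geocoding request URL; it does not send it. The endpoint and the latlng and key parameters reflect our understanding of the public Geocoding API and should be checked against the current documentation. The helper name reverse_geocode_url and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as we understand it from the public documentation (an assumption
# to verify before use).
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def reverse_geocode_url(lat, lon, api_key):
    """Build a reverse-geocoding request URL for one coordinate pair."""
    query = urlencode({"latlng": f"{lat},{lon}", "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


# YOUR_API_KEY is a placeholder, not a working credential.
url = reverse_geocode_url(40.714224, -73.961452, "YOUR_API_KEY")
print(url)
```

Fetching that URL with any HTTP client should return JSON describing the address; in practice requests are billed and rate-limited, so clustered stay locations are usually geocoded rather than every raw fix.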

Google’s Roads API determines a user’s driving path from a series of coordinates

A much more interesting development is the launch of Foursquare’s Pilgrim SDK last year as part of Foursquare’s pivot towards B2B products. The mobile SDK is built on top of Foursquare’s years of experience in building its own apps that rely heavily on location understanding. When integrated into a 3rd party app, the SDK gives it a chance to react each time the app user enters or exits their home or workplace–which are inferred by the SDK–or visits any of the 105M+ public places in Foursquare’s database.

3rd party apps using the Foursquare Pilgrim SDK can use visit data to segment users. Image from Foursquare.com

New GPS Chip Redefining Accuracy

Despite these remarkable tools, there are still significant shortcomings in current hardware that make it hard to pinpoint users reliably, especially in dense urban settings and indoors. In these situations, which of course make up a large portion of a user’s day, even Foursquare’s and Google’s data wrangling often doesn’t help. Apps may notify us to rate our experience at a Nike store while we’re actually grabbing ice cream next door. They most likely won’t be able to tell if we’re at a hotel bar or just checking in at the lobby nearby.
But that’s also about to change. In the coming years, we should brace ourselves for being pushed into an exciting future where those errors will virtually disappear and our apps will become extremely context-aware.
Let’s start with developments in satellite geolocation. For years, improvements in smartphone geolocation accuracy have been elusive, especially in urban settings where it’s possible for satellite signals to reflect off tall buildings and confuse our phones. Custom receivers that use ground base-stations for geolocation have been able to achieve very high accuracy numbers, but according to the U.S. government, the accuracy of GPS with current smartphone chips has remained at around 5 meters under open sky. However, in 2017, a new chip was announced by Broadcom that does much better. It uses a new radio frequency named L5, in addition to the L1 frequency that our phones currently use, allowing it to achieve an accuracy of 30cm–a more than 15x improvement. The chip is reported to have been included in the designs of several phones slated for release in 2018.

The additional signal also allows for significantly better accuracy in cities’ concrete canyons, due to its ability to distinguish between signals arriving directly from satellites and those bounced off of buildings around the device. And this is only the latest development in a global push for centimeter-level navigation accuracy. You can read more about the technical details of how the chip works and why using the L5 frequency helps here.

Has the Time for Beacons Finally Come?

Of course, improvements in satellite-aided navigation can only help when the user’s phone can see enough satellites in the sky, not when the device is carried indoors. The most common way for smartphones to reliably locate themselves indoors is through sensing Bluetooth Low Energy signals emitted by Bluetooth beacons–inexpensive chips installed in public spaces. This is typically achieved using the so-called iBeacon protocol. The protocol was introduced by Apple in 2013 and allows devices to constantly listen for beacons while minimally impacting battery life. Each beacon repeatedly advertises a unique iBeacon identifier; the device uses these identifiers to detect the beacons in its vicinity and, by knowing their locations, to approximately locate itself.
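
One simple way a device could turn beacon sightings into an approximate position is a signal-strength-weighted centroid. The Python sketch below is an illustration under assumptions, not what iBeacon or any vendor SDK actually does: it converts each received signal strength (RSSI) into a rough distance with the standard log-distance path-loss model, then averages the known beacon positions with weights of 1/distance. The beacon names, the -59 dBm reference power at 1 m and the path-loss exponent of 2.0 are hypothetical values.

```python
def rssi_to_distance_m(rssi_dbm, measured_power_dbm=-59.0, path_loss_exponent=2.0):
    """Rough distance from RSSI using the log-distance path-loss model.

    measured_power_dbm is the expected RSSI at 1 m (an assumed calibration).
    """
    return 10 ** ((measured_power_dbm - rssi_dbm) / (10.0 * path_loss_exponent))


def weighted_centroid(sightings, beacon_positions):
    """Estimate (x, y) in meters from beacon sightings.

    sightings maps beacon id -> RSSI in dBm.
    beacon_positions maps beacon id -> (x_m, y_m) on a floor plan.
    Returns None when no sighted beacon has a known position.
    """
    total_weight = 0.0
    x = y = 0.0
    for beacon_id, rssi in sightings.items():
        if beacon_id not in beacon_positions:
            continue
        distance = rssi_to_distance_m(rssi)
        weight = 1.0 / max(distance, 0.1)  # closer beacons count for more
        bx, by = beacon_positions[beacon_id]
        x += weight * bx
        y += weight * by
        total_weight += weight
    if total_weight == 0.0:
        return None
    return (x / total_weight, y / total_weight)


# Hypothetical floor plan with three beacons, coordinates in meters.
beacons = {"entrance": (0.0, 0.0), "register": (10.0, 0.0), "shelf_a": (0.0, 10.0)}
# The entrance beacon is heard loudly, the other two only faintly.
seen = {"entrance": -59.0, "register": -79.0, "shelf_a": -79.0}
estimate = weighted_centroid(seen, beacons)
print(estimate)
```

The estimate lands close to the entrance beacon, as expected. Indoor RSSI is noisy, so a real deployment would smooth readings over time and calibrate the reference power per beacon.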
In 2013, the launch of iBeacon caused a wave of excitement and speculation about the possibilities of indoor location. Many retailers, airports, and museums experimented with the technology, typically using it to push simple coupons or notifications to their users. Despite this initial traction, beacons never really took off as anticipated. Multiple shortcomings led to their limited adoption. Firstly, earlier beacon hardware had to be maintained after installation: their firmware had to be updated, their batteries had to be monitored and replaced, and since they didn’t connect to wifi, all of this had to be done by a person approaching each beacon, phone in hand–an unreasonably costly process. They also simply didn’t offer many of the features that brought businesses real value: They could only be used if a business had convinced their customers to install their native mobile app, background tracking was limited (only a limited number of beacons could wake an application), and they had to densely cover a building to be used for navigation, which meant footing a large bill to install and maintain them.
Many of these problems have by now been solved. Several large beacon manufacturers, most notably the Polish company Estimote, have created beacons that are reasonably priced, yet reliable, and allow for remote monitoring and firmware updates through a Bluetooth mesh network. They also allow for much longer replacement cycles, and precise location tracking using far fewer beacons. Their offerings drastically reduce maintenance costs and offer much-anticipated features like background tracking and automatic mapping of indoor spaces.
In the past few years, there have also been signs of a decisive push for BLE beacons by Google, which has introduced its own Physical Web and Eddystone projects, and is leading the way in developing Web Bluetooth, all with the aim of allowing interaction with BLE devices without the burden of downloading an app. The new capabilities and lower costs may finally tip the balance in favor of beacons and we may see them enter businesses like hotels, airports, museums, and especially retail stores. The new capabilities might, for instance, help retailers in their fight to stay relevant: They could enable retailers to provide better recommendations to customers by analyzing browsing patterns in stores. Retailers could also use them to optimize product shelf placement and staff management, or to provide wayfinding to items in their stores.

Estimote’s iOS and Android SDKs allow for precise background monitoring for the first time.
Image from estimote.com

Show the App Where You Are

Only time will tell if iBeacon will finally deliver on its initial promise. But now, there’s an even more interesting and promising technology on the block: indoor location powered by visual signals. For a moment, let’s think about how we humans figure out where we are. Suppose you worked in a large office building for a long while. Now, imagine you were blindfolded and brought to a random location in your workplace. Once the blindfold came off and you looked around, you’d probably be able to tell where you are in a matter of seconds. Enter VPS. At its developer conference last May, Google revealed its Visual Positioning Service, a successor to its Tango 3D project, which aims to enable devices to precisely locate themselves indoors down to a few centimeters. The tech was described as being “kind of like GPS”, but using “distinct visual features in the environment” instead of satellites. It’s likely that this type of location sensing will improve rapidly, because it can be used not only for smartphones, but also by autonomous robots or drones that want to know their exact position within the confines of concrete walls.

Google tweeted about its new VPS project during Google I/O

Regardless of which technology ends up being the most useful and widely adopted, it’s clear that data scientists are seeing an explosion in the availability and usefulness of location data, and as always, it’s up to us to take advantage of it to benefit end-users while doing all we can to shield them from the danger of sacrificing their privacy.

Keshav Dhandhania is a cofounder of Compose Labs (commonlounge.com) and has spoken on GANs at international conferences including DataSciCon.Tech, Atlanta and DataHack Summit, Bengaluru, India. He earned his master’s in Artificial Intelligence at MIT, where his research focused on natural language processing, and before that, computer vision and recommendation systems.

Machine Learning with a Twist: Detecting Structural Differences in Complex Data