One of our MLconf Program Committee Members, Reshama Shaikh, recently interviewed Narine Kokhlikyan, Machine Learning Engineer at Slice Technologies Inc.
RS/ Q1) Tell us briefly about yourself and your work.
NK) I work on exciting machine learning projects at Rakuten Intelligence. The projects range from classifying billions of electronic receipts and extracting ecommerce insights to detecting anomalies in time series data. One of those projects aims to detect anomalies in order counts and price values per merchant. Early identification of these anomalies is critical, especially when they relate to data quality and completeness.
As I started working on this project, I noticed that many companies in the industry are dealing with similar problems, such as detecting anomalies in the number of requests to a server, trips taken, or new users joining. I spent some time analysing existing solutions and their limitations, and came up with the two approaches that best fit our use cases: generalized additive regression models and recurrent neural networks.
RS/ Q2) Time series is a type of data that seems to lack a standard algorithmic approach. What makes time series so challenging, and what advice would you give to data scientists analyzing this type of data?
NK) Data sparsity, noise, and diverse, changing temporal patterns can make time series analysis very challenging. Since time series data is sequential, missing data points can change the shape of the patterns and make reconstruction very difficult. Temporal patterns can also change over time, which makes detecting those change points essential. And lastly, distinguishing noise from the real signal can be very hard, especially for small datasets. Although there are different algorithms that help to overcome those problems, they are applicable only to a limited set of use cases.
I would advise data scientists who analyse this type of data to spend some time understanding the dataset and identifying key components of the signal, such as periodicity, cyclicity, trend and extreme events. Correlograms and signal decomposition techniques such as the Holt-Winters method are good starting points. It is better to start with simple approaches such as regression analysis and later move towards deep learning and more complex models. Specifically for deep learning, there are many different ways of representing the features and the architecture. Depending on the optimization objective for a specific use case, either approach can be preferred.
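As a minimal sketch of the correlogram idea mentioned above, here is a plain autocorrelation function used to spot a seasonal period; the synthetic series and the argmax heuristic are illustrative assumptions, not part of the interview:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (a simple correlogram)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / var
                     for k in range(max_lag + 1)])

# synthetic daily series with a weekly (period-7) seasonal component plus noise
rng = np.random.default_rng(0)
t = np.arange(140)
series = 10 + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, len(t))

corr = acf(series, 21)
season = int(np.argmax(corr[1:])) + 1  # lag with the strongest autocorrelation
print(season)  # the weekly period, 7
```

Peaks in the correlogram at multiples of a lag are a quick hint of periodicity before reaching for heavier decomposition tools.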
RS/ Q3) Do you observe significantly unbalanced classes in anomaly detection? If yes, what is your approach?
NK) While there is an imbalance between anomalous and non-anomalous observations, in our use case we do not have sufficient labeled training data. Thus, at this point in time, we are not using any supervised classification approaches to classify observations as anomalous or non-anomalous. We rely on statistical tests and simple heuristics with thresholds to identify those anomalies. For example, if a predicted value with a high confidence value diverges significantly from the actual value within a sliding window, then the case is marked as anomalous.
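A minimal sketch of such a sliding-window threshold heuristic, assuming a trailing-window z-score as the divergence measure; the window size, threshold and order counts are made-up illustrations, not the team's actual settings:

```python
import numpy as np

def sliding_anomalies(values, window=7, z_thresh=3.0):
    """Flag points that deviate strongly from the trailing window's mean,
    a simple stand-in for the threshold heuristic described above."""
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for i in range(window, len(values)):
        hist = values[i - window:i]
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(values[i] - mu) > z_thresh * sigma:
            flags[i] = True
    return flags

# daily order counts for one merchant; day 7 spikes far above the window
orders = [100, 98, 103, 101, 99, 102, 100, 250, 101, 99]
flagged = np.flatnonzero(sliding_anomalies(orders))
print(flagged)  # [7]
```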
RS/ Q4) If I follow your abstract correctly, your work involves time series + NLP + anomaly detection. Is this the data with which you work? Can you share your approach to working with the data you encounter?
NK) Sequence modeling algorithms used in NLP, for instance, can also be very useful for time series analysis, since in both cases the sequence of observations is essential for learning the underlying patterns and behaviours.
Recurrent Neural Networks (RNNs) allow us to model sequential behaviour with minimal feature engineering and are not limited to specific signal components such as trend, seasonality and cyclicity. RNNs predict the next `k` observations in the sequence. Predicted sequences with high confidence scores are compared with the actual values, and the ones with large deviations are treated as anomalies. We ran extensive experiments on 600+ different trends by treating order counts per merchant as observations along the time axis. Ultimately, we computed precision, recall and F1 score across all 600+ trends for the anomalies found. The experimental results showed overall 86% precision, 80% recall and 83% F1 score across the entire dataset for order counts.
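As a hedged illustration of how precision, recall and F1 could be computed per trend from anomaly flags (the flag lists below are made up for the example, not the interview's data):

```python
def prf1(actual, predicted):
    """Precision, recall and F1 for binary anomaly flags (1 = anomaly)."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# one trend's ground-truth vs. detected anomaly flags
actual    = [1, 0, 0, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
p, r, f1 = prf1(actual, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))
```

In practice these per-trend counts would be aggregated across all 600+ trends before computing the overall scores.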
RS/ Q5) Can you explain t-SNE and your contribution to the scikit-learn library? (http://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html )
NK) t-SNE is a well-known algorithm for visualizing high-dimensional data. The details of the algorithm can be found here. The scikit-learn library offers an implementation of t-SNE in its manifold package. One of its hyper-parameters, the 'perplexity', has a significant effect on the shape of the visualization. More specifically, depending on the data, higher perplexity values result in more meaningful visualizations. The contribution discusses different scenarios and the effects of perplexity on the shape of the visualization, and it provides code snippets for reproducing those effects.
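A small sketch of the perplexity effect using scikit-learn's `TSNE`; the dataset and perplexity values are chosen only for illustration here, and the linked example contains the actual contributed comparison:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data  # 150 samples, 4 features

embeddings = {}
for perplexity in (5, 50):
    # same data, different perplexity -> visibly different 2-D layouts
    embeddings[perplexity] = TSNE(n_components=2, perplexity=perplexity,
                                  random_state=0).fit_transform(X)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Plotting the two embeddings side by side is the easiest way to see how perplexity changes cluster tightness and separation.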
RS/ Q6) Can you explain recurrent neural nets and generalized additive regression models in the context of your work?
NK) Both generalized additive models (GAMs) and recurrent neural networks (RNNs) use historical temporal and sequential data for training and perform forecasting for the following month or year.
GAMs allow us to incorporate trend, seasonality and cyclicity as smoothing functions in the objective function. For instance, trend can be represented through a generalized logistic function, and both seasonality and cyclicity can be represented through variations of Fourier series.
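A minimal sketch of the seasonal part of that idea: fitting a trend plus Fourier-series seasonal terms by least squares. This is a plain linear fit standing in for a full GAM, and the period, Fourier order and data are made up for illustration:

```python
import numpy as np

def fourier_features(t, period, order):
    """Pairs of sin/cos terms, as in the seasonal component of a GAM."""
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

t = np.arange(200, dtype=float)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / 7)  # linear trend + weekly seasonality

# design matrix: intercept + linear trend + Fourier seasonal terms
X = np.column_stack([np.ones_like(t), t, fourier_features(t, 7, 3)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
resid = float(np.abs(y - fitted).max())
print(resid)  # near zero: the basis recovers the signal exactly
```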
Unlike GAMs, RNNs learn complex non-linear functions without requiring considerable feature engineering or pre-defined smoothing functions. In our experiments, we used a two-layer RNN with dropout and sequences of observations as inputs. The network predicts `k` data points following each input sequence.
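A hedged sketch of such a network in PyTorch; the layer sizes, dropout rate and `k` are illustrative assumptions, and an LSTM stands in for the unspecified RNN cell:

```python
import torch
import torch.nn as nn

class SeqForecaster(nn.Module):
    """Two-layer recurrent network with dropout that maps an input
    sequence to its next k values, as described above."""
    def __init__(self, hidden=32, k=3):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=hidden,
                           num_layers=2, dropout=0.2, batch_first=True)
        self.head = nn.Linear(hidden, k)  # predict k future points

    def forward(self, x):                 # x: (batch, seq_len, 1)
        out, _ = self.rnn(x)
        return self.head(out[:, -1, :])   # (batch, k)

model = SeqForecaster()
window = torch.randn(8, 28, 1)  # a batch of 28-step input windows
preds = model(window)
print(preds.shape)  # torch.Size([8, 3])
```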
Ultimately, an ensemble approach makes the final prediction by comparing the outputs and anomaly overlaps found by both approaches.
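The ensembling step could be sketched as a simple combination of per-timestep anomaly flags from the two models; the combination rule here is an assumption, since the interview does not specify how the overlap is used:

```python
def ensemble(gam_flags, rnn_flags, require_both=True):
    """Combine anomaly flags from the GAM and the RNN.
    require_both=True keeps only anomalies found by both models (overlap);
    require_both=False keeps anomalies found by either model."""
    op = (lambda a, b: a and b) if require_both else (lambda a, b: a or b)
    return [op(a, b) for a, b in zip(gam_flags, rnn_flags)]

combined = ensemble([True, True, False], [True, False, False])
print(combined)  # [True, False, False]
```

Requiring agreement trades recall for precision; taking the union does the opposite.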
RS/ Q7) How did you begin contributing to scikit-learn and Apache and what advice do you have for those in the community who would like to contribute to open source?
NK) I started contributing to Apache Spark while I was still working at IBM, and to scikit-learn while working at Slice, where I used it for one of my projects. Often, it is best to start by looking for limitations and then propose solutions that either improve or replace existing functionality or add a completely new one.
I'd advise those who want to contribute to start with something small and to pay close attention to detail and to the usefulness of the contribution. Open source contributions sometimes require a lot of patience and long discussions. Nevertheless, they are very useful and help build different perspectives on problem solving.
RS/ Q8) Here are a few of the “hot topics” in data science. What are your high-level thoughts, in 1-2 sentences, for each topic.
I believe that no matter how practical and theoretically sound an algorithm is, it is reliable only under certain constraints and assumptions. Outside those assumptions and constraints, its expected behaviour is not guaranteed.
Reproducibility is key, especially in research and experimentation. It allows us to gain a better understanding of algorithms and processes, and the better we understand them, the more likely we are to reuse or improve them in the future.
Diversity is important in all aspects. It contributes to balance in general and creates more harmonious and creative environments.
The open source community is a great resource both for reproducing issues and for sharing knowledge. It unites engineers and scientists across the globe who share a common goal: to use and continuously improve those tools and technologies.
Opportunities for entry level data scientists
There are many resources available on the internet about data science. I’d recommend taking online courses and working on real world problems.
RS) Thank you for participating in this interview.
Narine Kokhlikyan is a machine learning engineer at Rakuten Intelligence, where she builds end-to-end machine learning solutions that extract insights from millions of users' mailboxes. Most recently, she has also developed approaches for anomaly detection in time series using state-of-the-art deep learning algorithms and generalized regression models. Narine studied at the Karlsruhe Institute of Technology in Germany, where her research focused on cognitive systems and natural language processing. She is also an enthusiastic contributor to open source software packages such as scikit-learn and Apache Spark.
Reshama Shaikh (https://reshamas.github.io) is a data scientist/statistician and MBA with skills in Python, R and SAS. She worked for over 10 years as a biostatistician in the pharmaceutical industry. She is also an organizer of the meetup groups NYC Women in Machine Learning & Data Science (http://wimlds.org) and PyLadies. She received her M.S. in statistics from Rutgers University and her M.B.A. from NYU Stern School of Business.