One of our MLconf Program Committee Members, Reshama Shaikh, recently interviewed Kavita Reddy, Data Science Manager at Second Measure in San Mateo, CA.
RS/ Q1) Tell us briefly about yourself and your work.
KR) I’ve been at Second Measure for 3 years now and joined as the first data scientist. I am grateful that I love going to work every day. My work involves playing with data, finding biases and thinking about what to do about them, working with clients on custom research projects, thinking about what to build in the product, and managing an awesome team. Being a data scientist at a company where data is the product is a fantastic position to be in.
Prior to Second Measure, I was a research psychologist studying what people do and why they do it. That core focus hasn’t changed — I’m still trying to figure out what people are doing and why they’re doing it. It’s just that now I’m studying their actual spending behavior as opposed to their responses to surveys.
RS/ Q2) This is the “About the Company” on Second Measure: Second Measure analyzes billions of credit card transactions to answer real-time questions on consumer behavior.
Are you able to share some of the real-time questions that you investigate?
KR) We have billions of bank and credit card transactions from a panel of millions of anonymized US consumers from the past several years. This data drives core strategic decisions for our clients: how to invest multi-billion dollar portfolios, how to allocate nine-figure marketing budgets, and how to prioritize large partnerships and acquisitions. The most incredible thing about this data is the breadth of questions we can answer and the depth we can go into.
The analyses our clients do or that we do for them can be quite intensive, but our blog highlights a flavor of questions we can answer:
- What percent of Tesla Model 3 reservation deposits have been refunded?
- How is Prime turning Amazon into a subscription-driven business?
- What effect does becoming a Stitch Fix customer have on one’s overall retail spend?
Answering general questions like these has helped us become a trusted source of data on company performance, and we are often cited by the Wall Street Journal and the Financial Times, among others.
Beyond looking at company performance, we can also examine the impact of economic events on a macro level, or the impact of life events on saving and spending behavior on a micro level.
RS/ Q3) For those of us who are unfamiliar, what is “panel data” in credit card transactions? (Note: refer to abstract)
KR) Panel data refers to data that comes from repeated measurements of a specific set of people meant to represent a population at large. It’s a method that allows for tracking behavior within individuals over time. Because observations from the same individual are not independent, there’s an entire class of techniques designed to analyze this kind of data.
In our case, we’re studying a panel of millions of anonymized US consumers over several years. The billions of transactions we get from this panel give us insight into consumer behavior.
RS/ Q4) Can you list some of the KPIs (Key Performance Indicators) that are meant to be tracked?
KR) The goal of this data is to help investors understand company performance. To that end, the main KPI is sales — how much a company sold to consumers over a given time period. Trends in transaction volume and number of customers complement sales trends. Beyond that, we surface growth and retention metrics, often by way of diving into cohorts. We also surface analyses that take a deeper look into a company’s customers, like where else these customers like to shop and their propensity to do so. While sales is what matters most to our clients, they often need to drill down on the trends into growth, retention, and the nature of the customers to understand how to value a company.
However, if one were to run these analyses against the raw data, it would have a low correlation with the company-reported numbers and it would generally be a mess. The data has to be adjusted and normalized before it can be useful.
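To make the KPIs in this answer concrete, here is a minimal Python sketch that computes sales, transaction volume, customer counts, and cohort retention from transaction rows. The records, field layout, and function names are made up for illustration and are not Second Measure's pipeline. The sketch runs on raw rows, so it skips the adjustment and normalization that the answer says real data needs.

```python
from collections import defaultdict

# Hypothetical, made-up transactions: (member_id, month, amount_usd).
transactions = [
    ("m1", "2018-01", 20.0),
    ("m1", "2018-02", 35.0),
    ("m2", "2018-01", 50.0),
    ("m3", "2018-02", 15.0),
    ("m3", "2018-03", 25.0),
    ("m1", "2018-03", 10.0),
]


def monthly_kpis(rows):
    """Return sales, transaction count, and distinct customers per month."""
    sales = defaultdict(float)
    counts = defaultdict(int)
    customers = defaultdict(set)
    for member, month, amount in rows:
        sales[month] += amount
        counts[month] += 1
        customers[month].add(member)
    return {
        month: {
            "sales": sales[month],
            "transactions": counts[month],
            "customers": len(customers[month]),
        }
        for month in sorted(sales)
    }


def cohort_retention(rows):
    """Share of each first-purchase cohort that transacts in each later month."""
    first_month = {}
    active = defaultdict(set)
    for member, month, _ in sorted(rows, key=lambda r: r[1]):
        first_month.setdefault(member, month)  # earliest month seen per member
        active[month].add(member)
    cohorts = defaultdict(set)
    for member, month in first_month.items():
        cohorts[month].add(member)
    months = sorted(active)
    return {
        cohort: {
            month: len(members & active[month]) / len(members)
            for month in months
            if month >= cohort
        }
        for cohort, members in cohorts.items()
    }


kpis = monthly_kpis(transactions)
retention = cohort_retention(transactions)
print(kpis["2018-01"])
print(retention["2018-01"])
```

In this toy data, half of the January cohort is still transacting in February and March. Cohort views like this are one way to drill from a top-line sales trend into retention.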
RS/ Q5) What are the big data tools you use to manage data of this size, billions of transactions?
KR) Our pipeline is pretty specific to the analysis at hand, but has a standard outline: AWS tools, Spark, lots of custom Python and SQL. At the end of the day, our clients use Excel, so we have to distill these billions of transactions into something digestible and analyzable at a smaller scale.
RS/ Q6) What are some of the unsupervised methods you use to evaluate churn?
KR) We’re not given much information on panel members or their accounts so we have to infer many things, like when they churn. Yet, defining churn and discovering variability in it is critical to normalizing the data.
Furthermore, our company is about enabling our clients to find their own stories in data. So, there’s always a tension between robust, unsupervised methods and interpretable ones that map this variability back to demographically defined subgroups.
That said, clustering methods are helpful. We can use such methods to identify and define groups of consumers who behave similarly based on frequency of transacting, companies where they shop, income and spending level, and geographic location, to name a few dimensions. From there, we can model transaction volume and then monitor the groups of consumers for any drop-offs over time.
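The last step, monitoring groups of consumers for drop-offs, could look something like the Python sketch below. It assumes the group labels already exist from an earlier clustering step that is not shown. The group names, counts, trailing window, and 80% threshold are all hypothetical choices for illustration, not the method described in the interview.

```python
# Hypothetical monthly transaction counts per consumer group. The group labels
# are assumed to come from an earlier clustering step that is not shown here.
monthly_volume = {
    "urban_high_frequency": [1200, 1180, 1210, 1190, 1205],
    "suburban_low_frequency": [400, 410, 395, 405, 240],
}


def flag_dropoffs(volume_by_group, window=3, threshold=0.8):
    """Flag groups whose latest volume is below `threshold` times the mean
    of the preceding `window` periods. Returns {group: latest / baseline}."""
    flagged = {}
    for group, series in volume_by_group.items():
        if len(series) <= window:
            continue  # not enough history to form a baseline
        baseline = sum(series[-window - 1:-1]) / window
        if baseline == 0:
            continue  # avoid dividing by zero for inactive groups
        ratio = series[-1] / baseline
        if ratio < threshold:
            flagged[group] = round(ratio, 3)
    return flagged


dropoffs = flag_dropoffs(monthly_volume)
print(dropoffs)
```

A flagged group is only a prompt for investigation. A drop could reflect panel churn, a seasonal pattern, or a real change in spending, and telling these apart takes further modeling.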
RS/ Q7) What external data sources do you utilize to enhance your models?
KR) We mainly use publicly available data like census reporting to understand and correct for biases. Beyond that, we use data on banking holidays to enhance our models of when transactions occur; we use data on where retail companies are located to enhance our models on where transactions occur. With limited external sources of ground truth out there, we have to get creative with data validation and also be okay with some amount of uncertainty. The main source we use is quarterly reported revenue from public companies and one of our ultimate success metrics is our ability to match these numbers.
RS/ Q8) Can you share your methodology on post-stratification bias management?
KR) I’ll start by saying that while “alternative data” is relatively new in investing, panel data has been used in the social sciences and many other fields for a long time. There is a lot of literature to lean on for bias management and a lot of advancements due to better computing and the availability of large datasets.
Poststratification is a method that allows you to adjust weights to account for over- or underrepresented sampling from the population of interest. In our case, we’re working mostly with large financial institutions and with people who have access to credit card accounts. Moreover, it’s safe to assume that sampling is not random. This can mean that lower income individuals are underrepresented, for example, or that urban individuals are overrepresented.
To account for this, we identify or infer as many things as we can about who is in the panel (we’re not given demographics on panel members). For example, we can infer location based on where the panel member makes transactions and we can infer income based on deposits to the panel member’s bank accounts.
Next, you have to find out whether there is bias in your dataset based on the characteristics you were able to infer. To do so, you have to get appropriate, externally available datasets on the population of interest and then test whether your panel differs significantly from this population on various demographics.
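One common way to run such a test is a chi-square goodness-of-fit test that compares panel counts per stratum with the counts expected from population shares. The Python sketch below uses made-up income bands, counts, and shares. The interview does not say which test is actually used, so the choice of test is an assumption here.

```python
# Hypothetical counts of panel members by inferred income band, and
# hypothetical population shares (for example, taken from census tables).
panel_counts = {"low": 150, "middle": 500, "high": 350}
population_share = {"low": 0.30, "middle": 0.50, "high": 0.20}


def chi_square_statistic(observed, expected_share):
    """Pearson chi-square goodness-of-fit statistic."""
    total = sum(observed.values())
    stat = 0.0
    for stratum, count in observed.items():
        expected = total * expected_share[stratum]
        stat += (count - expected) ** 2 / expected
    return stat


stat = chi_square_statistic(panel_counts, population_share)

# Critical value of the chi-square distribution with 2 degrees of freedom
# (3 strata minus 1) at the 5% significance level.
CRITICAL_VALUE_DF2 = 5.991
panel_is_biased = stat > CRITICAL_VALUE_DF2
print(round(stat, 2), panel_is_biased)
```

With a panel of millions, even tiny discrepancies will be statistically significant, so in practice the size of the gap per stratum matters at least as much as the test result.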
Then, you can make adjustments to the panel data based on the magnitude of the discrepancy between the panel members’ characteristics and the population’s. Specifically, you can make adjustments by weighting panel members according to how much of the population they make up and by how much they were over- or under-sampled.
You’ll want to adjust for bias in the panel from the get-go, and also monitor for bias that emerges as panel members drop out.
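As a rough illustration of the weighting step, the Python sketch below gives each stratum a weight equal to its population share divided by its panel share, then compares a raw and a reweighted estimate of average spend. The strata, counts, shares, and spend figures are invented, and a real adjustment would typically cross several characteristics such as income and location.

```python
# Hypothetical panel composition, population shares, and per-member spend.
panel_counts = {"low": 150, "middle": 500, "high": 350}
population_share = {"low": 0.30, "middle": 0.50, "high": 0.20}
mean_spend = {"low": 10.0, "middle": 20.0, "high": 40.0}  # USD per member


def poststratification_weights(counts, pop_share):
    """Weight per stratum = population share / panel share."""
    total = sum(counts.values())
    return {s: pop_share[s] / (counts[s] / total) for s in counts}


def weighted_mean(values, counts, weights):
    """Mean of per-stratum values, weighting each member by its stratum weight."""
    numerator = sum(values[s] * counts[s] * weights[s] for s in counts)
    denominator = sum(counts[s] * weights[s] for s in counts)
    return numerator / denominator


weights = poststratification_weights(panel_counts, population_share)
unit_weights = {s: 1.0 for s in panel_counts}

raw_mean = weighted_mean(mean_spend, panel_counts, unit_weights)
adjusted_mean = weighted_mean(mean_spend, panel_counts, weights)
print(weights)
print(raw_mean, adjusted_mean)
```

Here the underrepresented low-income stratum gets a weight above 1 and the overrepresented high-income stratum a weight below 1, which pulls the estimated average spend down from the raw figure. If panel members drop out unevenly, the counts change and the weights have to be recomputed.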
RS/ Q9) What are your thoughts on these two trending topics:
a) data privacy
b) algorithm bias
KR) We take care to ensure that our data is free of personally identifiable information, as do our partners. We never share transaction-level data and we only characterize behavior in aggregate from large samples. Privacy is a big issue in this space and I’d expect many things around data collection and sharing to change as it becomes more widespread and as more groups get involved in privacy issues.
Regarding algorithm bias — this was part of what motivated me to join tech. In my years as a social psychologist, I studied a variety of implicit and explicit biases that people bring to the table. Before machines learn, we have to give them training data and because of our own biases, we can feed them biased data. Social scientists are often particularly well-suited to understand this problem, but different fields and backgrounds have their own ways of dealing with non-random, biased data. Diversity of thought within teams is so important to highlighting and mitigating algorithm bias.
Within the alternative data space, we know that sampling isn’t totally random and we know that biases exist and emerge over time. As a result, we have to constantly revisit our model assumptions. Regular reviews of assumptions are totally different from regular reviews of results. This helps mitigate algorithm bias, but I’m not sure that anyone is immune from it. It’s up to everyone to question their assumptions and their data.
Kavita Reddy is a data scientist at Second Measure, where she leads the data services team. As the first data hire, she has worked on everything from data enrichment and de-biasing to metric development and custom research for clients. Kavita has spent more than a decade working with behavioral data in both the public and private sectors. She holds a PhD in Psychology from Columbia University and a BA in Psychology and Economics from UCLA.
Reshama Shaikh is a data scientist/statistician and MBA with skills in Python, R and SAS. She worked for over 10 years as a biostatistician in the pharmaceutical industry. She is also an organizer of the meetup groups NYC Women in Machine Learning & Data Science and PyLadies. She received her M.S. in statistics from Rutgers University and her M.B.A. from NYU Stern School of Business.