One of our MLConf Program Committee members, Reshama Shaikh, recently interviewed Cassie Kozyrkov, Chief Decision Scientist at Google.
Questions & Answers
RS/ Q1) Tell us briefly about yourself and your work.
CK) I serve as Chief Decision Scientist for Google Cloud. For most of my career, I had job titles related to the data sciences: analyst, statistician, statistics lecturer, and data scientist. Data is beautiful, but I also believe it needs a reason. For me, that reason is decision-making. It’s through our actions that we affect the world around us. Although I studied statistics both in college and grad school, I worked to augment my training by earning degrees in the decision disciplines: economics, psychology, and neuroscience (and I’m always learning as much as I can about other perspectives on decision-making).
Early in my career, I started taking a decision-oriented approach to my work wherever possible and saw the incredible boost that incorporating wisdom from the social and managerial sciences can bring to applied data science. It turns out that there’s a lot of synergy we miss out on if we put up arbitrary walls around disciplines.
At Google, I had the benefit of an extremely collaborative environment that encouraged me to share what I knew so others could build on it. We didn’t have a name for it when we started, but eventually this approach to applied data science that incorporates the social and managerial sciences would become known to us at Google as Decision Intelligence. Now that I’m with Google Cloud, my mission is to help the whole world make their data even more useful and to share these ideas so everyone can benefit.
RS/ Q2) What are a couple of your favorite projects you have worked on at Google?
CK) Different projects are favorites in different ways. I loved doing statistical consulting within Google because it gave me a broad view of what the company is up to and it’s great fun to be involved in many different things at once. Of course, as a consultant you don’t get the same deep exposure as you do if you are the main specialist working for several years on a single project.
Another favorite was working to measure and eliminate duplicate business listings on Google Maps. This is a fun example because it’s easy to understand – if you use Google Maps, you’d probably not be a fan of seeing your search for “coffee near me” show you 20 slightly different results that all point to the same physical cafe, right? – but it’s pretty hard to define a good metric for scoring the efficacy of the machine learning system we built to remove potential duplicates. If you’d like to experience some applied mind-bending, take a moment to think about how you’d define a score for duplication in a database if the number of duplicates could be any integer. How would you design a sampling scheme to get at that metric?
I also have a special love for data splitting, power analyses, and experiment sizing. Always great fun, as are the data-mining, feature engineering, and applied machine learning projects. Wow, it’s so hard to choose! Data scientists are lucky to have so many fabulous avenues to explore!
But probably best of all was figuring out ways to make data science more useful. From designing data science training for thousands of Googlers to figuring out best practices in process and team composition (i.e. how to pass the baton between team members with diverse skills to make data as useful as possible), the challenge is extremely rewarding.
RS/ Q3) Do you primarily use R, Python or another coding language?
CK) My guess is that people tend to love what they grew up with in college/grad school, and for me that was R. It’s hard not to compare a new language with your comfy one; when I started learning Python, I remember how my jaw dropped at the relative fuss it takes to get the equivalent of R’s one-sided prop.test(). But Pythonistas have their legitimate reasons for singing Python’s praises, one being the scope of what you can do with the language and how efficiently you can do it, especially if your interests span beyond data science. I actually tried to force myself to go cold turkey with R for a year to give myself time to get really good at Python… and broke several months shy of the goal because R was a little too tempting for one of my projects. My verdict? Both languages are great and it’s impolite to bully one another about which is The Chosen Tongue. Let’s not do that! I just happen to be more comfy with one than the other as a result of my statistical upbringing.
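For readers curious about the comparison, here is a minimal standard-library Python sketch of a one-sided, one-sample proportion test with a continuity correction, which is roughly the calculation behind a one-sided prop.test() call in R. The function name and parameters are hypothetical, chosen for illustration only, and the output has not been checked digit-for-digit against R.

```python
import math


def one_sided_prop_test(successes, trials, p0=0.5, alternative="greater", correct=True):
    """Normal-approximation test of a single proportion against p0.

    Hypothetical helper for illustration; not part of any library.
    Returns (z, p_value) for the chosen one-sided alternative.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must be strictly between 0 and 1")
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")

    # Distance between observed and expected success counts under the null.
    diff = successes - trials * p0
    # Continuity correction, capped so it never flips the sign of diff.
    yates = min(0.5, abs(diff)) if correct else 0.0
    z = math.copysign(abs(diff) - yates, diff) / math.sqrt(trials * p0 * (1.0 - p0))

    # Upper-tail probability of a standard normal at z.
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))
    p_value = upper_tail if alternative == "greater" else 1.0 - upper_tail
    return z, p_value


# Example: 60 successes in 100 trials, testing whether the true rate exceeds 0.5.
z, p = one_sided_prop_test(60, 100, p0=0.5, alternative="greater")
print(f"z = {z:.3f}, one-sided p-value = {p:.4f}")
```

In practice, a Python user would more likely reach for a statistics package than write this by hand; the sketch is only meant to show what the test computes.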
RS/ Q4) We met a little over 3 years ago. It is only recently I have seen your blogs popping up in my LinkedIn feed, and I have enjoyed reading them. What motivated you to start blogging?
CK) A colleague of mine suggested I try my hand at it. People had been urging me to write a book for a while now and I figured a blog was a bit like a small version of a book. Testing at small scale before diving into the entire whopper of a thing seems like a good idea in general, especially to the statistically-minded. I had no idea there would be such an incredible response to it. I’m thrilled by the delight that the community has found in it. It’s great motivation to write more!
RS/ Q5) You are passionate about statistics and statistical education, having designed the most popular statistical training course within Google. As a statistician myself, I share your enthusiasm, and I appreciate especially your blogs on statistics. If you could give educators one piece of advice, what would it be?
CK) Be unboring! A principle that guides my teaching is this: once you have earned your students’ attention, then, and only then, have you earned the right to dive into the details. Otherwise the student is just going to forget them and it’s a waste of everybody’s time. So it’s up to educators to make it unboring. If students are only learning something because you told them to learn it, it’s a miserable experience all around. Life is short, why not have more fun with it? Your students will learn more and you might just find you’re enjoying the experience much more too!
RS/ Q6) Is the experience of teaching and learning statistics as daunting as is commonly believed?
CK) The experience of learning statistics shouldn’t be daunting, but perhaps the experience of teaching it should be… because our standards ought to be higher. Understanding the material is not enough for being a great teacher. To really connect with an audience, you’ve got to embody three roles in one: you’ve got to be an expert (to master the material that you are going to share), a teacher (to design the lesson), and an entertainer (to delight students with a performance that makes them eager for more).
RS/ Q7) Your explanations of statistical terms in your blog, Statistics for People in a Hurry, are outstanding. Is there a plan to promote this approach to mainstream statistics education?
CK) First off, I wouldn’t want to assume that my approach is the right approach. What it does embody, though, is a desire to modernize. I’ve never been a fan of teaching “the way it has always been done” – that seems like a flimsy reason for doing anything. But my approach is not the only one that tries to do better and be a little more brain-friendly than the old-fashioned textbook-copyout approach. There are many others and I applaud them all for their efforts.
In order for good teaching to be mainstream, society has to value good teachers. If universities prefer to value research abilities above teaching abilities when hiring faculty, I’m sure you can guess the punchline without my saying it…
RS/ Q8) How do you think the American Statistical Association (ASA) could be more involved with the data science / Python / R communities?
CK) I’ve met quite a few statisticians who don’t think they belong in the machine learning discipline. I’m not sure why this happens, but perhaps the ASA might be able to make a difference by highlighting bridges between applied statistics and applied machine learning. After all, if you want to try to make a working solution, apply machine learning, but if you want to be sure your working solution actually works, you’re going to need statistics too. In the AI realm, machine learning and statistics are lost without one another. (Ah, and I see the analysts waving frantically at us. Yes, analysts, you’re vital as well, especially if we need to figure out how to zero in on a great solution quickly. Without you, we might end up chasing our tails for several lifetimes without getting anywhere…)
As for involvement in data science / R / Python communities, well, you know my views are even stronger – I’m all for lowering the arbitrary walls we’ve made around not only communities but entire disciplines so everyone can collaborate more. I’m all for more involvement and collaboration all around.
RS/ Q9) What are 3 pieces of advice to offer to aspiring data scientists? What are your favorite resources that you would recommend to them?
CK)
- Useful is worth more than complicated.
- Data quality is worth more than method quality.
- Communication skills are worth more than yet another programming language.
Book and resource recommendations are always hard, since it depends on the style of data guru you want to be.
If you’re interested in analytics and you haven’t heard of Tufte’s books… happy belated birthday to you!
If you’re interested in statistics, please read books on the history of your discipline (The Lady Tasting Tea is a nice start) as well as on epistemology. I also suggest working through intro texts in both Frequentist and Bayesian statistics so you see both sides of the philosophical approach. The graduate schools I attended start their budding statisticians on Hoff’s A First Course in Bayesian Statistical Methods and Casella & Berger’s Statistical Inference. I can definitely get behind these great choices!
Now that I’ve been waxing lyrical about textbooks, let’s try a submission in the videos category! Machine learning engineers might enjoy the Google Cloud Platform channel on YouTube for great practical resources. If you’re looking for something non-Google, I’m a fan of Siraj Raval’s videos because he goes to great lengths to entertain while he enlightens. I’m also perennially impressed by DataCamp’s pedagogical approach to teaching data science.
If you’re inclined towards the more researchy side of AI, start with Goodfellow, Bengio, & Courville’s Deep Learning textbook. If you’re a statistician looking for an easy stats-to-ML bridge book, start with An Introduction to Statistical Learning by James, Witten, Hastie, & Tibshirani.
Since data science is all three areas (machine learning, statistics, and analytics), if you’re aiming at becoming the real deal, consider browsing them all. Watching and reading is not enough, though. Make sure you get hands-on practice! If you don’t already have a dataset you want to dive into, I recommend trying out Kaggle.
Oh, and don’t forget to learn how the cognitive biases you carry around as a member of the human species trip up your ability to reason about data. Books like How We Know What Isn’t So, Predictably Irrational, and Thinking, Fast and Slow will keep you from taking yourself more seriously than you ought to. If you’re keen to go deeper, Glimcher’s Neuroeconomics: Decision Making and the Brain or Camerer’s Behavioral Game Theory might just feel like the best thing since sliced bread.
RS/ aside) Casella & Berger’s Statistical Inference is one of my favorite books from grad school, and I often reference it.
RS/ Q10) There is a dire need for both experienced data scientists and data science leaders. What can the community do to help realize this outcome?
CK) Leadership in data science takes a double black belt. If you want this job, you need to understand the business (from impact to strategy to organizational politics to resource allocation to people management to all the rest of it) and understand data science as well. Then you have to put these two together. That’s harder than it sounds! The skills are very different and they are rarely developed in a single individual. We need people who are eager to do both and we need to appreciate that part of that is encouraging young minds to consider that joint educational path. As it stands, the emphasis in data science training is skewed towards the technical side. I’m saddened when I hear students pronouncing “soft skills” like it implies “soft-minded.” As a community, we should start consciously discouraging that messaging if we want to make progress. To invest in a new batch of data science leaders, we need to attract them to the idea of earning both black belts. This means we need an alternative definition of what it means to be impressive in data science. Let’s convince the new crop that it’s not just about how deep their measure theory goes.
RS/ Q11) Here are a few of the “hot topics” in data science. What are your high-level thoughts, in 1-2 sentences, for each topic?
- “The world represented by your training data is the only world you can expect to succeed in.”
- “Data are not magic. To learn from examples, the examples have to be good.”
- “If you keep putting innocent hypotheses on trial, eventually one’s going to look guilty. That’s not science, it’s the search for a slice of toast that reminds you of the Mona Lisa. If you only publish the slice with the face and stay silent about the rest of your attempts, don’t be surprised when others fail to reproduce your findings.”
- “In applied machine learning, it’s a terrible idea to get multiple carbon copies of the same worker. You need a diversity of skills and perspectives to successfully carry out an applied machine learning project. You need different kinds of people with different skills.”
- “It is really hard to think of everything alone; we’re not that creative. The more identical your teammate is to you, the less ground your ideas will cover together. Diversity helps creativity flourish.”
- “The community as a whole is so much stronger and more creative than any one of us on our own.”
- “The way to get the best possible tool/algorithm is to share your prototype with the community and see how they improve on it.”
Opportunities for entry level data scientists
- “Focus less on the title and more on the value that you can add. The world is generating data like never before, so it’s time for all of us to work to make it more useful.”
- “Outside research/academic settings, don’t be too focused on complex methods. If you can make your company several million dollars with one histogram, you’re a winner! Do it and be proud of yourself. Save complex solutions for when simple doesn’t work.”
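The remark above about putting innocent hypotheses on trial can be made concrete with a little arithmetic. The following is a minimal sketch, not something from the interview: it assumes every tested hypothesis is truly null and that the tests are independent, and the function name is hypothetical.

```python
def chance_of_false_alarm(num_tests, alpha=0.05):
    """Probability that at least one of num_tests true null hypotheses
    looks 'guilty' at significance level alpha.

    Assumes independent tests (an illustrative simplification).
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # Each test stays quiet with probability (1 - alpha); all stay quiet
    # with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


# How quickly a spurious "face in the toast" becomes likely.
for m in (1, 5, 20, 100):
    print(f"{m:>3} tests: {chance_of_false_alarm(m):.2f}")
```

Under these assumptions, testing around twenty innocent hypotheses at the 0.05 level already makes at least one false alarm more likely than not, which is why reporting only the "guilty" result misleads readers.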
RS/ Q12) When I googled you in preparation for this interview, I saw that you have been speaking and traveling extensively as part of your role as a thought leader in the field. The schedule seems grueling. What advice can you share in effectively managing this schedule?
CK) My secret is that I love traveling. Whenever I board a plane, I feel like I’m getting away with something. I’d better not tell anyone – when the Fun Police find out how much fun I’m having, they might try to stop me! I get a lot of energy from being in a new place where the colors are different, where the air smells different, and, for bonus points, where I can’t understand a thing anyone is saying. That energy helps me keep up an intense schedule.
Being on the road constantly is still exhausting, even if you love it as much as I do. The real secret is knowing your limits and having clear boundaries and priorities. It’s very easy to want to say yes to everything (like answering every single email and message), but it’s important to save your strength for what matters. Focus on the highest priority tasks and forgive yourself for dropping some of the others.
Schedule aside, here are two life-changing words of advice for frequent travelers: compression socks!
RS) Thank you for participating in this interview.
Bio for Cassie Kozyrkov
As Chief Decision Scientist at Google Cloud, Cassie Kozyrkov advises leadership teams on decision process, AI strategy, and building data-driven organizations. She works to democratize statistical thinking and machine learning so that everyone – Google, its customers, the world! – can harness the beauty and power of data. She is the force behind bringing the practice of Decision Intelligence to Google and she has personally trained over 15,000 Googlers in machine learning, statistics, and data-driven decision-making. Before her current role, she served in Google’s Office of the CTO as Chief Data Scientist. Prior to joining Google, Cassie worked as a data scientist and consultant. She holds degrees in mathematical statistics, economics, psychology, and cognitive neuroscience. When she is not working, you are most likely to find Cassie at the theater, in an art museum, exploring the world, or curled up with a good novel.
Bio for Reshama Shaikh
Reshama Shaikh is a data scientist/statistician and MBA with skills in Python, R and SAS. She worked for over 10 years as a biostatistician in the pharmaceutical industry. She is also an organizer of the meetup groups NYC Women in Machine Learning & Data Science (http://wimlds.org) and PyLadies. She received her M.S. in statistics from Rutgers University and her M.B.A. from NYU Stern School of Business.
- Twitter: @reshamas