Interview with Metis Senior Data Scientist Rumman Chowdhury and Metis bootcamp graduate Nathan Wieneke

Our past Technical Chair interviewed Metis Senior Data Scientist Rumman Chowdhury and Metis bootcamp graduate Nathan Wieneke about their 12-week data science bootcamp and some of the interesting projects their students have been working on, including a student success story.

Metis is an intensive 12-week data science bootcamp that focuses on teaching practical data science. Our curriculum is project-based, so our graduates enter the job market with a portfolio of work they thought of and executed themselves. Students are given the skills and tools to source their own data and to conceive and execute their own projects. During that time, instructors and TAs are available to help advise and guide. Nate was one of the graduates from the Spring cohort in San Francisco. His project is an excellent application of practical Natural Language Processing.

What is the purpose of the project and its broader business context, and why is it a good application of NLP beyond novelty?

RC) Tweet Groups was a business-minded final project because it could be used in consumer segmentation to drive decisions. Traditionally, consumer segmentation is done on the entire consumer base. Because of how Tweet Groups works, this model can be applied to a brand, a product, or a campaign. The project scrapes Twitter for hashtags that are used alongside a target hashtag, then uses dimensionality reduction and clustering to group them. So, for example, if a company launches a new product, the project takes the product’s hashtag and all other hashtags used with it and finds subject areas that are interesting and relevant based on people’s self-provided preference areas.
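To make the first step concrete, here is a minimal sketch of counting which hashtags appear alongside a target hashtag. It is not the project's actual code: the sample tweets, the regular expression, and the function names `extract_hashtags`, `co_occurring_hashtags`, and `pair_counts` are all assumptions made for illustration, and a real pipeline would read tweets from the Twitter API rather than a hard-coded list.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed pattern: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(tweet):
    """Return the distinct lowercase hashtags in one tweet, sorted."""
    return sorted({tag.lower() for tag in HASHTAG_RE.findall(tweet)})


def co_occurring_hashtags(tweets, target):
    """Count hashtags that appear in the same tweet as `target`."""
    target = target.lower().lstrip("#")
    counts = Counter()
    for tweet in tweets:
        tags = extract_hashtags(tweet)
        if target in tags:
            counts.update(tag for tag in tags if tag != target)
    return counts


def pair_counts(tweets):
    """Count every unordered hashtag pair that shares a tweet."""
    counts = Counter()
    for tweet in tweets:
        counts.update(combinations(extract_hashtags(tweet), 2))
    return counts


# Hypothetical tweets, invented for this example.
sample_tweets = [
    "Cage diving today! #GoPro #sharks #diving",
    "My dog wore the harness cam #gopro #dogs",
    "Great whites up close #GoPro #sharks",
    "Unrelated post #coffee",
]

related = co_occurring_hashtags(sample_tweets, "#GoPro")
print(related.most_common())
```

The pair counts can be arranged into a hashtag-by-hashtag matrix, which is the kind of input that dimensionality reduction and clustering then operate on.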

From a business perspective, there are two things I like about it: it’s quick to execute, and it’s agnostic. The benefit of being quick to execute is that this segmentation can be performed on as large or small a group as necessary. It’s useful for segmenting your entire consumer base, or segmenting the people who were responsive to a specific campaign. Second, traditional segmentation methods can rely on surveys or incentivized consumer reviews, which can be biased. Similarly, these methods can ‘beg the question’ – which means that to structure these traditional analyses, you are often assuming certain characteristics are already important. By using hashtags, the input data is user-provided, and using a cluster analysis will group on quantified word similarity instead of subjective preferences or characteristics. Tweet Groups can potentially uncover user segments that the data scientist or marketer did not know of.

Data: how was it sourced, what are the problems with using Twitter data, and how is it a special case of language processing? Overarching question: is dimensionality reduction/topic modeling actually useful/beneficial to Twitter data analysis?

RC) Twitter data is noisy. There’s overuse and misuse of hashtags, and the key is parsing out the useful information buried within. Twitter text analysis can be challenging, as people use emoji and Twitter-specific acronyms and language. Dimensionality reduction is key to grouping similar sentiments that are expressed slightly differently. It’s particularly important to reduce dimensionality when working with clustering algorithms that do not take the number of clusters as a parameter; otherwise you can end up with a lot of clusters. Similarly, for algorithms where you do not set a density parameter, you can end up with small clusters that ultimately are not valuable for your end business goals.
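The point about density parameters can be illustrated with a small sketch. It uses scikit-learn's DBSCAN, chosen here only as a familiar example of a clustering algorithm that does not take the number of clusters as input; the interview does not say which algorithm the project used. The points are hypothetical 2-D coordinates standing in for hashtags after dimensionality reduction, and the `eps` and `min_samples` values are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical reduced coordinates: two dense groups of ten points each,
# plus three stray pairs that sit far away from everything else.
dense_a = [(0.1 * i, 0.0) for i in range(10)]
dense_b = [(10.0 + 0.1 * i, 10.0) for i in range(10)]
stray_pairs = [(5.0, 0.0), (5.1, 0.0), (0.0, 5.0), (0.1, 5.0), (5.0, 5.0), (5.1, 5.0)]
points = np.array(dense_a + dense_b + stray_pairs)


def cluster_summary(min_samples):
    """Run DBSCAN and return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=0.35, min_samples=min_samples).fit_predict(points)
    n_clusters = len(set(labels.tolist()) - {-1})  # label -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# A permissive density threshold lets every stray pair become its own cluster.
loose = cluster_summary(min_samples=2)

# A stricter threshold keeps only the two dense groups and marks the pairs as noise.
strict = cluster_summary(min_samples=5)

print("min_samples=2:", loose)
print("min_samples=5:", strict)
```

With the permissive setting the stray pairs show up as three extra tiny clusters; with the stricter setting they are labeled as noise and only the two dense groups remain.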

Why would you recommend using SpaCy for comparing different text cleaning modules?

RC) For this use case, SpaCy has a few advantages over other libraries. SpaCy is written in Cython, so it tends to run faster than pure-Python text modules on Twitter-style data. It handles emojis and lemmatizes text quite well. It also has a wide array of scalable features.

Based on your experience, which dimensionality reduction models do you prefer, and what are their pros and cons?

NW, with input from Rumman)

  1. LSA – fastest method, ‘flattens’ dimensions to grab the most variance
    1. Finds topics that explain most of text variance
    2. May not be as ‘human readable’ as other methods
    3. Can find topics in unbalanced distributions
    4. Can have trouble with misspellings, typos
  2. LDA – assumes a Dirichlet prior (which may not apply to Tweets)
    1. More resistant to overfitting
    2. Assumes topical distribution, so provides a “% belongs” for each topic
    3. No TF-IDF for LDA (which is crucial for parsing Twitter data)
    4. Tweets may only have 1 topic
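The contrast between the two methods can be sketched with scikit-learn. This is an illustration under assumptions, not the project's code: the four short "documents" are invented, two components are chosen arbitrarily, and LSA is implemented here as TF-IDF followed by `TruncatedSVD`, which is a common way to do it. LDA is fit on raw counts, in line with the note above that it does not use TF-IDF.

```python
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical documents: each one is the hashtags of a tweet joined by spaces.
docs = [
    "gopro sharks diving ocean",
    "gopro sharks cage diving",
    "gopro dogs harness park",
    "gopro dogs harness fetch",
]

# LSA: TF-IDF weighting, then a truncated SVD that keeps the two
# directions explaining the most variance.
tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
lsa_coords = lsa.fit_transform(tfidf)  # one row of coordinates per document

# LDA: fit on raw term counts rather than TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda_mix = lda.fit_transform(counts)  # one topic mixture per document

print("LSA coordinates (can be negative):")
print(lsa_coords.round(2))
print("LDA topic mixtures (each row sums to 1):")
print(lda_mix.round(2))
```

The LSA output is a set of coordinates with no direct probabilistic reading, which is part of why it can be less human-readable. The LDA output gives each document a share of every topic, which is the “% belongs” mentioned above.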

A final project demo can be viewed here
Check out the IPython notebook

You’ll notice in the final project demo that the cluster groups are easy to view and present. Here’s where the ‘art’ of data science comes in – it’s up to the scientist to parse out the necessary information and uncover insights. While the model is agnostic, it requires some degree of subject matter expertise or critical thinking to see the connections and potential actionable insights.

The example in the video demo is for the GoPro brand of action cameras. You’ll see two interesting groups come up that may have otherwise been lost in the noise. The first interesting group, in cluster 2, appears to be a large group of Twitter users who are tweeting about using a GoPro to film shark diving adventures. The second group appears in cluster 5, where Twitter users are discussing using GoPro with dog harnesses. By looking at these clusters, a marketer is now equipped with a list of users to engage, a list of hashtags that these users have tweeted, and the other necessary tools to engage their customers on Twitter.

Bios:

Rumman Chowdhury
Senior Data Scientist
Metis

Rumman comes to data science from a quantitative social science background. Prior to joining Metis, she was a data scientist at Quotient Technology, where she used retailer transaction data to build an award-winning media targeting model. Her industry experience ranges from public policy to economics and consulting. Her prior clients include the World Bank, the Vera Institute of Justice, and the Los Angeles County Museum of Art. She holds two undergraduate degrees from MIT and a Master’s in Quantitative Methods of the Social Sciences from Columbia, and she is currently finishing her Political Science PhD at the University of California, San Diego. Check out what she’s up to on her website, www.rummanchowdhury.com, or on Twitter @ruchowdh.

Nathan Wieneke
Data Scientist
Datum & Co.

Nathan comes from a background in Public Policy research with a Bachelor’s in Economics from UC San Diego. After organizing Congressional visits for clients in Washington DC, he returned to Silicon Valley to provide technical support to clients at Backblaze Online Backup. There he began utilizing Data Science to achieve business goals by building an unsupervised learning model to identify clusters of users and their support needs. Nathan excels at communicating complex concepts to non-technical clients, and is enthusiastic about using data to further engage customers. Nathan works for Datum & Co., a full-stack data science consulting group in San Francisco. You can see a live demo of Tweet Groups on the Datum & Co. website: https://datumco.io/d3/tweet-groups/

Link to Slideshow