Guest Blog by Michael Alcorn, Senior Software Engineer, Red Hat

Introduction

Although many companies today possess massive amounts of data, the vast majority of that data is often unstructured and unlabeled. In fact, the amount of data that is appropriately labeled for a specific business need is typically quite small (possibly even zero), and acquiring new labels is usually a slow, expensive endeavor. As a result, algorithms that can extract features from unlabeled data to improve the performance of data-limited tasks are quite valuable.

Most machine learning practitioners are first exposed to feature extraction techniques through unsupervised learning. In unsupervised learning, an algorithm attempts to discover the latent features that describe a data set’s “structure” under certain (either explicit or implicit) assumptions. For example, low-rank singular value decomposition (of which principal component analysis is a special case) factors a data matrix into three reduced-rank matrices whose product minimizes the squared error of the reconstructed data matrix.
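To make that concrete, here is a small NumPy illustration (not from the original post): keeping only the top k singular values yields the rank-k reconstruction with the smallest possible squared error.

    import numpy as np

    # A synthetic data matrix standing in for real data.
    rng = np.random.default_rng(0)
    D = rng.normal(size=(100, 20))

    # Full SVD, then a rank-k reconstruction from the top k singular values.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    k = 5
    D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # No other rank-5 matrix has a smaller squared reconstruction error.
    print("squared reconstruction error:", np.sum((D - D_k) ** 2))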

One common application of low-rank singular value decomposition is latent semantic analysis (LSA) for document clustering. The traditional way to represent documents in information retrieval applications is as a bag-of-words matrix, D, where the entry in row i and column j contains the count of the term indexed by j in the document indexed by i. Because the documents are represented as vectors, you can calculate the (dis)similarity between any two documents by looking at their Euclidean distance or cosine similarity. This might come in handy if you wanted to, for example, make content-based recommendations of web pages. However, using raw bag-of-words vectors will generally lead to suboptimal recommendations because certain irrelevant term matches can be overemphasized and synonyms are ignored. By performing a singular value decomposition of the bag-of-words matrix, documents and terms are factored into separate, low-dimensional representations that account for the correlations between terms (and documents). Because these lower-dimensional representations typically do a better job of capturing the “semantics” of the documents, they often lead to better clusters and recommendations.
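As a rough sketch of this pipeline (not code from the original post), scikit-learn’s CountVectorizer and TruncatedSVD can be combined to build the bag-of-words matrix, factor it, and compare documents by cosine similarity in the reduced space:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus; the first two documents share a topic, the third does not.
    docs = [
        "open source software for the enterprise",
        "enterprise linux and other open source tools",
        "recipes for baking sourdough bread",
    ]

    # Bag-of-words matrix: row i, column j holds the count of term j in document i.
    D = CountVectorizer().fit_transform(docs)

    # Truncated SVD gives low-dimensional document representations (LSA).
    doc_vectors = TruncatedSVD(n_components=2, random_state=0).fit_transform(D)

    # Cosine similarities in the latent space; the first two documents should score highest.
    print(cosine_similarity(doc_vectors))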

Representation Learning

While traditional unsupervised learning techniques will always be staples of machine learning pipelines, an alternative approach to feature extraction has emerged with the continued success of deep learning – representation learning. In representation learning, features are extracted from unlabeled data by training a neural network on a secondary supervised learning task.

Due to its popularity, word2vec has become the de facto “Hello, world!” application of representation learning. When applying deep learning to natural language processing (NLP) tasks, the model must simultaneously learn several language concepts: (1) the meanings of words, (2) how words are combined together to form concepts (i.e., syntax), and (3) how concepts relate to the task at hand. For example, you may have heard of the neural network that generates color names. Essentially, the model was trained to generate names for colors from RGB values using paint swatch data. While the concept was really neat, the results were fairly underwhelming; the model seemed to produce nonsensical color names and randomly pair names with colors. The task, as attempted, was simply too difficult for the model to learn given the paucity of data.

word2vec makes NLP problems like the one above easier to solve by providing the learning algorithm with pre-trained word embeddings, effectively removing the word meaning sub-task from training. The word2vec model was inspired by the distributional hypothesis, which suggests that words found in similar contexts often have similar meanings. Specifically, in its continuous bag-of-words (CBOW) form, the model is trained to predict a central word given the surrounding words in a window of a certain size. For example, for the sentence “Natural language processing can be difficult” and a window size of three, the input/target combinations for the neural network would be:

  • [“natural”, “processing”] → “language”
  • [“language”, “can”] → “processing”
  • [“processing”, “be”] → “can”
  • [“can”, “difficult”] → “be”
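
A tiny Python snippet (an illustration, not from the original post) that enumerates these pairs for a one-word context on each side:

    sentence = "natural language processing can be difficult".split()

    # Pair each interior word (the target) with its immediate neighbors (the context).
    for i in range(1, len(sentence) - 1):
        context = [sentence[i - 1], sentence[i + 1]]
        print(context, "->", sentence[i])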

When trained with enough data, the word embeddings tend to capture word meanings quite well, and are even able to perform analogies, e.g., the vector for “Paris” minus the vector for “France” plus the vector for “Italy” is very close to the vector for “Rome”. Indeed, by incorporating word2vec vectors into the color names model, I was able to achieve much more compelling results.
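
The gensim library makes it easy to experiment with this. The sketch below (assuming the gensim 4.x API, not code from the original post) trains a CBOW model on a toy corpus just to show the moving parts; the commented analogy query only produces sensible answers when the vectors come from a large corpus or pre-trained embeddings.

    from gensim.models import Word2Vec

    corpus = [
        "natural language processing can be difficult".split(),
        "word embeddings capture word meanings".split(),
    ]

    # sg=0 selects the CBOW architecture; window=1 mirrors the pairs listed above.
    model = Word2Vec(corpus, vector_size=50, window=1, min_count=1, sg=0, epochs=50)

    print(model.wv.most_similar("language", topn=3))

    # With vectors trained on enough data, the analogy from above looks like:
    # model.wv.most_similar(positive=["paris", "italy"], negative=["france"])  # ~ "rome"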

Representation Learning @ Red Hat: customer2vec

Red Hat, like many business-to-business (B2B) companies, is often faced with data challenges that are distinct from those faced by business-to-consumer (B2C) companies. Typically, B2B companies handle orders of magnitude fewer “customers” than their B2C counterparts. Further, feedback cycles are generally much longer for B2B companies due to the nature of multi-year contracts. However, like B2C companies, many B2B companies still possess mountains of behavioral data. Representation learning algorithms give B2B companies like Red Hat the ability to better optimize business strategies with limited historical context by extracting meaningful information from unlabeled data.

In many ways, web activity data resembles the data found in NLP tasks. There are terms (= specific URLs), sentences (= days of activity), and documents (= individual customers). Given this resemblance, web activity data is a natural candidate for doc2vec. doc2vec is a generalization of word2vec that, in addition to considering context words, considers the specific document when predicting a target word. This architecture allows the algorithm to learn meaningful representations of documents, which, in this instance, correspond to customers. Example Red Hat data can be seen in Figure 1, where each line represents a different customer, each number represents a distinct URL, and the special [START] and [STOP] tokens denote the start and end of a day of activity. Once the data is in this format, training the model is as simple as two lines of code with gensim (Figure 2). These customer representations can then be used to form better segments for sales campaigns and product recommendations.

Figure 1. Sample web activity data used to discover Red Hat customer vectors with doc2vec.

Figure 2. Training doc2vec on web activity data with gensim (top) and a function for fetching customer vectors (bottom).
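
Since the figures are not reproduced here, the snippet below is a minimal sketch of what that training looks like with gensim’s Doc2Vec (assuming the gensim 4.x API); the activity strings and customer tags are hypothetical stand-ins for the data in Figure 1.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # One "document" per customer: URL IDs with [START]/[STOP] marking each day of activity.
    activity = {
        "customer_1": "[START] 12 7 34 [STOP] [START] 7 19 [STOP]",
        "customer_2": "[START] 34 34 2 [STOP]",
    }

    documents = [TaggedDocument(words=text.split(), tags=[customer])
                 for customer, text in activity.items()]

    model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=20)

    def get_customer_vector(customer_id):
        """Fetch the learned representation for a given customer."""
        return model.dv[customer_id]

    print(get_customer_vector("customer_1")[:5])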

But representation learning models can be even more flexible than word2vec and doc2vec. For example, I’ve found that predicting at-bat outcomes for Major League Baseball batter/pitcher pairs can generate highly intuitive player embeddings.

Representation Learning @ Red Hat: Duplicate Detection

Red Hat is also exploring the applicability of representation learning for detecting duplicate support content. Here, “duplicate” doesn’t mean “exact copy”; rather, it indicates content that is conceptually redundant. Duplicate content can cause issues with information retrieval and introduce challenges when identifying support trends, so being able to effectively detect and remove duplicate content is important.

One strategy for duplicate detection is to look for similar LSA vectors, but there are a number of assumptions and design elements in LSA that limit its effectiveness. Specifically, the model (1) ignores word ordering, (2) implicitly assumes a Gaussian distribution on the term values by minimizing the squared error on the reconstructed matrix, and (3) assumes that the term values are generated from a linear combination of the latent document and term vectors. Neural networks relax these assumptions, which makes them good candidates for learning semantic representations of documents.

But the question remains of how to use a neural network for duplicate detection without any labeled data. To do so, we adapted the Deep Semantic Similarity Model (DSSM) developed by Microsoft Research for the task (a Keras implementation of the model can be found on my GitHub). The original motivation for the DSSM was to improve the relevance of search results by mapping documents and queries into a latent semantic space and using the cosine similarity of these vectors as a proxy for relevance. Essentially, the model compresses the documents and queries down to their essential concepts. To adapt the model for duplicate detection, we simply used document titles in place of queries and trained an otherwise nearly identical architecture (though we did use internally trained word2vec embeddings instead of letter-gram representations for words). The semantic document vectors were then used to find conceptually similar content.
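
As a rough illustration of the idea (a sketch under assumed inputs, not the author’s Keras implementation linked above), two small encoders can map a title and a document body into a shared semantic space, with their cosine similarity serving as the relevance/duplicate score. The sketch assumes each input has already been reduced to a fixed-size averaged word2vec embedding rather than the letter-gram features of the original DSSM.

    import numpy as np
    import tensorflow as tf

    EMBED_DIM = 300   # dimensionality of the (assumed) pre-trained word2vec embeddings
    LATENT_DIM = 128  # dimensionality of the shared semantic space

    def make_encoder(name):
        """A small MLP that compresses an averaged embedding into a semantic vector."""
        return tf.keras.Sequential(
            [
                tf.keras.layers.Dense(256, activation="tanh"),
                tf.keras.layers.Dense(LATENT_DIM, activation="tanh"),
            ],
            name=name,
        )

    title_in = tf.keras.Input(shape=(EMBED_DIM,), name="title_embedding")
    body_in = tf.keras.Input(shape=(EMBED_DIM,), name="body_embedding")

    # Cosine similarity of the two semantic vectors (normalize=True L2-normalizes them).
    similarity = tf.keras.layers.Dot(axes=1, normalize=True)(
        [make_encoder("title_encoder")(title_in), make_encoder("body_encoder")(body_in)]
    )

    model = tf.keras.Model([title_in, body_in], similarity)

    # Random inputs just to show the forward pass; real training would contrast a matching
    # title/body pair against sampled non-matching documents, as in the DSSM paper.
    rng = np.random.default_rng(0)
    print(model.predict([rng.normal(size=(2, EMBED_DIM)), rng.normal(size=(2, EMBED_DIM))]))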

Conclusion

This blog post barely scratches the surface of representation learning, which is an active area of machine learning research (along with the closely related field of transfer learning). For an extensive, technical introduction to representation learning, I highly recommend the “Representation Learning” chapter in Goodfellow, Bengio, and Courville’s new Deep Learning textbook. For more information on word2vec, I recommend checking out this nice introduction by the folks over at DeepLearning4J.

 

Michael first developed his data crunching chops as an undergraduate at Auburn University (War Eagle!) where he used a number of different statistical techniques to investigate various aspects of salamander biology (work that led to several publications). He then went on to earn an M.S. in evolutionary biology from The University of Chicago (where he wrote a thesis on frog ecomorphology) before changing directions and earning a second M.S. in computer science (with a focus on intelligent systems) from The University of Texas at Dallas. As a Senior Software Engineer at Red Hat, Michael is constantly looking for ways to use the latest and greatest machine learning technology to improve and optimize business strategy.