Our past Technical Chair interviewed Agustín Azzinnari, Research Engineer at Tryolabs, about some of his recent work applying deep learning techniques to product categorization over a large hierarchical taxonomy for the leading e-commerce site in Latin America.
In a recent paper Google claimed that training models together is better than ensembling them. You train RNNs for product text descriptions and CNNs for product images. Have you seen an improvement by co-training the models?
AA) Both approaches are sensible; the choice depends on the tradeoff one is willing to accept between model performance and computational resources.
The first option (i.e. training each of the networks independently and then performing a weighted average of the results) will be faster, as the objective is simpler and the training procedure will both have smaller networks to work with and use less data. On the other hand, the resulting model will not be able to leverage the combination of product image and titles, as each network is independent of the other.
The second option will naturally be slower, as the resulting network will be bigger and more data will have to be processed, but it allows the model to make decisions based on both the image and the title. This means, for instance, making sense of a word (e.g. a brand name) based on the contents of the product image.
While early tests showed an improvement in accuracy, the gain was ultimately offset by the additional computational resources needed, so we opted to train two independent models and combine their results with a weighted sum.
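The weighted combination described above can be sketched as follows. This is a minimal, self-contained illustration; the weights, the number of categories, and the probability values are made up, not the ones used in the actual system.

```python
def ensemble_predict(text_probs, image_probs, w_text=0.6, w_image=0.4):
    """Combine class probabilities from two independently trained
    models (an RNN over titles, a CNN over images) with a weighted
    sum, then pick the highest-scoring category.

    The weights here are illustrative placeholders.
    """
    combined = [w_text * t + w_image * i
                for t, i in zip(text_probs, image_probs)]
    return combined.index(max(combined))

# Toy softmax outputs over 3 categories from each model:
text_probs = [0.2, 0.5, 0.3]   # RNN over the product title
image_probs = [0.6, 0.3, 0.1]  # CNN over the product image
print(ensemble_predict(text_probs, image_probs))  # category index 1
```

In practice the weights themselves can be tuned on a validation set rather than fixed by hand.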
Both text and image data have a spatial dependency. Why do you think convolutions work better for images and RNNs for text? Have you tried RNNs on images and CNNs on text?
AA) RNNs are focused on processing sequential data, where the input may be consumed one step (or more) at a time. Under this scheme, RNNs are a natural fit for natural language, as they are able to extract long-lived dependencies over the data.
CNNs, on the other hand, act as filters that can process images at a higher level, allowing the extraction of complex features from pixels. The spatial locality present in vision tasks is exploited by the convolutions to generate dense representations of different parts of an image.
That is not to say that they cannot be used the other way around. In fact, RNNs have been used indirectly for image processing (for instance, when generating captions for images or when processing videos), and CNNs for natural language processing (e.g. in the well-known paper by Collobert and Weston, which attempts to solve several NLP-related problems through the use of word embeddings and CNNs).
In the task at hand, we found no need to try these other approaches, as we were performing a straightforward product classification.
Embedding algorithms like word2vec seem to produce interesting linear semantic relations like king ≈ man + (queen − woman). Do you have interesting stories to share about supermarket products?
AA) Sadly, linear regularities such as the ones seen in word2vec are not so clear, mainly because the resulting embedding space is much more complex when dealing with products.
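The kind of linear regularity being referred to can be shown with a tiny, self-contained sketch. The 3-dimensional "embeddings" below are made-up toy vectors chosen so the classic relation holds; real word2vec vectors have hundreds of dimensions and are learned from data.

```python
import math

# Toy vectors (invented for illustration) in which the classic
# regularity king - man + woman ≈ queen holds.
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.1, 0.1],
    "woman": [0.5, 0.1, 0.9],
    "queen": [0.9, 0.8, 0.9],
    "apple": [0.1, 0.2, 0.3],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Vector arithmetic: king - man + woman
target = [k - m + w for k, m, w in
          zip(vecs["king"], vecs["man"], vecs["woman"])]

# Nearest word to the result, excluding the query terms:
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

With product embeddings the analogous arithmetic rarely lands on a clean answer, which is the point being made above.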
One related result that proved interesting, however, was the embeddings built by using users’ navigational patterns as input to word2vec. By building sentences of products visited by users (e.g. each word is a product the user visited, with the sentence being a given session) and applying word2vec over them, we obtained very good representations for products.
These representations look exclusively at how the products are navigated, but nonetheless get to capture a great deal of information about the items. Products that are usually visited one after the other will have similar representations (e.g. all iPhones will be nearby on the embedding space), which can then be used as input to additional models.
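The session-as-sentence idea can be sketched as follows. The navigation log, session ids, and product ids below are invented for illustration; the resulting lists of product ids are what would then be fed to a word2vec implementation (e.g. gensim's `Word2Vec`) in place of ordinary sentences.

```python
from collections import defaultdict

# Toy navigation log: (session_id, product_id) pairs in visit order.
# All identifiers here are made up.
log = [
    ("s1", "iphone_6"), ("s1", "iphone_6s"), ("s1", "iphone_case"),
    ("s2", "galaxy_s7"), ("s2", "iphone_6s"),
    ("s3", "iphone_6"), ("s3", "iphone_case"),
]

# Group visits by session: each session becomes a "sentence" whose
# "words" are the product ids the user visited.
sessions = defaultdict(list)
for session_id, product in log:
    sessions[session_id].append(product)

sentences = list(sessions.values())
print(sentences[0])  # ['iphone_6', 'iphone_6s', 'iphone_case']
```

Training word2vec on such sentences gives products that co-occur in sessions similar vectors, without ever looking at their titles or images.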
Deep Learning embeddings seem to produce interesting semantic relations. What is the practical application of those embeddings? How do you translate them to revenues?
AA) Embeddings are usually just a byproduct of the deep learning models themselves, but that doesn’t mean they cannot be used for different applications in a sort of transfer learning scheme.
For instance, if we were to take the weights of the last layer of a network used to perform product classification, these vectors can act as embeddings for the different product categories, where similar categories will have similar representations. This could prove useful to, e.g., recommend additional items or categories to the user.
We can also look at it the other way around, taking the representation of a product after it has passed through all the layers of the network and looking at the activations of the last hidden layer. This mapping of the data into a lower-dimensional space yields a dense representation of the items with nice properties (such as similar items being located nearby), which can be used for other purposes.
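A minimal sketch of extracting such a representation, assuming a toy two-layer network with made-up weights: the hidden activations are kept as the product's embedding, while the final layer is used only for classification.

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def dense(x, weights, bias):
    """One fully connected layer: y = Wx + b."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Toy 2-layer network; the weights are invented for illustration.
W1, b1 = [[0.5, -0.2], [0.1, 0.9]], [0.0, 0.1]  # input  -> hidden
W2, b2 = [[1.0, -1.0]], [0.0]                   # hidden -> class score

product_features = [0.8, 0.3]

# The hidden activations double as a dense, reusable embedding:
embedding = relu(dense(product_features, W1, b1))
# The final-layer output is only used for the classification task:
score = dense(embedding, W2, b2)

print(embedding)
```

With a real framework the same effect is obtained by reading out an intermediate layer of the trained classifier, so the embedding comes for free from the classification model.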
Of course, translating them into actual revenue is a matter of finding the right application for them, whether it be product classification or recommendation.
You seem to use navigation patterns to augment your embeddings. Any thoughts about using reinforcement learning for taking actions to interact with the customer?
AA) It really depends on what you want to do. Reinforcement learning is often seen as a way for an agent to perform actions in an environment according to a certain policy, which is useful when the environment behavior is difficult to model or unknown (e.g. when one cannot predict how it will respond to a stimulus).
Dealing with users is usually more straightforward, as only one aspect of the interaction is considered at a time. For instance, in the case of recommender systems, the objective is to model the users' likes and dislikes, and a direct way to measure the model's prediction quality could simply be the increase in sales. In this case, we don't need to learn a policy that can later be tested and evaluated against the user, as we (probably) have the historic data at hand.
Agustín Azzinnari is a Research Engineer at Tryolabs, a company dedicated to machine learning consulting and development. He currently leads the development of core NLP and Machine Learning functionalities for one of the company’s biggest teams. He holds an Engineering degree in Computer Science with a strong focus on AI and NLP. His areas of research include distributional representation of words and clustering on big data.