Why relying on customer feedback is critical to building machine learning models
Imagine you’re taking a photo of your friends for your next Instagram post. Do you prefer a photo where they’re slightly awkwardly posed, or a candid “in-the-moment” shot that captures a genuine expression?
The majority of our customers come to Getty Images for exactly that: to find truly authentic, real images that read and feel “natural” on their social media and across their websites.
Having repeatedly heard requests for more “authentic,” less “stock-like” imagery, my data science peers and I set out to build a machine learning model to predict whether an image is likely to be considered authentic.
We started by working with our internal experts—a.k.a. our Creative Researchers—to create training data sets of more and less “authentic” imagery.
Samples from the more “authentic” training set:
Samples from the less “authentic” training set:
With the aid of thousands of these kinds of images, we trained a deep neural network via transfer learning to classify images into these two collections. Starting with an open-source object detection model, we unfroze the higher-level layers and fine-tuned the model on our “authentic” versus “stock-like” data. Retraining those layers let the model focus on the overall aesthetic of the image, including lighting and color, and “forget” object-specific attributes. To validate the results, we built a tool for our Creative Researchers to confirm each machine prediction or flag the images the model was misclassifying.
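To make the freeze-and-fine-tune pattern concrete, here is a minimal sketch, not Getty’s actual training code: the “backbone” is a stand-in random projection playing the role of frozen pretrained layers, and only the new classification head is trained on synthetic labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a fixed projection whose weights stay
# frozen (hypothetical -- the real backbone was a deep CNN for object detection).
W_backbone = rng.normal(size=(64, 16))

def features(x):
    """Frozen lower layers: W_backbone is never updated."""
    return np.tanh(x @ W_backbone)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data standing in for image embeddings and editor labels.
X = rng.normal(size=(200, 64))
y = (X @ rng.normal(size=(64,)) > 0).astype(float)  # 1 = "authentic"

# New task head, trained from scratch; only these parameters move.
w_head = np.zeros(16)
b_head = 0.0

F = features(X)  # backbone is frozen, so features can be precomputed once
lr = 0.2
for _ in range(300):
    p = sigmoid(F @ w_head + b_head)
    w_head -= lr * (F.T @ (p - y) / len(y))  # gradient w.r.t. head only
    b_head -= lr * np.mean(p - y)

p = sigmoid(F @ w_head + b_head)
loss = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

Because the backbone is frozen, its features only need to be computed once; in a real fine-tuning run you would instead mark the chosen layers as non-trainable and let the optimizer update the rest.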
We were curious how the transfer-learned model differed from the base model, so we used a saliency method to uncover which pixels the convolutional neural network placed the most importance on. Compared with the base object detection model, our model was successfully considering pixels beyond just objects: the most salient pixels included people’s eyes (i.e., subjects looking at the camera) and studio light flares. The model seemed to have learned aspects of authenticity.
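The simplest saliency method, vanilla gradient saliency, scores each pixel by the magnitude of the gradient of the model’s output with respect to that pixel. A toy, self-contained sketch with a tiny stand-in network (not the production CNN, where autograd would compute the gradient):

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny stand-in "network": one hidden tanh layer, scalar score.
# (Hypothetical weights -- the real model was a deep CNN.)
W1 = rng.normal(size=(25, 8)) * 0.5
w2 = rng.normal(size=(8,)) * 0.5

def score(x):
    return np.tanh(x @ W1) @ w2

def saliency(x):
    """Vanilla gradient saliency: |d score / d pixel| for each input pixel."""
    h_pre = x @ W1
    # Backprop through tanh: d/dx [tanh(x W1) . w2] = W1 @ (w2 * (1 - tanh^2))
    grad = W1 @ (w2 * (1.0 - np.tanh(h_pre) ** 2))
    return np.abs(grad)

x = rng.normal(size=(25,))   # a flattened 5x5 "image"
s = saliency(x)
heatmap = s.reshape(5, 5)    # high values = pixels the score depends on most
```

Overlaying such a heatmap on the input image is what lets you see whether the network is attending to eyes and light flares rather than object outlines.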
Currently, our image editors use this model in their manual review process by reviewing images that are more likely to be authentic in bulk, saving them time and helping them focus on the highest quality content.
We experimented with using the authenticity scores to boost content in our search algorithm, but we weren’t seeing the desired lift in engagement, so we set out to investigate. Turning to previous user research studies, we uncovered a treasure trove of specific descriptions of what our customers really meant by “authentic” imagery:
- “Natural pictures, which look more like every day and less agency model”
- “More pictures with natural surroundings, not so placed”
- “Need more real images, no Photoshop”
- “Anything that is studio lighting, I won’t use”
- “I don’t want someone that looks too model-like”
- “In this image, he is holding his phone in an unnatural way”
- “I don’t like overly cheesy smiles”
- “It’s best if they aren’t looking at the camera”
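For context, a score-boosting experiment like the one described above can be sketched as a simple blend of text relevance and the model’s authenticity score. The field names and the weight `alpha` below are illustrative, not Getty’s actual ranking function:

```python
# Hypothetical sketch: re-rank search results by blending relevance with an
# authenticity score. Both scores are assumed to be normalized to [0, 1].

def boosted_rank(results, alpha=0.3):
    """Sort results by a weighted blend of relevance and authenticity."""
    def blended(r):
        return (1 - alpha) * r["relevance"] + alpha * r["authentic"]
    return sorted(results, key=blended, reverse=True)

results = [
    {"id": "img1", "relevance": 0.90, "authentic": 0.10},
    {"id": "img2", "relevance": 0.85, "authentic": 0.95},
    {"id": "img3", "relevance": 0.60, "authentic": 0.99},
]
ranking = [r["id"] for r in boosted_rank(results)]  # → ["img2", "img3", "img1"]
```

With `alpha=0` the blend reduces to pure relevance ordering, which makes it easy to A/B test the boost against the baseline.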
As data scientists in constant communication with creative experts, we knew that authentic and stocky imagery had multiple layers, but these specific notes from our customers made us rethink how to model them. We compiled what we were hearing into a list of “wants”:
- Diversity of people
- Natural environments/scenarios
- No posing
- No awkward facial expressions
- No looking at camera
- No cheesy smiles
- Not too model-like
- Capture people “in the moment”
- Lighting should feel authentic
- People shouldn’t be elitist
- Imperfect hair
- No cliches
- No stereotypes
As you can see, each of these is subjective: something that looks authentic to one person could easily look overly posed and stocky to another, largely depending on their cultural background, upbringing, or the societal context they live in. For example, “no stereotypes” could mean you don’t want to see a picture of a male CEO, but that stereotype could still sit comfortably within a perfectly “natural environment/scenario.” Recognizing this, we chose to optimize for eliminating the overly stocky images that most people would agree on, and to leave the nuanced cultural judgments about “authentic” imagery to our users. In other words, we wanted to optimize for high recall of stocky images, so they didn’t bog down the customer experience.
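Optimizing for high recall on the stocky class can be as simple as calibrating the score threshold on labeled positives, accepting some extra false positives in exchange. A minimal sketch with illustrative data:

```python
import math

def threshold_for_recall(scores, labels, target_recall=0.95):
    """Highest threshold whose recall on the positive (stocky) class meets
    the target. labels: 1 = stocky, 0 = authentic."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    if not positives:
        raise ValueError("no positive examples to calibrate on")
    k = math.ceil(target_recall * len(positives))  # positives we must flag
    return positives[k - 1]                        # k-th highest positive score

# Illustrative model scores and editor labels, not real output.
scores = [0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
t = threshold_for_recall(scores, labels)
recall = (sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
          / labels.count(1))
```

Lowering the threshold this way trades precision for recall, which matches the goal here: better to flag a borderline image for review than to let clearly stocky content bog down results.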
With computer vision models, we have to rely primarily on the pixels, not the larger context outside of the image, so we focused on dimensions we could reasonably detect that were largely correlated with the broader concept of “authentic” and “inauthentic” imagery. In the current version of the model, we rely on a combination of features informed by what we were hearing from customers.
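One simple way to fuse such dimension-level signals is a weighted average. The feature names and weights below are hypothetical, chosen to echo the customer quotes above, not the production feature set:

```python
# Hypothetical sketch: combine per-dimension detector scores (each in [0, 1])
# into one "stocky" score. Names and weights are illustrative only.

FEATURE_WEIGHTS = {
    "looking_at_camera": 0.30,  # "It's best if they aren't looking at the camera"
    "studio_lighting":   0.30,  # "Anything that is studio lighting, I won't use"
    "cheesy_smile":      0.25,  # "I don't like overly cheesy smiles"
    "posed_composition": 0.15,  # "not so placed"
}

def stocky_score(features):
    """Weighted average of per-feature scores; missing features count as 0."""
    return sum(FEATURE_WEIGHTS[name] * features.get(name, 0.0)
               for name in FEATURE_WEIGHTS)

candid = {"looking_at_camera": 0.1, "studio_lighting": 0.0,
          "cheesy_smile": 0.2, "posed_composition": 0.1}
studio = {"looking_at_camera": 0.9, "studio_lighting": 0.8,
          "cheesy_smile": 0.9, "posed_composition": 0.7}
```

Keeping the dimensions separate like this also makes the model easier to debug: when a customer disagrees with a score, you can see which signal drove it.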
Knowing that customer input was critical, we brought customers into the machine learning process early on. We showed them our training data sets before training a model, to help us improve those sets. For example, one customer pointed out that an image of a sun flare outdoors, while a natural phenomenon, felt unnatural to them because the photographer had emphasized the flare so much that it felt forced. Knowing this changed how we consider light flare in the model and training data. In another conversation, we showed a customer an image of a woman who did not have a big, toothy, cheesy smile. The customer noted that her expression and emotion were genuine, but her pose was unnatural for a garden setting. This informed how we were thinking about smiles in our model and training data.
So what does this all mean for the future of machine learning? First off, that there’s immense (untapped) potential when it comes to using machine learning models for highly subjective tasks. It also means that putting results of early versions of models in front of customers is critical to get the necessary feedback to inform future features and model selection.
Here’s one more lesson learned: as a data science team, we find that working closely with our user researcher counterparts helps us understand customer expectations more deeply. When a result falls short of those expectations, we can dig one layer deeper into what the customer is reacting to, and that insight feeds directly into our machine learning models. The real magic, as we like to say, comes from sharing detailed nuggets of those customer insights rather than rolled-up, top-level concepts, because the language customers use to describe their expectations shapes how the data scientist thinks about the problem, too.