Mirages in Data | MLconf - The Machine Learning Conference

For a few years I lived in a place nicknamed ‘The Valley of the Sun.’ It was hot. On particularly hot days, mirages would appear in the heat radiating brutally up from the earth. The hotter the day, the more intense the mirage. Mirages exist in data too, and similarly, the more intense the pressure, the more likely they are to appear.

The pressure that creates mirages in data comes from:

Controversial and impactful topics
Leadership stating a position or desired result
External results supporting leadership’s statements
Desire to advance your career

The result is conclusions drawn from data that does not support them. Alrighty, if that’s the recipe and mirages exist, let’s see if we can create one. I teach an Applied Machine Learning class at UC Berkeley with 15 students, some of the hardest working and most talented students I’ve worked with. With the students’ permission to share, this is what we did in one session. They were unaware of being in an experiment at the time.

As the lecturer, I kicked off a conversation with “In tech, why are a higher percentage of men promoted to leadership positions than women?” The students thoughtfully explored the question.
As the leadership in this situation, I then shared past research I had done in this area.
External sources were cited, reinforcing the differences.
A competition was then set up among the students. They were given a valuable and unique data set of a company’s payroll, performance evaluations, and employee gender. In groups, they were told to explore the impacts of gender and formulate a conclusion. Students would then vote on the best conclusion. The group with the most votes would ‘win.’

The groups took a variety of approaches including linear regression and visualizations of decision trees. Their approaches were standard and well coded. The conclusions were:

We found that women had greater bonuses and greater managerial assessments compared to men, and there was a larger effect on the bonus for women. [4 members, 2 votes]
We could not verify a gender difference. [4 members, 0 votes]
Just based off of salary and bonus alone, we could predict the gender of the employee. [with 100% accuracy] [3 members, 9 votes]
We did not see that gender was a significant factor. [3 members, 3 votes]

The winning conclusion was ‘Just based off of salary and bonus alone, we could predict the gender of the employee.’

Here’s the gotcha. The data was fake. I generated it with this notebook, and gender is randomly assigned. Most members of the two groups with the right conclusion (no conclusion was the conclusion) did not vote for their own result. The shock on the students’ faces was heart-wrenching. I doubt they’ll forget that lecture, and I hope it gives them the confidence in the future to push back and refuse to see mirages in data, even when leadership wants them to.
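To make the winning "100% accuracy" claim concrete, here is a minimal sketch of one way such a result can arise. It is not the notebook from the class, and I am assuming (the groups' evaluation method is not described above) that the model was scored on the same rows it was fit to. The generator `make_fake_employees`, its salary and bonus distributions, and the nearest-neighbor memorizer are all hypothetical stand-ins chosen for illustration.

```python
import random

def make_fake_employees(n, seed=0):
    """Hypothetical generator: pay is drawn independently of gender."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        salary = rng.gauss(100_000, 15_000)  # assumed distribution
        bonus = rng.gauss(10_000, 3_000)     # assumed distribution
        gender = rng.choice(["F", "M"])      # coin flip, unrelated to pay
        rows.append(((salary, bonus), gender))
    return rows

def predict_1nn(train, features):
    """Return the label of the closest training row (a pure memorizer)."""
    def sq_dist(row):
        (salary, bonus), _ = row
        return (salary - features[0]) ** 2 + (bonus - features[1]) ** 2
    return min(train, key=sq_dist)[1]

def accuracy(train, rows):
    """Fraction of rows whose gender the memorizer gets right."""
    hits = sum(predict_1nn(train, feats) == label for feats, label in rows)
    return hits / len(rows)

data = make_fake_employees(400, seed=42)
train, held_out = data[:300], data[300:]

train_acc = accuracy(train, train)        # each row is its own nearest neighbor
held_out_acc = accuracy(train, held_out)  # rows the model has never seen

print(f"accuracy on rows the model has seen: {train_acc:.2f}")
print(f"accuracy on held-out rows:           {held_out_acc:.2f}")
```

A flexible enough model can memorize random labels perfectly, so accuracy on seen rows says nothing by itself. The held-out number is the one to look at, and with coin-flip labels it should land close to chance.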

Let’s reflect. Students in data science curricula are handed dataset after dataset containing known conclusions. This exercise shakes students out of the bubble of expecting to find conclusions in data. The catch-22 of this exercise is that students’ work improves whether or not they know some data in the curriculum is faked! If they don’t know and this exercise is performed, then they learn the above lesson. If they DO know at least one exercise will contain fake data, they will always be asking: is this the fake data set? The uncertainty will force students to be more thorough in their analysis and to start with the question, should a conclusion be drawn from this data?
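One simple way to ask "should a conclusion be drawn from this data?" is a permutation test: shuffle the group labels many times and see how often a gap as large as the observed one shows up by chance. The sketch below is my own illustration, not something the students ran. The function name `permutation_p_value` and the bonus figures are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical bonuses: both groups come from the same distribution.
rng = random.Random(7)
bonus_women = [rng.gauss(10_000, 3_000) for _ in range(100)]
bonus_men = [rng.gauss(10_000, 3_000) for _ in range(100)]

p_value = permutation_p_value(bonus_women, bonus_men)
print(f"p-value for the difference in mean bonus: {p_value:.3f}")
```

A large p-value means the observed gap is the kind of thing random labeling produces all the time, which is a reason to report no conclusion. A small one is not proof of an effect either, but it is at least a reason to keep looking.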

When it comes to industry, the follow-on actions are less clear. Should the ability to say ‘the data does not support that conclusion’ be valued so highly that this exercise becomes an interview question? What are the time savings and computational resource savings, and how many outcomes would change, if ‘no conclusion is the conclusion’ became more accepted? Is there value in a mirage giving leadership false confidence in their beliefs, so they can confidently execute towards their vision? If an individual data scientist reports that no conclusion is the conclusion, won’t they just be replaced, at a cost to their career, by someone who can see the mirage? In any particular instance, fighting a mirage may be a fool’s endeavor. But across the data fields, I believe we need to hold firm, or when we do see a legitimate oasis to guide our companies towards, leadership will say no, that’s a mirage.

Photo Credit: vs148/Shutterstock.com and dkfindout.com

Appendix. One of my favorite aspects of data science is how it can unite ideas from a variety of fields. Here are links to more in-depth treatments of ideas touched on in this post. The concept of setting an expected conclusion is core to cognitive bias in psychology. A specialization of cognitive bias, anchoring, explores setting expectations. The Data Science Ethics Podcast on anchoring nicely covers the definition, impact, and high level action items. Asking multiple groups the same question and getting different answers is replication, which affects psychology, biomedicine, and potentially all scientific fields.

I encourage replication of these results in additional data science classes and workshops. The generating notebook and a sample data set are included. If you are interested in running this exercise, please stay in touch. I would love to hear about your variations and results!
