Kick Off
Every day, data models run by various companies are sorting you into categories. Rarely do you get to see their findings on you. So, it’s kind of cool that there’s a way to get inside Facebook’s data model. People are paying attention to this because of the media firestorm over how Facebook’s news feed algorithm seems to be biased against conservative political news. You can go to facebook.com/ads/preferences and see exactly what the social media company thinks about you. (Full explainer here.) As more companies integrate data models into their businesses, I expect we will see a broader push, perhaps even legislation, requiring this sort of transparency.
We also continue to see Facebook dealing with backlash over possible political leanings in its news feed. On Friday the company announced it is reducing the role of human editors even further, in favor of having algorithms select trending topics. It’s fascinating that tech companies get to deflect concerns about subjective human choices by pointing to their machines and models; at the same time, people should remember that humans wrote the rules encoded in many of these models in the first place.
In the News
“Is big data in big trouble?” asks TechCrunch. The answer is that some data companies like Tableau are overhyped, and now that they are missing their earnings forecasts, some investors are disappointed. At the heart of this, I think, is an over-investment in Hadoop-backed technologies. Hadoop moved the needle a lot in data analysis, but even as it is becoming outmoded (by Spark and other platforms that handle distributed computing better), companies are still doubling down on it. I think the Hadoop train will come to a halt at some point.
On Amazon Kinesis: Analyzing streaming data is a crucial part of bringing machine learning to bear on real-world problems. Amazon has had a system called Kinesis for a while now, but it officially took the “Beta” label off of it this week. Amazon has done a good job of serving both developers (who can build powerful applications on Kinesis infrastructure) and business analysts (who can query data streams using SQL in real time). The ideas in Kinesis aren’t new, but Amazon’s implementations are rock-solid, and it has done a good job integrating the service with all of the other infrastructure services Amazon provides. A nice writeup here.
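To give a flavor of what “querying a stream with SQL” means, here’s a toy Python sketch of a tumbling-window aggregation — the kind of continuous `COUNT(*) ... GROUP BY` query a streaming-SQL engine runs over arriving events. This is my own illustration of the concept, not Amazon’s API; the event format and function name are made up for the example.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed-size windows and count per key.

    This mimics the kind of continuous aggregation you would express in
    streaming SQL: a COUNT(*) grouped by key over a tumbling time window.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # bucket start time
        counts[window_start][key] += 1
    return {w: dict(per_key) for w, per_key in counts.items()}

# Example: click events as (unix_timestamp, page) pairs.
events = [(0, "home"), (3, "home"), (7, "cart"), (12, "home"), (14, "cart")]
print(tumbling_window_counts(events, window_seconds=10))
# window starting at 0: {"home": 2, "cart": 1}; at 10: {"home": 1, "cart": 1}
```

A real streaming engine does this incrementally as events arrive and expires old windows; the batch version above just shows the grouping logic.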
It’s hard competing with the biggest tech companies, as Rackspace, an early cloud-computing company that went public in 2008, has found. This week it went private again.
In Industry
I’ve been following self-driving cars with interest in this newsletter because I think they’re one of the most visible ways we’re seeing artificial intelligence and data models turn into products that will be right in front of normal consumers. So, I wanted to note that the world’s first network of self-driving taxis launched this past week in Singapore. How this network fares will be held up as an example in many driverless car debates, I’m sure.
You all have experienced the clogged-up Internet, where downloads take forever. One driver of the problem is all the images and streaming movies we are watching. It’s cool to see how Google is turning to artificial intelligence to work on better image compression.
This is a great series of charts showing which companies and industries are investing the most in artificial-intelligence research and patents, and which types of AI are most commonly being pursued. Fujitsu? Who knew.
In Research
A cool paper on figuring out whether A causes B or B causes A from observational data alone. Normally this is hard to do because you have to actually run experiments to see if A is really causing B or just correlated with it. This paper contains some new techniques for inferring the direction from the data itself, by seeing how noise in A affects noise in B.
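The flavor of this family of techniques can be sketched in a few lines of Python. This is my own toy version, not the paper’s method: fit a model in each direction and prefer the direction where the residual noise looks independent of the input (here, crudely, where the residual variance stays flat across the input’s range).

```python
import numpy as np

def direction_score(x, y, degree=3, bins=10):
    """Fit a polynomial y ~ f(x) and measure how much the residual
    variance changes across x. In the true causal direction the noise
    is independent of the input, so this ratio stays near 1."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    # Split x into equal-count bins and compare residual variance per bin.
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)
    bin_vars = np.array([resid[idx == b].var() for b in range(bins)])
    return bin_vars.max() / bin_vars.min()  # ~1 means homoscedastic

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, 5000)
b = a ** 3 + rng.normal(0, 0.05, 5000)  # ground truth: A causes B

forward, backward = direction_score(a, b), direction_score(b, a)
print(forward, backward)  # forward score is much closer to 1
```

The forward fit recovers the cubic and leaves uniform noise behind, while the backward fit (a polynomial chasing a cube root) leaves residuals whose size depends heavily on the input — the asymmetry that gives the direction away.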
The professor José Daniel García, at Universidad Carlos III de Madrid, is working on some cool stuff: his REPARA project aims to automatically rewrite programs so that they can run on multiple graphics cards simultaneously. This has been a consistent interest of mine: how can we build systems that take software that’s already written, figure out how the programs work, and then automatically rewrite them so that they run much faster? I think of it as AI for computer programming itself.
I’m finding myself increasingly interested in statistical natural language processing, where computers try to understand written text by processing huge volumes of data (like news articles), but without any prior knowledge of language or grammar. So, I was excited to see that the Google Brain team released model code for its cutting-edge news article summarizer. The models are constructed in TensorFlow, Google’s open-source machine learning framework. It’s encouraging to see so much AI research being done out in the open!
I really liked this technical blog on optimization techniques. In particular I loved the following graphics by Alec Radford showing how different methods of optimization interact with different kinds of functions:
Animations that may help your intuitions about the learning process dynamics. Top: Contours of a loss surface and time evolution of different optimization algorithms. Notice the “overshooting” behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. Bottom: A visualization of a saddle point in the optimization landscape, where the curvature along different dimensions has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSprop proceed. Images credit: Alec Radford.
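The saddle-point behavior described above is easy to reproduce in a few lines. Here’s a minimal sketch (my own toy example, not Radford’s code) comparing plain SGD with RMSprop on the saddle f(x, y) = x² − y², starting a hair off the saddle point; the learning rates and starting point are arbitrary choices for illustration.

```python
import math

def grad(x, y):
    # f(x, y) = x^2 - y^2: a saddle, curving up along x and down along y.
    return 2 * x, -2 * y

def run_sgd(steps, lr=0.05, x=1.0, y=1e-3):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * gx
        y -= lr * gy
    return x, y

def run_rmsprop(steps, lr=0.05, x=1.0, y=1e-3, decay=0.9, eps=1e-8):
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = decay * vx + (1 - decay) * gx * gx
        vy = decay * vy + (1 - decay) * gy * gy
        # Dividing by sqrt(v) boosts the effective learning rate where
        # gradients are tiny -- exactly the escape direction at a saddle.
        x -= lr * gx / (math.sqrt(vx) + eps)
        y -= lr * gy / (math.sqrt(vy) + eps)
    return x, y

_, y_sgd = run_sgd(30)
_, y_rms = run_rmsprop(30)
print(abs(y_sgd), abs(y_rms))  # RMSprop has moved much farther off the saddle
```

SGD’s step along y is proportional to the tiny gradient there, so it crawls; RMSprop’s per-coordinate normalization makes the step size roughly the learning rate regardless of gradient magnitude, so it escapes quickly.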
Quirky Corner
Dumbing down data – One of the risks of bubbles in certain parts of our economy is that you get a bunch of people working in those areas who become “experts” but, in fact, don’t really know what they are doing, and would be quickly exposed as charlatans if there were less money floating around the area. There’s a degree to which I think this has happened in the data space. How many “data scientists” are there now who couldn’t have passed an advanced math degree program? This book by Cathy O’Neil, a former math professor at Barnard, exposes lots of the shoddy, non-mathematical thinking that gets used by data scientists catering to businesses, and argues that the mainstreaming of data leads to bad results for society, including more inequality. Worthwhile read.
Keep some skepticism when you hear businesses pitching you familiar things under fancy new labels. Om Malik, the tech writer and founder of GigaOm, had a good essay in The New Yorker this week about how all these new terms (“artificial intelligence,” for instance) are just a continuation of things we’ve long been doing.
Biking, not golf – I often marvel at how many fewer of my peers golf than in my parents’ generation. So check out this article, which says that cycling is the new golfing for the tech industry. I’d totally prefer two hours on my bike to time on a golf cart.
What’s happening at Ufora
My colleague Ronen Hilewicz gave a talk at the Women in Machine Learning and Data Science meetup in New York about scaling up machine learning algorithms. His main point? You shouldn’t have to rewrite your program to get it to work with huge datasets. You can watch his talk here. (Password is 5LjHhe6N)
We are up to our necks in consulting work, so I am changing this newsletter to a monthly update. That means I’ll bring you the very best stuff I see at the end of each month. I’m excited to correspond with any of you individually anytime.
Braxton McKee is the technical lead and founder of Ufora, a software company that has built an adaptively distributed, implicitly parallel runtime. Before founding Ufora with backing from Two Sigma Ventures and others, Braxton led the ten-person MBS/ABS Credit Modeling team at Ellington Management Group, a multi-billion dollar mortgage hedge fund. He holds a BS and MS in Mathematics and an MBA from Yale University.