Kick Off:
There’s a new chip out from scientists at MIT. The “Swarm” chip can speed up some algorithms many hundreds of times, and it does it by automatically splitting programs into many tiny pieces that run simultaneously on its “swarm” of processors. The best part is that programs don’t need extensive rewriting to use the new chip. This is an amazing accomplishment: normally, programmers have to completely redesign a piece of software to get these kinds of speedups.
Swarm caught my eye because some of the technical approaches it uses are similar to approaches we’ve implemented in Pyfora. In both cases, the goal is to achieve huge speedups without programmers having to do a lot of work. In Pyfora’s case, we did it by taking regular Python programs and disallowing certain kinds of operations that are hard to speed up. In Swarm’s case, they built a new chip architecture that’s incredibly efficient at handling some of these same issues.
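To make the "many tiny pieces" idea concrete, here is a toy sketch in plain Python using only the standard library. It is not Pyfora's API and not how Swarm works internally; the function names (`is_prime`, `count_primes`, `count_primes_split`) and the `pieces` parameter are made up for illustration. The point is only that when a function has no side effects, a range of work can be cut into independent chunks whose results are combined afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Pure function: reads and writes no shared state."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(lo, hi):
    """Count primes in the half-open range [lo, hi), sequentially."""
    return sum(1 for n in range(lo, hi) if is_prime(n))


def count_primes_split(lo, hi, pieces=4):
    """Same answer as count_primes, computed as independent chunks.

    Because count_primes has no side effects, the chunks can run in
    any order (or at the same time) and simply be added together.
    """
    step = max(1, (hi - lo + pieces - 1) // pieces)
    bounds = [(start, min(start + step, hi)) for start in range(lo, hi, step)]
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        return sum(pool.map(lambda b: count_primes(*b), bounds))


if __name__ == "__main__":
    print(count_primes_split(0, 10_000, pieces=8))
```

In standard CPython, threads won't actually make this CPU-bound loop faster; the sketch only shows the shape of the decomposition. Doing the split automatically, and making it pay off, is the hard part that systems like these take on.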
I don’t know how long it will be before Swarm chips are commonplace, but there are some practical ideas in their implementation that I plan on incorporating into my own work in Pyfora. More importantly, our approaches are somewhat complementary – so if they do make Swarm chips available, running Pyfora on top of them could produce some truly amazing results. Congrats to the MIT team – it’s some beautiful work! There’s a very nice technical writeup about it here.
In the News:
The Russian government is trying to record everything going across the Russian internet and wants telecom and internet companies to make all user communications available to the government. Not only are there ethical questions, it’s quite doubtful that this is even technologically possible. It’s not clear that Russian ISP infrastructure can store so much data, and most web traffic is encrypted by software not in the control of the ISPs, so they can’t decrypt it. I’d hate to be the federal security agents that Putin just mandated get this done within two weeks!
You hear about Internet security a lot. And if you read this newsletter regularly, you hear about quantum computing a lot. They may not seem connected, but if quantum computing becomes a reality, it will be powerful enough to decrypt secure internet transmissions. Google, of course, is preparing for that early. Here’s a good Verge story.
Snapchat is building more advanced photo search into its tools. This newest one has “object recognition” in it. The interesting point: it runs on your phone, not in the cloud, which will be a major selling point for Snapchat users who have very private photos on their phones. What I think is interesting is that this is one of the first uses of the new deep learning technology running on a phone. Google extended its machine-learning technology TensorFlow to run on iOS relatively recently.
In the world of open source software, Mozilla – the maker of the Firefox browser – is building a tool to ingest the web-link-graph and provide recommendations. This is a bit like Google without an explicit search function. It will instead deduce your searches and proactively give you recommendations. Since Mozilla is driven by a desire to keep the web open, it makes me wonder who will own all this data processing infrastructure and the data that comes out of it. Will that be open too? In any case, I am sure there will be some exciting technology to come out of this project, and since Mozilla is so committed to openness, we will hopefully all get to use it!
In Industry:
Can data solve the same-day grocery conundrum? It’s become common wisdom that it’s hard to turn a profit with same-day grocery delivery and many a start-up has failed at this. But Instacart says that data analysis is the key. Here’s a good interview with Jeremy Stanley, Instacart’s vice president of data science, about why.
Genetic data can be very helpful in medical research and lucrative, too. That DNA company – 23andMe – that was started by the ex-wife of a Google founder has been selling data about its customers to drug companies, MIT Technology Review writes.
Unrelated to 23andMe, Microsoft is experimenting with using DNA as a storage device. That means DNA might someday replace the flash memory or hard disk in your computer. Interesting stuff.
You may not think about it while you are jamming out, but data analysis is behind much of the music industry’s recent comeback. This story does a good job explaining the Music Genome Project, which turned music into structured data, and more recent steps by Spotify to “deconstruct, analyze, and categorize music.”
Quirky Corner:
Just look in people’s eyes. That’s what Google plans to do with technology it’s building that uses artificial intelligence to spot common diseases simply by scanning eyeballs.
Uber as the new big brother? The company will use data to monitor its drivers.

Braxton McKee is the technical lead and founder of Ufora, a software company that has built an adaptively distributed, implicitly parallel runtime. Before founding Ufora with backing from Two Sigma Ventures and others, Braxton led the ten-person MBS/ABS Credit Modeling team at Ellington Management Group, a multi-billion dollar mortgage hedge fund. He holds a BS (Mathematics), MS (Mathematics), and M.B.A. from Yale University.
This Week in Data by Braxton McKee (CEO, Ufora) & MLconf Alumni Speaker, Issue #4
Kick Off:
The ethics debate on artificial intelligence is heating up. You’ve got all the big tech leaders chiming in on it. This week, Satya Nadella, Microsoft’s CEO, wrote in Slate that “the most productive debate we can have isn’t one of good versus evil: The debate should be about the values instilled in the people and institutions creating this technology.” And Eric Schmidt wrote an essay in Fortune basically saying, don’t freak out.
One of the ways people are trying to get around AI gone wrong is by saying that AI is just going to support things we already do, so that a human can ensure the AI is behaving. I think this kind of control will prove to be an illusion: when things are going wrong, AI will have to make decisions without a human check, and these are the times when the choices can be the hardest. A good example is the difficult decisions driverless cars face. What happens when a Tesla gets in a wreck, as occurred this week? What should a driverless car do if it has to choose between saving its driver and saving bystanders? This is a hard, foundational problem for AI: how can it make choices that are inherently moral? In particular, because there is a time constraint during a crash, it doesn’t have a chance to ask a human what to do. MIT has a good new study out in Science Magazine about people’s preferences for what driverless cars should do. The researchers found that the public generally has a utilitarian point of view on the matter. Good explainer here.
This Week:
Many banking regulations in the last twenty years have required financial institutions to dig into the backgrounds of their customers. One way they did that was by subscribing to databases that track suspected terrorists and other people. This week one of the major databases of that information leaked, reminding everyone using data in their businesses that it might be hacked any time. Now that’s some sensitive data.
Speaking of sensitive data: stolen medical data is trading on the dark web. I spoke on a panel organized by Swissnex on Thursday night, and another panelist argued that “data itself has no intrinsic value. It’s like air.” Illicit datasets trading at $400,000 on the dark web, however, seem to argue the opposite. Maybe data is only truly valuable when very few other people have it.
In the most basic form, companies “collect” data from users all the time. And Facebook won an important ruling in Belgium allowing it to continue collecting data from people who are not users of its site. This has big repercussions for many different sorts of web sites.
Along these lines, Google took a step this week towards showing us what data it has about us.
In Industry:
Data in jails. On Thursday, the White House launched a justice initiative to oversee how data is used to figure out who should be in jail, and ultimately to avoid overcrowded jails. This is bubbling up at the state level, too. In Wisconsin the Supreme Court is set to rule on a case about whether a computer algorithm can be used to determine likelihood of repeat offenses, a factor in sentencing decisions.
All the mapping applications out there – everything from Waze to many car navigation systems – got news this week. Google is adding lots of new satellite data to Google Maps.
For data scientists in every industry: here’s a great blog about a new ML technique called lda2vec for summarizing text in a way that’s not only usable by computers but where the model results can be interpreted by people. It’s a great writeup and has some nice diagrams that give you a good sense of how this stuff actually works.
Quirky Corner:
U.S. Customs wants to know your Twitter handle.
What’s happening at Ufora:
I was part of a couple of great gatherings this week. On Tuesday, I presented at the Artificial Intelligence meetup in New York. It was a great crowd interested in Pyfora, our open source data platform. The other talk was by neuroscientist Jeremy Freeman, who gave a great overview of recent advances in neural nets. On Thursday, I joined a smart panel about the use of data and data science in finance hosted by Swissnex. Based on the audience questions, I’d say that anxiety around privacy and the ethical use of data is running high.

This Week in Data by Braxton McKee (CEO, Ufora) & MLconf Alumni Speaker, Issue #3
Watching BREXIT odds live on “betfair”. Here, predicting a slight edge for “exit” a few minutes before the clinching votes come in.
Kick Off:
It was a crazy end of the week for many of our clients because the markets went into turmoil after the Brexit vote. Above is a screenshot I took Thursday night while the votes were being counted. It shows the betting markets. The results were predicted there many minutes before the BBC called the election. It’s interesting because we often use data to try to predict asset prices, but here, the prices themselves can be predictive of future events. Currently, the betting markets have Hillary winning the election at slightly less than 75%. Given what I saw on Thursday, I think this is where I’m getting my live election coverage!
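For readers wondering how a betting price turns into a percentage like the one above, here is a minimal sketch. It assumes decimal odds, the convention exchanges such as Betfair display, where the implied probability is one divided by the odds. The function names and the sample prices are hypothetical and are not taken from the screenshot.

```python
def implied_probability(decimal_odds):
    """Convert decimal odds (e.g. 4.0) into an implied probability (0.25)."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


def normalize(probabilities):
    """Rescale implied probabilities so they sum to 1.

    Quoted prices usually sum to slightly more than 100% because of the
    market's margin, so this gives a rough margin-free estimate.
    """
    total = sum(probabilities)
    return [p / total for p in probabilities]


if __name__ == "__main__":
    # Hypothetical two-outcome market: a favorite at 1.35 and an underdog at 3.60.
    raw = [implied_probability(1.35), implied_probability(3.60)]
    print([round(p, 3) for p in raw])
    print([round(p, 3) for p in normalize(raw)])
```

Proportional rescaling is only one simple way to strip out the margin; it is used here as an assumption, not as the method any particular market or forecaster applies.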
In the News:
Kudos to engineers at the University of California, Davis, who announced this week their creation of the world’s first 1,000-processor chip. This will have a great capacity for crunching large parallel datasets.
Social security payments for robots? Yep, only in Europe. There’s a draft motion in the European Parliament that would require companies that hire advanced robots to pay social security and other benefits on them. Given how many companies are using robots to cut costs, this sort of movement could change that equation. I just want to know: who gets to collect the benefits?
In Industry:
So as a data practitioner, I generally lean towards preferring to have more data about things. But there are times the consumer in me comes out, and this instance of a company tracking 100 million phone users without consent is one of those times. It’s great to see the rules being enforced.
Dr. Google is here. 1% of all searches on Google are about medical symptoms, and Google said this week that it will begin giving out health advice based on what you’re searching for.
It’s not just doctors who should fear what artificial intelligence is going to do to their jobs – it’s also housekeepers. Elon Musk’s A.I. researchers are creating robots to do household chores.
I’ve been following the news out of the LIGO Observatory in Louisiana. This year, the big news has been the two gravitational waves the observatory detected, the first observations of their kind. This week, the observatory said that the waves came from two big suns that formed 12 billion years ago. The part I find most interesting is the complex data simulations these researchers are using to figure all this out. They use what is called a “synthetic universe,” which is a computer model that maps out stars colliding. This turns out to be hard to do: the wave detector produced a bunch of data that researchers had to analyze, but before the synthetic universe model, it was not very clear how to analyze the data, since they didn’t know what they were looking for. “Synthetic universe” is neat because it models both the stars colliding and the detector, so researchers can tell what they should be looking for in the detector! It’s a great example of how some datasets require really sophisticated techniques to make sense of them.
A shout-out to the Fintech Innovation Lab, which Ufora was a part of last year. The NYC initiative graduated its latest group of companies this week. You can read about them here.
Quirky Corner:
Let’s just reflect on the name of the machine learning company that Twitter just bought: Magic Pony Technology. Yes, that’s right. Maybe I need to rebrand.
What’s happening at Ufora:
I’ll be speaking at two events this week, so please come out and support the home team. First, I’m giving a talk about our open-source technology Pyfora at the New York A.I. meetup on Tuesday. The other speaker is a neuroscientist and his talk looks pretty great, so this should be a fun evening. Secondly, I’ll be a panelist on The Power of Big Data, an event hosted by Swissnex. You need tickets for this one, and I have a few, so let me know if you’d like to attend.

This Week in Data by Braxton McKee (CEO, Ufora) & MLconf Alumni Speaker, Issue #2
Kick Off:
As the world grieves the shootings in Orlando, an important ethics question is gaining attention in the technology and data world. ISIS has been very active on social media and other online sites recruiting followers, so how can data modeling be used to detect impending terrorist attacks? A study released in Science Magazine on Thursday provided some interesting answers. The researchers analyzed data collected from websites where ISIS is active. They were able to identify escalation in activity before ISIS attacks. This piece last year in the MIT Technology Review gives an inside look at how ISIS uses the Internet, and I suspect we will see more examination of this.
Given that people affiliated with ISIS are starting to livestream on Facebook after their attacks, it would be interesting to see if their online activity before the attacks would have given hints of their planned actions. It seems only a matter of time until the technology industry is asked to play a bigger role in this fight. With so much of it playing out online, data will play a key role.
In the News:
There’s the big Microsoft acquisition that everyone’s talking about – its $26 billion purchase of LinkedIn – but there’s another smaller one that’s interesting too. On Thursday, the company announced it had acquired a 7-person company called Wand Labs. This is a company that specializes in using voice commands to run apps, removing the need to tap open and operate apps on your smartphone. It’s part of a broader push in the tech industry to use bots to let users control things by talking (think of Amazon’s Alexa, Facebook’s Messenger and Microsoft’s Cortana). But it’s also part of a transition in the industry away from experiences that are set within one app, or even within one device, into experiences that criss-cross through all the technology you are using. As this good Fast Company analysis of the Wand acquisition says, it’s about integrating “disparate apps and services via a conversational layer.”
It’ll be interesting to see how the different tech giants approach this transition into a more integrated world. As Farhad Manjoo of The New York Times smartly noted of Apple this week, Apple has prized its devices, but “many of its competitors have been moving beyond devices toward experiences that transcend them. These new technologies exist not on distinct pieces of hardware, but above and within them.”
The way all of this new technology works, of course, is through artificial intelligence that is trained on masses of data. To me, one of the best ways to follow this trend will be through watching the company Viv. What is Viv? Read this great John Battelle essay.
On Thursday, Google announced a new artificial intelligence and machine learning research center. Though Google is already doing lots in this space, this heralds a bigger, more coordinated investment. Hopefully, all this investment will result in new open-source technology for the broader machine learning community, as it did with Google’s powerful technology TensorFlow.
$50 million just disappeared. On Friday, some of the people warning about the reliability of a new virtual currency project were proven right when a hacker zipped away with $50 million from it leaving only a taunting message. The Decentralized Autonomous Organization seems to be a cautionary tale now. After computer scientists warned about the holes in the currency system in May, one of its founders told Nathaniel Popper of The Times: “Of course this venture is fraught with risks” but “this technology represents the future of the Internet.”
In Industry:
Clean data. It’s easy to get wrapped up in all the cool ways we can model data and make it run faster. But the data itself that you input is so critical to the reliability of your findings. This is a thoughtful piece on how to think about the data that you input and how to help it be as clean as possible.
Transportation & AI:
You’ve heard a lot about driverless cars. But how about driverless buses? With this one, the robot not only drives the bus but also serves as tour guide. That’s what IBM is experimenting with in Washington DC and Miami, and not only is it AI-driven, the buses are also printed on a 3-D printer. A full explanation here.
Also…
Data and simulation are widely used in industrial engineering to design physical objects. A new advance in computational fluid dynamics (which is used to design things that interact with fluids, like propellers or chemical plants) makes it possible to more accurately simulate the physics at the boundaries between fluids and other objects. It’s easy to get excited about all this flashy new artificial intelligence. But we shouldn’t forget that massive computing power and data analysis have been driving industrial processes for a long time, and advances here can be quite valuable.
Quirky Corner:
Facebook is about to start tracking what stores you go into. This is so that advertisers get a sense of whether you are buying something after seeing their ad. Turn off the location services part of the Facebook app if you don’t want to be a Facebook data point.
And Amazon is working on training its virtual assistant, Alexa, to recognize emotions. So even if the humans aren’t comforting you when you are distressed about something, perhaps Alexa will.
What’s happening at Ufora:
Our colleague Alexandros Tzannes, who normally works remotely, was in town last week at our offices here in NYC. In addition to working on a number of our client projects, Alexandros spearheads our GPU computing effort (you can see some of his recent code here) which is making some exciting strides and will be ready for wider consumption later this summer. It was great to have him in New York!

This Week in Data by Braxton McKee (CEO, Ufora) & MLconf Alumni Speaker
Kick Off
There’s been a lot of handwringing about the algorithms driving what we see related to political news. First, a month ago, there was concern that Facebook’s news feed results were biased against conservative news, and this week, there’s concern that Google favors Hillary Clinton in its autocomplete suggestions in its search engine. (See this YouTube video that went viral on that. The video shows that Google seems to have suppressed the appearance of “Hillary Clinton indictment” in favor of “Hillary Clinton India,” even though data shows people search for information on Clinton’s indictment more than information on Clinton and India. It points out that the executive chairman of Google’s parent company, Eric Schmidt, is a big Clinton supporter and that Google has many ties to her as well.)
Search engines and algorithms decide what’s relevant on these sites in very complicated ways, and the public generally doesn’t know when they get tweaked. Generally speaking, people want artificial intelligence. It’s making life much easier in many ways. But who is to say that AI isn’t going to conclude that one candidate is just better? Any given AI may simply conclude that Trump or Clinton or Sanders, for that matter, is unfit to be president. We live in a world where the media tries to cover candidates equally (it may not seem that way with some outlets, but generally there is an effort at media companies to give equal coverage). An AI model wouldn’t generally be programmed to bend over backwards to treat candidates evenly. Though, apparently, Google’s search engine is programmed to never autocomplete to something related to a crime, which some say is why “indictment” doesn’t come up with Hillary Clinton.
My view on all this is that these news feeds and searches are a type of public utility and that it is unacceptable for corporate agendas to tinker with the results we see to further their own interests. That said, I do not think Google and Facebook need to go into the algorithms and change them in ways to make them seem unbiased. If the behavior of the public in search queries drives these algorithms to give biased results, so be it. But it’s got to be based on the public, not on corporate interests.
In the News
Google announced that it had made progress in quantum computing. Interestingly, it’s a different technique, called “analog quantum computing” that tries to borrow less from regular digital computing in its pursuit of the new technology. To me, it’s also a tacit acknowledgement that “D-Wave”, the quantum computing company that Google invested in, has failed to live up to the hype. This article says as much in passing.
Microsoft released a research paper showing that it has figured out how to predict whether people will have pancreatic cancer based on their search queries. This reminds me a bit of when Google studied search data to predict flu outbreaks. But the difference here is that the Microsoft research gets at something far more personal. It also raises an interesting ethical question: what should Microsoft do if its models predict someone has cancer? Is Microsoft obligated to tell them? Good summary here.
It’s somewhat amusing watching the lions of technology stoop to personal insults to support their views about artificial intelligence. As you all have likely read, Elon Musk is concerned that AI may ruin the world. On the other side of the debate, Eric Schmidt says that’s not going to happen. So this week, what did Schmidt say to get a leg up on their dispute? That Musk is not a “computer scientist” and is only an “engineer,” so he doesn’t know what he’s talking about.
In Industry
In our open-source corner of the world, Doug Cutting, the creator of Hadoop, gave an interview this week where he talks about the performance improvements we can expect to see from “XPoint,” a new memory chip being produced by Intel that will allow much faster access to much larger datasets. Cutting talks about how Cloudera and Hadoop will benefit from this new hardware. But these new chips are part of a steady trend towards big-data computing where data resides “in memory,” where it is faster to access than when it’s on disk. Personally, I can’t wait to run the open-source data platform I work on (Pyfora) on top of this hardware!
Chip wars: NVIDIA has been getting a lot of positive attention for its chips, and I’ve written in the past about the shift from CPUs to GPUs (a big marker in that story was earlier this year when a computer beat a world champion at Go). Something to watch for now are hybrid chips that combine features of the two kinds of computing devices, such as the Intel Xeon Phi. Here’s a thoughtful essay on the trade-offs in all these kinds of chips and where the industry is heading.
In cancer research, data analysis is coming up a lot. This week, UCLA researchers announced a method for using genetic sequences to more accurately tell cancer patients how their cancer is likely to turn out. And Vice President Biden spoke at a national oncology conference about the need for data sharing to crack the code on curing cancer.
Quirky Corner
There’s a new movie out, completely written by artificial intelligence. And, coming soon, is AI songwriting.
Tesla now knows whether an accident is your fault. I bet it’s not long before all cars do. This could be a big change in how the police and insurance companies handle blame in car collisions. So, that 1990s rust-bucket you have parked in your driveway may break down occasionally, but at least it won’t tattle on you.
What’s happening at Ufora
I was interviewed on the Talk Python podcast about our work on auto-scaling Python programs to thousands of cores using Pyfora. The show’s host, Michael Kennedy, asked me some great questions about the technology inside of Pyfora, and some of the work we’re doing now to speed up complex learning algorithms.
Also, we’re excited that as of this week, the fine folks at MLconf will be sharing this newsletter with their audience. Welcome MLconf fans!

MLconf SEA Speaker Suggested Papers
We recently asked the speakers of MLconf SEA 2016 to share their favorite papers with the MLconf audience. We hope you find this list interesting and educational!
Avi Pfeffer, Principal Scientist, Charles River Analytics
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund University, Germany
Open-World Probabilistic Databases
Ismail Ilkan Ceylan, Adnan Darwiche, Guy Van den Broeck
http://web.cs.ucla.edu/~guyvdb/papers/CeylanKR16.pdf
Incremental Knowledge Base Construction Using DeepDive
Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, Christopher Ré
http://www.vldb.org/pvldb/vol8/p1310-shin.pdf
Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization
Martin Jaggi
http://m8j.net/math/revisited-FW.pdf
Deep Symmetry Networks
Robert Gens, Pedro M. Domingos
http://homes.cs.washington.edu/~pedrod/papers/nips14.pdf
Admixture of Poisson MRFs: A Topic Model with Word Dependencies
David I. Inouye, Pradeep Ravikumar, Inderjit S. Dhillon
http://jmlr.org/proceedings/papers/v32/inouye14.pdf
Florian Tramèr, Researcher, EPFL
Adversarial Learning
Lowd & Meek, KDD, 2005
http://research.microsoft.com/pubs/73510/kdd05lowd.pdf
Practical Evasion of a Learning-Based Classifier: A Case Study
Srndic & Laskov, IEEE S&P, 2014
http://www.utdallas.edu/~muratk/courses/dmsec_files/srndic-laskov-sp2014.pdf
Can Machine Learning Be Secure?
Barreno et al, ASIACCS, 2006
http://www.cs.berkeley.edu/~tygar/papers/Machine_Learning_Security/asiaccs06.pdf
Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing
Fredrikson et al, USENIX Security, 2014
https://www.usenix.org/system/files/conference/usenixsecurity14/sec14-paper-fredrikson-privacy.pdf
Jason Baldridge, Associate Professor of Computational Linguistics, University of Texas at Austin
A Supertag-Context Model for Weakly-Supervised CCG Parser Learning
Dan Garrette, Chris Dyer, Jason Baldridge, and Noah Smith. (2015)
https://aclweb.org/anthology/K/K15/K15-1003.pdf
Hierarchical Discriminative Classification for Text-Based Geolocation
Ben Wing and Jason Baldridge. (2014)
http://aclweb.org/anthology/D/D14/D14-1039.pdf
A recursive estimate for the predictive likelihood in a topic model
James Scott and Jason Baldridge
https://github.com/utcompling/topicmodel-eval/blob/master/scott-baldridge-aistats13.pdf?raw=true
Amanda Casari, Senior Data Scientist, Concur Technologies
Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research
Eamonn Keogh & Jessica Lin, University of California, Riverside
http://www.cs.ucr.edu/~eamonn/meaningless.pdf
Antisocial Behavior in Online Discussion Communities
Justin Cheng, Cristian Danescu-Niculescu-Mizil, Jure Leskovec, Stanford University, Cornell University
http://arxiv.org/pdf/1504.00680v1.pdf
The Parable of Google Flu: Traps in Big Data Analysis
David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani
http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf
Erin LeDell, h2o.ai
Stacked Regressions
Leo Breiman. (1996)
http://dx.doi.org/10.1007/BF00117832
http://statistics.berkeley.edu/sites/default/files/tech-reports/367.pdf
Scalable Ensemble Learning and Computationally Efficient Variance Estimation (Doctoral Dissertation)
Erin LeDell (2015)
http://www.stat.berkeley.edu/~ledell/papers/ledell-phd-thesis.pdf
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, Jeff Dean (2015)
http://arxiv.org/abs/1503.02531
Understanding Random Forests: From Theory to Practice (Doctoral Dissertation)
Gilles Louppe (2014)
http://www.montefiore.ulg.ac.be/~glouppe/pdf/phd-thesis.pdf
Generalized Low Rank Models
Madeleine Udell, Corinne Horn, Reza Zadeh, and Stephen Boyd (2014)
http://arxiv.org/abs/1410.0342