MLconf SF 2016 saw many wonderful speakers that took us on a tour of the many business sensitive aspects of modeling Big Data: choosing the right problem to solve, how it’s not always about deep learning; logistic regression, SVMs, and gradient boosting still have a heavy hand to play in solving problems and deriving insights; and how many of the best business solutions require different approaches for different aspects of a problem. Causality of factors in a model is an elusive quality that speaks to many business goals and the core reasons for starting a query, solving a problem, creating a model. Alex Dimakis, at the most recent MLconf in San Francisco, outlined for us one computational framework for ascertaining causality.

**What is Causal Inference?**

There are many frameworks for investigating causality including philosophical, statistical, & machine learning ones. To name a few of the most popular frameworks, these include Granger causality (used widely in Neuroscience and Economics), Hume’s counterfactuals framework, and Pearl’s structural equation causality (which grew out of Fisher & Wright’s genetic heredity work) that utilizes graphical causality to represent our understanding of causality. The various frameworks for understanding causality can partly be understood as a product of the fact that causality is complex, sometimes evidence for a cause is different across contexts, and the fact that some causes are not deterministic, while others are probabilistic.

Statistically, causation has a history with interpretability of models and model construction. Causal interest in models related to the mechanism, or understanding how the mechanism relates to the phenomenon under study. This requires defining what we mean by mechanism and understanding what we know and how we know it in order to ascertain whether two observations are related/not related causally.

In statistical causality, we try to address the question of whether or not there is a dependency between two variables. Correlations are usually the first blunt tool used to asses whether or not two variables are related and many would argue that with sufficient amounts of data, causation isn’t necessary because vast amounts of data (petabytes, etc) increase reliability of correlations while drowning out noise. So, why would we care about ascertaining causality when we have so much data?

One motivation for causal analyses is that while correlations show how two variables increase or decrease together, sometimes that is all two variables are: two time-series with similar trends. In decision-making, we do not want to rely on two things that correlate by chance or have similar trends but are causally unrelated. A recent example of this sort of effect is Google’s searches /Flu tracker keeping pace with Center for Disease Control (CDC) estimates of flu incidence. Flu incidence and searches related to flu in the Google correlated and seemed to predict one another until they didn’t. There could be many reasons for this: searches being magnified by friends, neighbors, and relatives looking for information related to someone’s flu (but not their own) or the impact of news about the flu on search queries. We know that the Google search predictor of flu incidence is not causal, and it was useful for a time (not all predictors need to be causal in order to be useful), but in an ideal world we would understand the causal factors in order to make decisions.

In assessing independence of observations, we ordinarily consider two uncorrelated variables to be independent from each other (orthogonal). This assumption holds when the observations are Gaussian (continuous measured variables). This doesn’t work for categorical or event-based observations such as “having a cancer diagnosis”, “buying a coat”, “starting a war.” With these variables, you must consider the joint probability of the two observations co-occurring. In a rank one contingency table, one compares the observed rate at which the two variables are observed together against marginalized independently generated data. To check for independency, one checks to see if the matrix is well characterized by the product of the scalars generated randomly/independently. If the contingency table approximates the independently generated data, the variables are independent. If not, the events/variables may have a dependent relationship that ought to be explored.

**Exploiting Conditional Dependency Graphs in order to infer causality: Interventions**

Alex Dimakis introduces Pearl’s structural equation framework for causal reasoning by introducing directed graphs first. Directed graphs can be used to represent conditional dependencies. For example, A-> B -> C, would represent variables A and C that have a relationship to each other conditioned on B. This is A and C are related to each other, but mediated or partially affected by variable B. Learning a directed graphical model is a first step to learning the causal relationships amongst variables. With enough data, one can learn all of the conditional independencies in the graph. Then, we can query the graph about directions of effects. For Pearl, the central question is about the direction of effects: did A cause B or did B cause A, or are these two variables not related to each other? However, we know that joint probability distributions on two random variables can be factorized in both ways and can be true in both the world that A causes B and when B causes A. So joint probability distributions are not enough. Therefore, we need an instrumental variable to intervene. An intervention can be a control group, or a known precondition, or a temporal factor that appears or is forced upon the observations before the event. In this case, if the observations are truly independent, your intervention cases will not differ from the other cases in event incidence because they are both randomly assigned. If you observe a difference, then you have a causal relationship between your two variables because they do not co-occur randomly.

One way to learn causal relationships of variables in a complex directed graph (Skeleton Graph) is to intervene on a set of variables (see graph representation). By intervening on a set of variables, one learns the directions of relationships between those variables intervened on and the edges of those variables connected that were not intervened on. After learning causal directions, one can design another intervention to learn more directions of the graph. There are two ways to proceed after this: either deciding a priori which order the interventions will be done, or deciding the order of interventions on the graph adaptively depending on the outcome of previous interventions (or a version of adaptive interventions by designing randomized adaptive interventions). This is an active area of research, including work done in Alex’s group, where the efficiency of learning causal directions of graphs is measured. Some groups have found that fixed interventions are enough to learn all you can from the Skeleton Graph about causal directions and adaptive interventions do not add more information, but there is some evidence to suggest that highly randomized adaptive interventions do result in better information. This is an active area of research.

The slides of Alex’s talk are available here and you can watch his full talk here. Dimakis recommends a course by Richard Scheines available online free of charge, here. If you are interested in learning more about this topic, Samantha Kleinberg, who spoke at MLconf NYC this year, has a book on the subject “Why: A Guide to Finding and Using Causes” and a video talk available here. MLconf has a code you can use to get this book with O’Reilly.

About the blogger: Ana Maria Fernandez is a PhD Candidate, Clinical and Molecular Medicine, at University of Edinburgh and an active member of the MLconf community.