Practical Causal Inference

Putting Judea Pearl's ideas into actual code

One of the criticisms of machine learning is that it’s only learning from correlations in data. However deep your neural network is, most of the patterns it’s matching are likely devoid of any true understanding of the latent factors behind the “why” of the data.

Current state of AI

It’s important to remember that AI is not magic. It is not some genie that can spontaneously solve all our problems. It all boils down to collections of nested statistical methods. If we want to use AI to solve complex problems like aging, global warming, or poverty, the algorithms are going to need a serious upgrade. However much neural networks are touted as being the next step on the road to AGI, there is still MUCH further to go.

One of these limitations is lack of causal reasoning. While machine learning typically focuses on prediction, causal inference relates to decision-making. If supervised learning is akin to classical conditioning, and reinforcement learning is akin to operant conditioning, causal inference is the ML equivalent of learning by reasoning.

For example, suppose we want to predict whether a user will continue to use our service in the next year, based on their behavior in the first month. We’d use machine learning techniques to figure this out, specifically a classification model trained on first-month behaviors. We could then narrow down a set of behaviors that highly correlate with staying with the service for 1+ years, which would help optimize marketing spending. The magic moment isn’t just predicting retention, it’s determining the set of first-month behaviors that causally drive retention. In Facebook’s case, if users who add more friends at the start are more likely to stay simply because they’re different kinds of people (e.g., more social, more interested in the product, more addicted to technology), then a strategic product decision to invest in early friending, made on these observational correlations alone, would yield suboptimal business results.
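
To make the purely predictive approach concrete, here is a minimal sketch using scikit-learn on synthetic data; the feature names (friends_added, sessions, posts) and the label-generating rule are hypothetical stand-ins for “first-month behaviors”, not anything from a real product.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical first-month behaviors for 5,000 made-up users
rng = np.random.default_rng(0)
n = 5000
first_month = pd.DataFrame({
    "friends_added": rng.poisson(5, n),
    "sessions": rng.poisson(20, n),
    "posts": rng.poisson(3, n),
})
# Synthetic retention labels that merely correlate with the behaviors
logits = 0.3 * first_month["friends_added"] + 0.05 * first_month["sessions"] - 2.5
retained = rng.random(n) < 1 / (1 + np.exp(-logits))

X_train, X_test, y_train, y_test = train_test_split(
    first_month, retained, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
# The coefficients rank behaviors that correlate with retention,
# but say nothing about which behaviors cause it
print(dict(zip(first_month.columns, clf.coef_[0])))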

We need machines to be able to reason causally, not just by association. Reasoning by association is a challenge that was largely solved decades ago with inventions like Bayesian networks, which automatically associate a potential cause with a set of observable conditions. Bayesian networks made it possible to conclude that, for a patient returning from Africa with a fever, the most likely diagnosis was malaria. The step we are on today is a glorified version of that first one: deep learning finds hidden regularities in large sets of data. All of deep learning’s impressive achievements amount to curve-fitting, meaning fitting a model to the data so we can make predictions.

Building truly intelligent AI systems will only be possible when we replace reasoning by association with causal reasoning. Instead of merely correlating fever with malaria, machines need to be able to reason that malaria causes fever. Once we have this kind of causal framework in place, it becomes possible to ask computers counterfactual questions, for example how a causal relationship would change given some intervention X, which is the cornerstone of scientific research.

Judea Pearl, who is credited with inventing Bayesian networks, recently wrote “The Book of Why”. In it, he divides causal reasoning into 3 levels:

| Level (symbol) | Typical Activity | Typical Questions | Examples |
| --- | --- | --- | --- |
| Level 1: Association, $p(y \mid x)$ | Seeing | What is? How would seeing $x$ change my belief in $y$? | What does a symptom tell me about a disease? What does a survey tell us about the election results? |
| Level 2: Intervention, $p(y \mid \mathrm{do}(x), z)$ | Doing | What if? What if I do $x$? | What if I take aspirin, will my headache be cured? What if we ban cigarettes? |
| Level 3: Counterfactuals, $p(y_x \mid x', y')$ | Imagining, Retrospection | Why? Was it $x$ that caused $y$? What if I had acted differently? | Was it the aspirin that stopped my headache? Would Kennedy be alive if Oswald hadn’t shot him? What if I had not been smoking for the past 2 years? |

Counterfactuals are the building blocks of scientific thinking, as well as legal and moral reasoning. Each of these levels has a syntactic signature that characterizes it. Associations are characterized by the conditional probability $p(y \mid x)$. We can use Bayesian networks, or any of the deep learning models that estimate such conditionals, to come up with associations. At the interventional level, we define the probability of event $y$ given that we intervene (Judea Pearl calls this the $\mathrm{do}()$ operator) and set $x$ to a particular value while observing $z$. We can estimate this experimentally, or analytically using the causal graph. At the counterfactual level, we have the probability of event $y$ had $x$ been some value, given that we actually observed $x = x'$ and $y = y'$. Judea Pearl’s book argues for a new type of mathematics (called “do-calculus”) to formalize these counterfactual computations. “The Book of Why” is a fantastic introduction to the idea of formalizing causal inference. I recommend it to anyone that wants a high-level understanding…
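
To make the difference between the first two levels concrete, here is a minimal simulation sketch (plain NumPy, with a toy confounder $z$ that I am assuming purely for illustration): the associational slope of $y$ on $x$ is large, while intervening on $x$ changes nothing.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
z = rng.normal(size=n)                 # hidden confounder
x = z + 0.1 * rng.normal(size=n)       # x is driven almost entirely by z
y = 2 * z + 0.1 * rng.normal(size=n)   # y is driven by z, not by x

# Level 1 (association): the slope of y on x looks like a strong "effect"
assoc_slope = np.polyfit(x, y, 1)[0]

# Level 2 (intervention): simulate do(x) by setting x independently of z,
# which cuts the arrow z -> x; y does not respond, so the slope is near 0
x_do = rng.normal(size=n)
y_do = 2 * z + 0.1 * rng.normal(size=n)
do_slope = np.polyfit(x_do, y_do, 1)[0]

print("associational slope:", round(assoc_slope, 2))    # roughly 2
print("interventional slope:", round(do_slope, 2))      # roughly 0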

…but…

…we can’t rely on just high-level details forever. At some point, you’re going to be curious about how to actually implement these ideas. Several groups and companies have already been working on tools for causal inference. Microsoft Research recently put together an experimental library called DoWhy (GitHub repo here).

Let’s go through an example use case. DoWhy can be installed from its GitHub repo (or via pip). For this problem we will use a synthetic dataset that DoWhy generates for us, with a known causal effect baked in. There are four main stages to using DoWhy:

Stage 1: Modelling (loading the data and hypotheses)

DoWhy models each problem using a graph of causal relationships. The current version of DoWhy supports two formats for graph input: gml (preferred) and dot. The graph can encode prior knowledge of the causal relationships among the variables, but DoWhy does not make any immediate assumptions beyond it. For the model, we first specify the dataset that we’re using, then produce a graphical model of the data.
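
For a sense of what the graph input looks like, here is a hypothetical minimal GML string for a tiny graph with one treatment v, one outcome y, and one common cause X0. The layout follows the GML examples in the DoWhy documentation, but it is an illustration I am assuming, not output from this walkthrough.

# A hypothetical minimal causal graph in GML: X0 -> v, X0 -> y, v -> y
gml_graph = """graph [
    directed 1
    node [ id "X0" label "X0" ]
    node [ id "v" label "v" ]
    node [ id "y" label "y" ]
    edge [ source "X0" target "v" ]
    edge [ source "X0" target "y" ]
    edge [ source "v" target "y" ]
]"""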

Our dataset is going to have a linear relationship with a true slope of $\beta = 10$. We are going to populate it with 10,000 samples, each of which has 5 common causes and 2 instrumental variables.

import dowhy
from dowhy import CausalModel
import dowhy.datasets
from IPython.display import Image, display

# Generate a synthetic dataset with a known causal effect (beta = 10)
data = dowhy.datasets.linear_dataset(beta=10,
        num_common_causes=5,
        num_instruments=2,
        num_samples=10000,
        treatment_is_binary=True)
df = data["df"]
print(df.head())
print(data["dot_graph"])
print("\n")
print(data["gml_graph"])

# Build the causal model from the dataframe and the GML graph
model = CausalModel(
        data=df,
        treatment=data["treatment_name"],
        outcome=data["outcome_name"],
        graph=data["gml_graph"]
        )
model.view_model()
# view_model() saves the rendered graph as causal_model.png
display(Image(filename="causal_model.png"))

From this, we can get the following data:

| id | $Z_0$ | $Z_1$ | $X_0$ | $X_1$ | $X_2$ | $X_3$ | $X_4$ | $v$ | $y$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.446421 | -0.251085 | 0.303230 | -1.135984 | -1.054753 | -1.755098 | 0.0 | -5.488247 |
| 1 | 1.0 | 0.841206 | -0.467667 | 1.445933 | -0.565382 | -0.860269 | 0.139650 | 1.0 | 12.158423 |
| 2 | 1.0 | 0.894300 | -2.391849 | 1.364717 | 0.446133 | -0.613055 | -0.165108 | 0.0 | -5.132429 |
| 3 | 1.0 | 0.522393 | -0.019511 | 0.923259 | -0.283176 | -1.048390 | -1.941862 | 1.0 | 7.464227 |
| 4 | 1.0 | 0.147642 | 0.358300 | -0.036608 | 0.296040 | -2.827254 | -0.590971 | 0.0 | -4.639088 |

and the following graph of the proposed relationships between the variables and the treatment.

This graph represents the assumptions we have about the dataset. $Z_0$ and $Z_1$ represent the instrumental variables: they influence the treatment, but affect the outcome only through it.

Stage 2: Identification (what kinds of causal models can be built from this?)

Using the input graph, DoWhy finds all possible ways of identifying the desired causal effect based on the graphical model. It can ignore the data at this stage and use only the graph of the relationships. It uses graph-based criteria and do-calculus to find expressions that can identify the causal effect.

identified_estimand = model.identify_effect()
print(identified_estimand)
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['X0', 'X2', 'X1', 'X3', 'Unobserved Confounders', 'X4']
WARNING:dowhy.causal_identifier:There are unobserved common causes. Causal effect cannot be identified.

WARN: Do you want to continue by ignoring these unobserved confounders? [y/n] y

INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:['Z0', 'Z1']

Estimand type: ate
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d                                
──(Expectation(y|X0,X2,X1,X3,X4))
dv                               
Estimand assumption 1, Unconfoundedness: If U→v and U→y then P(y|v,X0,X2,X1,X3,X4,U) = P(y|v,X0,X2,X1,X3,X4)
### Estimand : 2
Estimand name: iv
Estimand expression:
Expectation(Derivative(y, Z0)/Derivative(v, Z0))
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→Z0,Z1)
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→v, then ¬(Z0,Z1→y)

We now have our expression for the causal model. Since this is a simple linear model, we can describe it as a slope, $\beta$, to solve for:

$\beta = \frac{d}{dv}\,\mathbb{E}\left[\,y \mid X_0, X_1, X_2, X_3, X_4\,\right]$

We also have a second estimand, the instrumental-variable (iv) estimand, which does not rely on the unconfoundedness assumption of the first expression (which, since we generated this data out of thin air, we are less concerned about).
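
Because the backdoor expression above is just the derivative of a conditional expectation, and our data are linear, one way to sanity-check it outside of DoWhy is a plain least-squares regression of $y$ on the treatment and the common causes, reading $\beta$ off as the coefficient on $v$. This is a hedged sketch that assumes the column names shown in the dataframe above ($v$, $y$, $X_0$–$X_4$).

import numpy as np

# Fit y ~ 1 + v + X0 + ... + X4 by least squares and read off the slope on v,
# which should recover beta ~= 10 for this synthetic dataset
cols = ["v", "X0", "X1", "X2", "X3", "X4"]
features = df[cols].to_numpy(dtype=float)
design = np.column_stack([np.ones(len(df)), features])   # add an intercept
coeffs, *_ = np.linalg.lstsq(design, df["y"].to_numpy(dtype=float), rcond=None)
print("estimated beta (coefficient on v):", coeffs[1])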

We can add a parameter flag (proceed_when_unidentifiable=True) if we want to ignore the warnings about unobserved confounders. The same parameter can also be passed when instantiating the CausalModel object.

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['X0', 'X2', 'X1', 'U', 'X3', 'X4']
INFO:dowhy.causal_identifier:All common causes are observed. Causal effect can be identified.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]

Estimand type: ate
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d                                
──(Expectation(y|X0,X2,X1,X3,X4))
dv                               
Estimand assumption 1, Unconfoundedness: If U→v and U→y then P(y|v,X0,X2,X1,X3,X4,U) = P(y|v,X0,X2,X1,X3,X4)
### Estimand : 2
Estimand name: iv
No such variable found!

Our second estimand is now missing from the picture.

These expressions can then be evaluated using the available data in the estimation step. It is important to understand that identification and estimation are orthogonal steps.

Stage 3: Estimation (getting actual values for our model)

Now we can actually produce the causal estimate. DoWhy estimates the causal effect using statistical methods such as matching or instrumental variables. The current version of DoWhy supports estimation methods such as propensity-based stratification, propensity score matching, and linear regression. These combine regression techniques with estimates of the treatment response.

# Estimate the backdoor estimand with a linear regression estimator
causal_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression")
print(causal_estimate)
print("Causal Estimate is " + str(causal_estimate.value))
INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator
INFO:dowhy.causal_estimator:b: y~v+X0+X2+X1+X3+X4

*** Causal Estimate ***

## Target estimand
Estimand type: ate
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d                                
──(Expectation(y|X0,X2,X1,X3,X4))
dv                               
Estimand assumption 1, Unconfoundedness: If U→v and U→y then P(y|v,X0,X2,X1,X3,X4,U) = P(y|v,X0,X2,X1,X3,X4)
### Estimand : 2
Estimand name: iv
Estimand expression:
Expectation(Derivative(y, Z0)/Derivative(v, Z0))
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→Z0,Z1)
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→v, then ¬(Z0,Z1→y)

## Realized estimand
b: y~v+X0+X2+X1+X3+X4
## Estimate
Value: 10.00000000000073

Causal Estimate is 10.0

We now have a precise measure of the causal effect. Our variable $\beta$, representing the effect of the treatment on the outcome, has a value of 10, exactly the slope we used to generate the data.
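
Linear regression is not the only backdoor estimator DoWhy offers. Assuming the method names from the DoWhy documentation (for example, propensity score stratification), switching estimators is just a change of the method_name string; this is a sketch, and the exact name may differ across library versions.

# A sketch of swapping in a propensity-based estimator
strat_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_stratification")
print("Causal Estimate (stratification): " + str(strat_estimate.value))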

Stage 4: Verification (how stable is this model?)

Of course, we also want to test the validity of these assumptions about the variables. DoWhy provides several robustness checks we can use to verify the validity of the estimated causal effect.

For example, what if we added another random variable?

# Add an independent random common cause and check the estimate is unchanged
res_random = model.refute_estimate(identified_estimand, causal_estimate,
        method_name="random_common_cause")
print(res_random)
INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator
INFO:dowhy.causal_estimator:b: y~v+X0+X2+X1+X3+X4+w_random

Refute: Add a Random Common Cause
Estimated effect:(10.00000000000073,)
New effect:(10.000000000000728,)

Still seems to be relatively robust.

What if we replaced the given treatment with a random (placebo) variable?

# Replace the treatment with a permuted (placebo) treatment; the effect should vanish
res_placebo = model.refute_estimate(identified_estimand, causal_estimate,
        method_name="placebo_treatment_refuter", placebo_type="permute")
print(res_placebo)
INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator
INFO:dowhy.causal_estimator:b: y~placebo+X0+X2+X1+X3+X4

Refute: Use a Placebo Treatment
Estimated effect:(10.00000000000073,)
New effect:(-0.018041580350778713,)

We see a much bigger change now with the placebo! The value of $\beta$, as we previously formulated it, has collapsed to near 0.

But what if this causal inference is the result of an unusually influential data instance? Will the pattern still hold with a different subset of the data?

# Re-estimate the effect on a random 90% subset of the data
res_subset = model.refute_estimate(identified_estimand, causal_estimate,
        method_name="data_subset_refuter", subset_fraction=0.9)
print(res_subset)
INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator
INFO:dowhy.causal_estimator:b: y~v+X0+X2+X1+X3+X4

Refute: Use a subset of data
Estimated effect:(10.00000000000073,)
New effect:(10.000000000001288,)

As we can see, even a simplistic model like a linear regression estimator holds up well under these refutations on our synthetic data: the estimate barely moves when we add a random common cause or subsample the data, and it collapses, as it should, when the treatment is replaced with a placebo.


I enjoyed using this framework. It felt like the DoWhy team built it to be a combination of using Pandas and using Keras. Several parts of “do-calculus” become much easier with it. There are still a few features missing that any causal inference framework will eventually need. Reinforcement learning is one of the obvious areas that could benefit from causal inference: solving more complex environments that require puzzle-solving will not be possible without integrating some kind of causal reasoning. Unfortunately, DoWhy only supports tabular data that can be stored in pandas dataframes. While this does make it compatible with models developed in TensorFlow and PyTorch, it is still not quite at the stage where it can be used for reinforcement learning.

For more details on causal inference, I highly recommend the blog inFERENCe (inference.vc). The author made a wonderful guide available here.
