
Single-World Intervention Graphs in 10 Examples

Published: 2024-08-24
Updated: 2024-08-24

Sorry, this is a work in progress. I need to learn the underlying math better to understand what's going on in many places. My goal here is to show how it's possible to use the simplest possible derivation rules to go from an arbitrary SWIG to a valid formula for the counterfactual quantities of interest. It's not nice to rely on situation-specific results when there seems to be a fairly straightforward general framework that works in almost any situation.

In science, precise theory is necessary to make progress efficiently. Natural language and statistics alone create a mess extremely easily, even with top-notch data collection like randomized controlled experiments; these 'languages' must be just too vague to talk about the right thing (causality, and scientific knowledge in general).

I think causal diagrams should be thought of as the first mathematically rigorous language for aspiring scientific fields. Yes, you can't derive mechanistic differential equations like in physics, but mathematical rigour is much closer than that. Even if it were difficult to draw plausible graphs because knowledge is so thin, I think one should give it maximum effort; this effort alone is a way to highlight ignorance and take the knowledge forward. As the knowledge improves, old data can be analyzed in light of the new knowledge. This is just science.

A causal graph like a SWIG is a simplified visual interface to a mathematical theory. Although it's possible to use the causal theory without the corresponding graphs, the knowledge and reasoning become less clear. So in this post I try to collect examples where we have a graph (knowledge) and then derive a quantity of interest which is identifiable with observable data.

1. Basic ideas

Basic ideas SWIG

The graph represents a set of SWIGs, one for each specific value of $a$. Such a graph is actually called a single-world intervention template (SWIT), but they are usually called just SWIGs.

  • $A$ represents the observed 'natural' value of the (random) variable, and $a$ represents an intervention on the variable, setting it to a certain fixed value $a$. The arrow from $a$ reminds us that we assume $A$ to affect $Y$, even though $a$ is actually set to a fixed constant and cannot really be thought of as a cause in that particular single world anymore (it cannot vary). Other similar weird arrows can exist in SWIGs, unfortunately; I'll show some of them later.
  • $Y^a$ is the outcome (random variable) in this hypothetical world.
  • $U$ and $W$ are some independent causes of $A$ and $Y^a$ (these letters are used for unobserved variables by convention). As the number of variables grows, these would clutter the graph, so they are not normally drawn; they exist implicitly.

Okay, say that we want to know the effect of $A$ on $Y$, like $E[Y^{a=0}] - E[Y^{a=1}]$, which is the difference of the mean outcome between two hypothetical worlds, $a=0$ for all and $a=1$ for all. (We assume that these values are possible for everyone in our population [positivity].) To estimate this effect, we'd need to know $E[Y^a]$, but it is not observed. So what does the graph tell us that could bridge this to what we can observe?

Using the rules of d-separation, the graph tells us that $Y^a$ is independent of $A$, so we know that $E[Y^{a^{*}}] = E[Y^{a^{*}}|A=a]$: the mean outcome in the whole population under some intervention is the same as the mean outcome under that intervention in subsets given by observed values of $A$, that is, conditional on $A$.

What to do with $E[Y^{a^{*}}|A=a]$? Another powerful assumption is consistency. This means that $a^{*}$ is a sufficiently well-defined intervention on $A$ (not compatible with wildly different versions with different effects) and that the observed values of $A$ correspond to the intervened values. When this is the case, we know that the observed $Y$ is the same as $Y^{a^{*}}$ when $a^{*}$ equals the observed $A=a$. So, for example, $E[Y^{a=1}|A=1]$ is the same as $E[Y|A=1]$. In general, $E[Y^a|A=a] = E[Y|A=a]$.

So, we can estimate $E[Y^a]$ by estimating $E[Y|A=a]$, for example by modelling $E[Y|A]$ and then predicting $E[Y|A=1]$ and $E[Y|A=0]$.
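As a concrete sketch of that modelling step (with simulated data and hypothetical variable names, not any real example from this post):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration: a randomized binary treatment A.
rng = np.random.default_rng(0)
n = 10_000
A = rng.binomial(1, 0.5, n)
Y = 1.0 + 0.5 * A + rng.normal(0, 1, n)
data = pd.DataFrame({"A": A, "Y": Y})

# Model E[Y|A], then predict E[Y|A=1] and E[Y|A=0] for everyone.
model = smf.ols("Y ~ A", data=data).fit()
ey1 = model.predict(data.assign(A=1)).mean()
ey0 = model.predict(data.assign(A=0)).mean()

# Under exchangeability, positivity, and consistency these estimate
# E[Y^{a=1}] and E[Y^{a=0}]; the true difference here is 0.5.
print(ey1 - ey0)
```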

2. Common causes

Common causes SWIG

$L$ and $U$ are common causes of $A$ and $Y$. $U$ is unobserved, but some effects of $U$ are observed: $P_1$ and $P_2$. Causal common causes are called causal confounders, and variables that can be used as proxies of causal confounders are called surrogate or proxy confounders.

We are again interested in $E[Y^a]$.

Using the rules of d-separation, the graph tells us that $Y^a$ is independent of $A$ conditional on $L$ and $U$, or conditional on $L$, $W$, $P_1$, and $P_2$. This is because common causes open backdoor paths in the graph, and these can be closed by conditioning on them. We assume that every value of $A$ actually occurs in each conditioning subset (positivity). Unfortunately we can't condition on the unobserved variables, so we might decide to settle for $L$, $P_1$, and $P_2$. Conditioning can be shown on the graph with a box around the variable.

We can start here from the fact that $E[Y^{a^{*}}]$ can be recovered from the conditional expectation $E[Y^{a^{*}}|L=l, P_1=p_1, P_2=p_2]$ by averaging over all possible values of $(l, p_1, p_2)$:

$$\sum_{l, p_1, p_2} E[Y^{a^{*}}|L=l, P_1=p_1, P_2=p_2] \times P(L=l, P_1=p_1, P_2=p_2)$$

Next, we hope that $A$ is approximately independent of $Y^a$ given only $(L, P_1, P_2)$, so that the conditional expectation above should be close to the one in subsets of $A$ too, like $E[Y^{a^{*}}|A=a, L=l, P_1=p_1, P_2=p_2]$.

And assuming consistency, we can again arrive at $E[Y|A=a, L=l, P_1=p_1, P_2=p_2]$ for the conditional expectation:

$$E[Y^{a^{*}}] \approx \sum_{l, p_1, p_2} E[Y|A=a, L=l, P_1=p_1, P_2=p_2] \times P(L=l, P_1=p_1, P_2=p_2)$$

This method is a form of the g-formula, which can be represented in multiple ways (a numeric sketch follows the list):

  • Standardization (above): $\sum_l E[Y|A=a, L=l] \times P(L=l)$
  • Iterated conditional expectations: $E[E[Y|A=a, L]]$ (in general using simulation from fitted models)
  • Inverse probability weighting: $E\Big[\frac{I(A=a)\,Y}{f[A|L]}\Big]$ (in general using weighted model fitting or similar)
  • Doubly robust methods combining some of the above
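A minimal sketch of the first and third representations, with simulated data and hypothetical names (a single binary confounder $L$ standing in for $L$, $P_1$, $P_2$):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: one binary confounder L of treatment A and outcome Y.
rng = np.random.default_rng(1)
n = 50_000
L = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.3 + 0.4 * L, n)
Y = 1.0 + 0.5 * A + 1.0 * L + rng.normal(0, 1, n)
data = pd.DataFrame({"L": L, "A": A, "Y": Y})

# Standardization: model E[Y|A, L], then average predictions over P(L).
outcome = smf.ols("Y ~ A + L", data=data).fit()
std1 = outcome.predict(data.assign(A=1)).mean()
std0 = outcome.predict(data.assign(A=0)).mean()

# Inverse probability weighting: model f[A|L], then weight I(A=a) * Y.
propensity = smf.logit("A ~ L", data=data).fit(disp=0)
p1 = propensity.predict(data)  # P(A=1 | L)
ipw1 = np.mean((data.A == 1) * data.Y / p1)
ipw0 = np.mean((data.A == 0) * data.Y / (1 - p1))

# Both pairs estimate E[Y^{a=1}] and E[Y^{a=0}]; the true difference is 0.5.
print(std1 - std0, ipw1 - ipw0)
```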

3. Conditioning on common effects

SWIG missing…

The variables $S_1$, $S_2$, and $C^a$ are common effects of $A$ and $Y$, or of variables associated with them. When our observations are conditional (selected) on them or on their descendants ($S_3^a$), a backdoor path opens in the graph and the result is selection bias in the association of $A$ and $Y$.

We are again interested in $E[Y^{a^{*}}]$.

Using the rules of d-separation, a backdoor path is open through $S_2$, and it can be blocked by conditioning on $L$ or $K$. Similarly, the backdoor path via $S_1^a$ can be blocked by conditioning on $J$. However, we are out of luck with $S_3^a$; the path via $C^a$ will remain open whether or not we condition on $C^a$.

We check the same for the selected variables. $Y^a$ is independent of $S_2$ conditional on $K$, and independent of $S_1^a$ conditional on $J$ (the $a$ half of the split node $A \mid a$ is a constant, not a confounder here). $Y^a$ is independent of $S_3^a$ conditional on $C^a$.

We should get better data, but what's the closest we can get? We could ignore the selection bias caused by conditioning on $S_3^a$.

Start from the weighted average again: $\sum E[Y^{a^{*}}|K=k, L=l, J=j] \times P(K=k, L=l, J=j)$. Given the conditional independencies from the graph, we can again subset the conditioning further, since the expectation should be the same in these subsets:

$$\sum E[Y^{a^{*}}|A=a, S_1^a=1, S_2=1, K=k, L=l, J=j] \times P(K=k, L=l, J=j)$$

Assuming consistency and using the iterated-expectations representation, we could then write $E[E[Y|A=a, S_1=1, S_2=1, K, L, J]]$. And perhaps hope that $E[Y^{a^{*}}] \approx E[E[Y|A=a, S_1=1, S_2=1, S_3=1, K, L, J]]$.

4. Censoring

Censoring means missing outcome data, which again forces us to condition on being uncensored. This can cause selection bias as before.

Censoring SWIG

We are again interested in $E[Y^{a^{*}}]$.

Using the rules of d-separation, the graph tells us that $Y^a$ is independent of $A$ conditional on $L$, and independent of $C^a$ conditional on $L$.

Total probability, exchangeability (and positivity), and consistency propel us forward as before:

$$E[Y^{a^{*}}] = E[E[Y^{a^{*}}|L]] = E[E[Y^{a^{*}}|A=a, C^a=0, L]] = E[E[Y|A=a, C=0, L]]$$
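A minimal sketch of the last expression (iterated expectation fitted among the uncensored), again with simulated data and hypothetical names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: L affects treatment A, censoring C, and outcome Y;
# Y is only observed when C == 0.
rng = np.random.default_rng(2)
n = 50_000
L = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.3 + 0.4 * L, n)
C = rng.binomial(1, 0.1 + 0.3 * L, n)
Y = np.where(C == 0, 1.0 + 0.5 * A + L + rng.normal(0, 1, n), np.nan)
data = pd.DataFrame({"L": L, "A": A, "C": C, "Y": Y})

# E[E[Y | A=a, C=0, L]]: fit E[Y | A, L] among the uncensored only,
# then average the predictions over everyone's L (the outer expectation).
fit = smf.ols("Y ~ A + L", data=data[data.C == 0]).fit()
ey1 = fit.predict(data.assign(A=1)).mean()
ey0 = fit.predict(data.assign(A=0)).mean()
print(ey1 - ey0)  # true difference here is 0.5
```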

5. Missing data

Previously we thought of missing data as conditioning on a variable. We can also think of missing data as measurement error (for each variable separately), where a hypothetical fully observed variable causes the observed, partly missing variable through some missing-data mechanism, often labelled $R \in \{0, 1\}$.

Missing data SWIG

We are again interested in $E[Y^{a^{*}}]$.

Using d-separation, the graph tells us that $Y^a$ is independent of $A$ given $L$, and independent of $R_A$ given $L$. The only new thing we need to remember is that, if we subset to the non-missing data, we can use the partly missing variable $\bar{A}$ in place of its hypothetical fully observed source $A$, as these have the same values in the non-missing subset.

So the chain of inference could go like this:

$$E[Y^{a^{*}}] = E[E[Y^{a^{*}}|L]] = E[E[Y^{a^{*}}|A=a, R_A=0, L]] = E[E[Y^{a^{*}}|\bar{A}^a=a, R_A=0, L]] = E[E[Y|\bar{A}=a, R_A=0, L]]$$
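In code this looks just like the censoring sketch, except that we fit among the rows where $A$ is observed and use the partly missing $\bar{A}$ (simulated data, hypothetical names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: A is missing (R_A == 1) with probability depending on L.
rng = np.random.default_rng(3)
n = 50_000
L = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.3 + 0.4 * L, n)
Y = 1.0 + 0.5 * A + L + rng.normal(0, 1, n)
R_A = rng.binomial(1, 0.1 + 0.3 * L, n)
A_bar = np.where(R_A == 0, A, np.nan)  # the partly missing version of A
data = pd.DataFrame({"L": L, "A_bar": A_bar, "R_A": R_A, "Y": Y})

# E[E[Y | Abar=a, R_A=0, L]]: fit among rows where A is observed,
# then average predictions over everyone's L.
fit = smf.ols("Y ~ A_bar + L", data=data[data.R_A == 0]).fit()
ey1 = fit.predict(data.assign(A_bar=1)).mean()
ey0 = fit.predict(data.assign(A_bar=0)).mean()
print(ey1 - ey0)  # true difference here is 0.5
```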

6. Measurement bias

Measurement bias SWIG

If we have measurement error instead of just missing values, we have no convenient subsets to aim for. Dealing with measurement bias at the analysis stage may be possible by modelling the error, including correlation between errors (caused by some unobserved $W$) and measured sources of measurement error (here, the treatment $A$).

Measurement error in confounders also results in measurement bias, since the mismeasured confounder is not enough to control for confounding by the true confounder.

7. Immortal time bias

In progress…

8. Treatment-confounder feedback

Treatment-confounder feedback SWIG

$L$ is a time-varying confounder of the effect of the time-varying treatment $A$ on the end-of-follow-up outcome $Y$.

We are interested in $E[Y^{a_0, a_1}]$.

Start again from the law of total expectation in terms of the confounders: $\sum E[Y^{a_0, a_1}|L_0=l_0, L_1^{a_0}=l_1] \times P(L_0=l_0, L_1^{a_0}=l_1)$. Then we can factorize the joint probability as $P(L_0=l_0) \times P(L_1^{a_0}=l_1|L_0=l_0)$.

Now we use the assumed independencies. $L_1^{a_0}$ is independent of $A_0$ conditional on $L_0$, so we can add $A_0$ to the conditioning: $P(L_1^{a_0}=l_1|L_0=l_0, A_0=a_0)$. Assuming consistency makes this equal to the observable $P(L_1=l_1|L_0=l_0, A_0=a_0)$.

Both $A_0$ and $A_1^{a_0}$ are independent of $Y^{a_0, a_1}$ conditional on the confounders, so we can add them to the conditioning of the expectation: $E[Y^{a_0, a_1}|L_0=l_0, L_1^{a_0}=l_1, A_0=a_0, A_1^{a_0}=a_1]$. Finally, assuming consistency, we have the observable $E[Y|L_0=l_0, L_1=l_1, A_0=a_0, A_1=a_1]$. In total, we have the g-formula…

$$\sum_{l_0, l_1} E[Y|L_0=l_0, L_1=l_1, A_0=a_0, A_1=a_1] \times P(L_1=l_1|L_0=l_0, A_0=a_0) \times P(L_0=l_0)$$

(Something may be wrong with this…)
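One way to check the formula is a small simulation. A sketch of the two-time-point g-formula with simulated data and hypothetical names, where the true value of the joint effect is known by construction:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with treatment-confounder feedback:
# L0 -> A0 -> L1 -> A1 -> Y (plus direct treatment effects on Y).
rng = np.random.default_rng(4)
n = 200_000
L0 = rng.binomial(1, 0.5, n)
A0 = rng.binomial(1, 0.3 + 0.3 * L0, n)
L1 = rng.binomial(1, 0.2 + 0.3 * A0 + 0.3 * L0, n)
A1 = rng.binomial(1, 0.2 + 0.3 * L1, n)
Y = 0.4 * A0 + 0.4 * A1 + 0.5 * L0 + 0.5 * L1 + rng.normal(0, 1, n)
data = pd.DataFrame({"L0": L0, "A0": A0, "L1": L1, "A1": A1, "Y": Y})

# The two model pieces of the g-formula, fitted on the observed data.
l1_model = smf.logit("L1 ~ L0 + A0", data=data).fit(disp=0)
y_model = smf.ols("Y ~ A0 + A1 + L0 + L1", data=data).fit()

def g_formula(a0, a1):
    # Sum over l0, l1 of E[Y|...] * P(L1=l1 | L0=l0, A0=a0) * P(L0=l0).
    total = 0.0
    for l0 in (0, 1):
        p_l0 = (data.L0 == l0).mean()
        for l1 in (0, 1):
            row = pd.DataFrame({"L0": [l0], "A0": [a0], "L1": [l1], "A1": [a1]})
            p_l1_given = l1_model.predict(row).iloc[0]
            p_l1 = p_l1_given if l1 == 1 else 1.0 - p_l1_given
            total += y_model.predict(row).iloc[0] * p_l1 * p_l0
    return total

# True E[Y^{1,1}] - E[Y^{0,0}] here is 0.4 + 0.4 + 0.5 * 0.3 = 0.95.
print(g_formula(1, 1) - g_formula(0, 0))
```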

9. Mediator-confounder feedback

In progress…

10. Dynamic treatment strategies

In progress…