eteppo

Target Trial Emulation in Practice

Published: 2024-08-08

  1. Define your question as a target trial protocol

In target trial emulation, the research question takes the form of a trial protocol and a causal contrast or effect measure. The effect is defined and identified using a causal theory of choice – many are basically equivalent – and estimated using statistical methods of choice. (Currently I like counterfactuals and single-world intervention graphs [SWIGs], and Bayesian models in probabilistic programming languages inferred with MCMC algorithms.)

Familiarity with trials is very helpful but the basic elements are fairly simple:

Obviously (and importantly), trials are designed and run over time and so contain different kinds of dates, observation windows, and visit schedules – for example, the recruitment period, time zero (baseline), the follow-up visits, and the end of follow-up.

When the trial protocol and the variables are defined, the causal contrast or effect measure can be defined precisely. Options include, for example, the intention-to-treat effect and the per-protocol effect.

Causal diagrams (like SWIG) can be used to clarify the effect of interest and how it is identifiable in the target trial.

Finally, a data analysis plan for estimating the causal contrast of interest – given data from this target trial – is written. Note that this plan is made blind to any results from real data and will serve as a template for the emulation analysis plan. Problems arising later with real data will be ironed out in a series of sensitivity analyses.

As a fan of Bayesian analysis with probabilistic programming languages, I would also make a few points on model building at this stage.

  2. Plan how to emulate each element of the target trial protocol

Even though the target trial is supposed to define our goal, it is not a useful goal if it is too far from what can realistically be emulated. Even the first guess at a target trial (a question) of interest can't be too fanciful, as we eventually need to emulate it with real data.

For most features of the trial, emulation is just a matter of finding relevant data sources that match or approximate the measurements that would've been done in the target trial. If data sources are not found, the target trial (question) needs to be modified to match the available data (or the project is stopped). This is the case for eligibility, treatment, time zero, end of follow-up, and primary outcome – all of which need to match the target trial. A precise description of the variables is needed, that is, how each target trial element has been measured (or predicted) – comparing an ideal variable in a target trial to the available data is a good way to clarify sources of error. In general, evaluating the reliability of observational 'found' data (and prediction models) is crucial.
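One way to keep this element-to-source bookkeeping explicit is a simple mapping from protocol elements to data sources. A minimal sketch in Julia – every source name here is a hypothetical placeholder, not a recommendation:

```julia
# Map each target-trial element to the data source that emulates it.
# All source descriptions are made-up placeholders.
emulation_map = Dict(
    "eligibility"      => "diagnosis codes and lab values in the registry",
    "treatment"        => "dispensed prescriptions",
    "time zero"        => "date of first dispensation",
    "end of follow-up" => "outcome, death, or end of the study period",
    "primary outcome"  => "hospital discharge diagnoses",
)

# Elements without a source would force a change to the target trial.
unmatched = [element for (element, source) in emulation_map if isempty(source)]
```

Writing the mapping down makes it obvious which elements lack a data source and therefore require modifying the question.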

Placebo treatments may be possible to emulate using other treatments that can be assumed to be inert with respect to the outcome of interest. For example, in drug research, unrelated drugs have been used as placebos. However, it is often quite unclear how close any candidate placebo-emulator actually is to a real placebo used in a placebo-controlled trial. In any case, placebos are rarely used without blinding, and other active comparators may be more informative for practice.

Treatment assignment is a trial concept that is not usually available in observational data. In trials, assignment is basically a treatment plan, so in principle, different treatment plans (e.g. prescriptions) might be used to emulate treatment assignment. However, many trials assign and start the treatment at the same time, so the first treatment can be used as an analog for assignment. As such, the intention-to-treat effect can be emulated by comparing outcomes under interventions on initiating treatment and remaining uncensored, such as \(\Pr[Y_k^{a_0 = 1, \bar{c} = \bar{0}} = 1] - \Pr[Y_k^{a_0 = 0, \bar{c} = \bar{0}} = 1]\) up to time \(k\), often called the observational analogue of intention-to-treat. Note how \(a_0 = 1\) is used instead of the missing \(z = 1\).

Per-protocol effect emulation is possible by assuming the exclusion restriction, which means that assignment affects the outcome only through the treatment. So, a per-protocol effect could compare outcomes under interventions on full treatment and remaining uncensored, such as \(\Pr[Y_k^{g_1, \bar{c} = \bar{0}} = 1] - \Pr[Y_k^{g_0, \bar{c} = \bar{0}} = 1]\) up to time \(k\).

Randomization of the treatment assignment can then be emulated using methods of causal inference. Hypothetically, if all relevant confounders (a sufficient adjustment set) were known and measured precisely, the treatment would be effectively randomized within levels of the confounders (g-formula). Alternatively, special causal structures may be searched for and used to emulate randomization. So, emulating randomization usually starts from making causal assumptions on a causal diagram (like a SWIG) and then finding data to block the backdoor paths in the analysis. This is what justifies the exchangeability assumption. Again, measurement (or prediction) of confounders needs to be sufficiently reliable, complete, and precise.

The full set of assumptions needed to identify counterfactual measures from observable data is slightly more complicated. Here's a summary.

The most generally useful conditions for identifying causal effects are exchangeability, positivity, and consistency. These would be fulfilled in a conditionally randomized experiment, which we try to emulate. More specifically, we need to assume that the treatment groups are exchangeable within levels of the measured covariates (\(Y^a \perp A | L\)), that every treatment level occurs with positive probability within every covariate level (\(\Pr[A = a | L = l] > 0\) whenever \(\Pr[L = l] > 0\)), and that the observed outcome equals the counterfactual outcome under the treatment actually received (\(Y = Y^A\)).

Less commonly, it may be possible to avoid assuming 'no unmeasured confounding' by leveraging variables that act like a random assignment (instrumental variables), complete mediators of the effect (frontdoor formula), or proxies of unmeasured confounders (proximal inference). Each method assumes a certain causal structure and possibly many additional assumptions. I won't cover these methods in this post.

For example, common sources of confounding in medicine include age and sex, ethnicity and socioeconomic factors, comorbidity and outcome severity factors, co-treatment and health behaviour factors, and healthcare use. In general, confounders can be identified by thinking about the most influential factors that affect the choice of treatment and the risk of the outcome, or that are effects of such factors (or maybe just predictors). Note that misclassifying colliders or instrumental variables as common causes will cause bias (!) – domain knowledge of the causal graph is important.

A data analysis plan for the emulation is then specified. It is usually similar to the target trial analysis plan except for the variables and methods needed to emulate randomized treatment assignment.

Finally, when a feasible target trial and emulation protocol have been found, the protocols should be preregistered before they can be affected by results.

  3. Perform pre-inference checks

Before proceeding to inference, avoidable errors should be screened for. This includes checking that the sample size is sufficient (precision), the distribution of baseline variables by treatment (positivity violations), the distribution of variables by adherence to assignment (informative censoring), and possible indicators of model misspecification. The model and protocol can still be modified while the analyst is blinded to the primary result – but the changes need to be reported as amendments to the original plan.
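A crude positivity screen can be as simple as cross-tabulating treatment by baseline strata and flagging empty cells. A toy sketch with made-up data:

```julia
# Made-up baseline data: one binary covariate L and treatment A.
L = [0, 0, 1, 1, 0, 1, 0, 1]
A = [0, 1, 0, 1, 0, 1, 1, 0]

# Count units in each (covariate, treatment) cell;
# empty cells signal possible positivity violations.
cells = Dict((l, a) => count(p -> p == (l, a), zip(L, A))
             for l in (0, 1), a in (0, 1))
violations = [cell for (cell, n) in cells if n == 0]
```

With many covariates (or continuous ones), propensity score overlap plots do the same job, but the idea is identical: look for strata where one treatment level never occurs.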

In Bayesian analysis, posterior predictive checking can show how close data generated from the fitted model are to the real data. Comparing visualizations of the datasets side by side is often informative about severe misspecification – any measures of interest on the real and predicted data can be compared. Measures of predictive accuracy can summarize overall predictive performance, but it is crucial to remember that predictive accuracy is not the goal of causal inference, nor of modelling in general. To make the predictions more realistic (out-of-sample), cross-validation is necessary (for example, with PSIS-LOO).
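The basic mechanics of a posterior predictive check can be sketched generically: compare a statistic of the real data with its distribution over replicated datasets drawn from the fitted model. Here the replicates are just random stand-ins, not output from any real model:

```julia
using Random, Statistics

Random.seed!(1)

# Made-up real outcomes and stand-in replicated datasets
# (in practice, the replicates come from the fitted model).
y_real = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
y_reps = [rand(0:1, 10) for _ in 1:200]

# Compare the event proportion: the share of replicates with a
# statistic at least as large as the real one (a posterior
# predictive p-value; extreme values suggest misspecification).
stat(y) = sum(y) / length(y)
ppp = count(r -> stat(r) >= stat(y_real), y_reps) / length(y_reps)
```

Any measure of interest can be used as the statistic, which is exactly the point made above.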

  4. Plan post-inference checks

What is the effect of plausible study changes and assumption violations on the result? A plan to evaluate the robustness of the primary result should be prespecified.

In principle, any choice in the study pipeline could be varied to assess its effect on the result. The more arbitrary the decision, the more its effect on results should be assessed.

The effect of potential measurement error, unmeasured confounding, and selection bias can be explored through modelling and simulation. Net bias can also be assessed by replicating published RCTs or by using negative (or positive) control variables, that is, by replicating other well-known effects or associations.

In Bayesian modelling, checking the sensitivity of the results to the choice of priors and likelihood has been recommended.

  5. Perform inference and checks and report

Planning and pre-specification are key to minimizing work at the inference phase. This way, the study design is not affected by the results, which reduces bias in the final result. However, iterative model improvement may still be needed, and this is fine as long as it is not directed at particular metrics and is reported as amendments to the original plan.

Toy example

Target trial specification

SWIG of the target trial

Assuming exchangeability, positivity, and consistency as well as adherence, exclusion restriction, and no censoring for simplicity (and some other assumptions), we can identify...

\(\Pr[Y_k^{\bar{z}, g} = 1] = \Pr[Y_k^{\bar{z}} = 1] = \Pr[Y_k = 1 | Z = z] =\) \(1 - \Pr[Y_k = 0 | Z = z] =\) \(1 - \prod_{m=1}^{k} \Big( 1 - \Pr[Y_m = 1 | Y_{m-1} = 0, Z = z] \Big)\)

In a primary analysis plan, the hazard \(\Pr[Y_m = 1 | Y_{m-1} = 0, Z = z]\) is estimated with a logistic regression model on person-time data. Risk up to \(k\) is predicted for \(Z = 1\) and \(Z = 0\) and subtracted. The mean and the 2.5th-97.5th percentiles of the posterior are computed.
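The last step of the plan – turning per-interval hazards into risk up to \(k\) – can be sketched in plain Julia. This is a toy illustration with made-up hazard values, separate from the actual model:

```julia
# Convert per-interval hazards Pr[Y_m = 1 | Y_{m-1} = 0] into
# cumulative risk up to each k: risk_k = 1 - prod_{m=1}^k (1 - hazard_m).
function hazard_to_risk(hazards)
    return 1 .- cumprod(1 .- hazards)
end

# A constant 10% hazard over three intervals.
risks = hazard_to_risk([0.1, 0.1, 0.1])
```

In the real analysis, the hazards would be posterior predictions from the logistic model, and the transformation would be applied per posterior draw.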

Emulation

We pretend the data contains accurate and consistent measurements of treatment \(A\) and outcomes \(Y\) over time intervals \(k\) for eligible units as in the target trial. End of follow-up is at the outcome or end of study. We emulate the per-protocol effect as \(\Pr[Y_k^{g_1} = 1] - \Pr[Y_k^{g_0} = 1]\) assuming assignment similar to the target trial.

The SWIG is difficult to draw at this point as there are way too many arrows. But to get the idea, I sketched the graph below.

SWIG of the emulation data

We can at least see how a time-varying confounder (in red) can generally open backdoors all over the place. This is the reason why we usually need to adjust for the whole history of the confounder. The confounders also become counterfactual through treatment-confounder feedback, so one also needs to adjust conditional on the treatment history. Blocking \(L\) and \(A\) can also block common causes of \(L\) and \(Y\), which are called \(W\) in this sketch.

Assuming exchangeability of treatment groups at each \(k\) conditional on past treatment and confounders, positivity, and consistency as well as adherence, exclusion restriction, and no censoring for simplicity (and some other assumptions), we can identify \(\Pr[Y_k^g = 1]\) using the parametric g-formula.

The g-formula for survival outcomes is the conditional survival curve standardized to the confounder distribution...

\(\sum_l \Big(\prod_{m=0}^k \Pr[Y_{m+1} = 0|Y_m = 0, L=l, A=a]\Big) \times \Pr[L=l]\)
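The formula above can be plugged through numerically for a binary baseline confounder \(L\). All values below are made up just to show the arithmetic:

```julia
# Toy plug-in of the g-formula for survival outcomes with a binary
# baseline confounder L (all numbers are made up).
# surv[l][m] stands for Pr[Y_{m+1} = 0 | Y_m = 0, L = l, A = a].
surv = Dict(0 => [0.95, 0.94, 0.93], 1 => [0.90, 0.88, 0.85])
pL = Dict(0 => 0.6, 1 => 0.4)  # Pr[L = l]

# Standardized survival: sum_l (prod_m surv[l][m]) * Pr[L = l].
std_survival = sum(prod(surv[l]) * pL[l] for l in (0, 1))

# The corresponding standardized risk up to k.
std_risk = 1 - std_survival
```

The sum over \(l\) is what "standardized to the confounder distribution" means: a weighted average of the conditional survival curves with weights \(\Pr[L = l]\).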

The g-formula for end-of-follow-up outcomes and time-varying treatments and confounders is the conditional mean outcome standardized to the confounder distribution at each time step conditional on the history needed for exchangeability...

\(\sum_{\bar{l}} E[Y|\bar{A} = \bar{a}, \bar{L} = \bar{l}] \times \prod_{k=0}^K f(l_k|\bar{a}_{k-1}, \bar{l}_{k-1})\)

We need a combination of these variants but it's not necessary to use this representation of the g-formula. We can also think of an equivalent process as a simulation.

The parametric (plug-in) g-formula can be estimated using simulation because a weighted average of the expectations is the same as an expectation of the expectations. The whole density (marginal or conditional, as above) is not needed since samples from the distribution are enough to average over – and these samples can be either observed or simulated. A related alternative is the iterated conditional expectations (ICE) representation of the g-formula, which replaces the simulation of covariates with sequentially fitted outcome regressions.
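The claim that averaging over simulated draws matches averaging over the explicit density can be checked with a tiny Monte Carlo sketch. The probabilities and conditional means here are made up for illustration:

```julia
using Random, Statistics

# Toy point-treatment setting: binary confounder L with Pr[L = 1] = 0.3
# and a known conditional mean E[Y | A = a, L = l] (made-up values).
pL1 = 0.3
condmean(l) = l == 1 ? 0.5 : 0.2

# Weighted average of the expectations (explicit density):
# sum_l E[Y | A = a, L = l] * Pr[L = l].
exact = condmean(1) * pL1 + condmean(0) * (1 - pL1)

# Expectation of the expectations: simulate L, plug into the
# conditional mean, and average over the draws.
Random.seed!(1)
approx = mean(condmean(rand() < pL1 ? 1 : 0) for _ in 1:100_000)
```

With enough draws the two agree to Monte Carlo error, which is why simulating from the fitted model is a valid way to compute the plug-in g-formula.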

In the primary analysis plan, we model the outcome hazard using the same logistic regression model on person-time data, adjusting for confounders. The confounders and past treatment (A, L1, L2, L3) are modelled similarly using regression models. Finally, data are simulated sequentially, and the hazard is estimated at each \(k\) and transformed to risk up to \(k\). The mean and the 2.5th-97.5th percentiles of the posterior of the risk are computed.

Data analysis

In Bayesian analysis, the model and the simulator are often the same. This makes planning the data analysis from scratch much easier.

Here's a quick and dirty implementation of the model in Julia (Turing).

using Turing
using StatsFuns: logit

@model function pooled_logit_hazard(
	n::Int, 
	K::Int,
	L1::Matrix{Union{Missing, Int}},
	L2::Matrix{Union{Missing, Int}},
	L3::Vector{Union{Missing, Int}},
	A::Matrix{Union{Missing, Int}},
	Y::Matrix{Union{Missing, Int}}
)

	# Baseline covariate uncertainty.
	p₁ ~ filldist(Beta(1, 1), 4)

	# Global average log-odds uncertainty.
	# logit(0.1) puts the center on 10% probability.
	# log(50) then spreads 1SD uncertainty over roughly 0.2-85% probability,
	# where 50 is the odds ratio.
	αL1 ~ Normal(logit(0.1), log(50))
	αL2 ~ Normal(logit(0.1), log(50))
	αA ~ Normal(logit(0.5), log(50))
	αY ~ Normal(logit(0.01), log(50))
	
	# Time-dependent adjustment on the global average.
	# log(1) puts the center on (conditional) odds ratio 1 (constant over time).
	# log(5) spreads 1SD uncertainty on 0.2-5 conditional odds ratio.
	αtL1 ~ filldist(Normal(log(1), log(5)), 6)
	αtL2 ~ filldist(Normal(log(1), log(5)), 6)
	αtA ~ filldist(Normal(log(1), log(5)), 6)
	αtY ~ filldist(Normal(log(1), log(5)), 7)

	# Confounder and lag-1 coefficient uncertainty.
	# log(1) puts the center on odds ratio 1 (conditional independence).
	# log(10) spreads 1SD uncertainty on 0.1-10 odds ratio.
	βL1 ~ filldist(Normal(log(1), log(10)), 2)
	βL2 ~ filldist(Normal(log(1), log(10)), 2)
	βA ~ filldist(Normal(log(1), log(10)), 4)
	βY ~ filldist(Normal(log(1), log(10)), 7)

	# k = 1
	for i in 1:n
		L1[i,1] ~ Bernoulli(p₁[1])
		L2[i,1] ~ Bernoulli(p₁[2])
		L3[i] ~ Bernoulli(p₁[3])
		A[i,1] ~ Bernoulli(p₁[4])
		
		leY = αY + αtY[1] + 
		      βY[1]*A[i,1] + 
		      βY[3]*L1[i,1] + βY[5]*L2[i,1] + βY[7]*L3[i]
		Y[i,1] ~ BernoulliLogit(leY)
	end

	for k in 2:K
		
		atrisk = findall((x -> !ismissing(x) && x == 0), Y[:,k-1])
		length(atrisk) == 0 && break
		
		for i in atrisk
			leL1 = αL1 + αtL1[k-1] +
			       βL1[1]*L1[i,k-1] + βL1[2]*L3[i]
			L1[i,k] ~ BernoulliLogit(leL1)
			
			leL2 = αL2 + αtL2[k-1] + 
			       βL2[1]*L1[i,k] + βL2[2]*L3[i]
			L2[i,k] ~ BernoulliLogit(leL2)
			
			leA = αA + αtA[k-1] + 
			      βA[1]*A[i,k-1] + βA[2]*L1[i,k] + 
			      βA[3]*L1[i,k-1] + βA[4]*L3[i]
			A[i,k] ~ BernoulliLogit(leA)
		
			leY = αY + αtY[k] + 
			      βY[1]*A[i,k] + βY[2]*A[i,k-1] + 
			      βY[3]*L1[i,k] + βY[4]*L1[i,k-1] + 
			      βY[5]*L2[i,k] + βY[6]*L2[i,k-1] + 
			      βY[7]*L3[i]
			Y[i,k] ~ BernoulliLogit(leY)
		end
	
	end
	return (L1 = L1, L2 = L2, L3 = L3, A = A, Y = Y)
end

I won't do much testing in this post as it's quite time-consuming. But we can take a quick look at the prior predictive distribution of the outcome \(Y\).

unobserved_data = (
	L1 = Matrix{Union{Missing, Int}}(missing, 100, 7),
	L2 = Matrix{Union{Missing, Int}}(missing, 100, 7),
	L3 = Vector{Union{Missing, Int}}(missing, 100),
	A = Matrix{Union{Missing, Int}}(missing, 100, 7),
	Y = Matrix{Union{Missing, Int}}(missing, 100, 7)
)

# Deepcopy the missing arrays because the model fills sampled values
# into its argument arrays, so they must not be reused across calls.
priors = sample(
	pooled_logit_hazard(100, 7, deepcopy(unobserved_data)...),
	Prior(),
	100
)

Here is a sample of risk functions that the priors allow.

We can also simulate data with some fixed parameter values and check the posterior predictive distribution of the outcome \(Y\) conditional on the simulated dataset.

fixedmodel = fix(
                pooled_logit_hazard(100, 7, deepcopy(unobserved_data)...),
                (
                        p₁ = fill(0.25, 4),
                        αL1 = logit(0.5),
                        αL2 = logit(0.5),
                        αA = logit(0.5),
                        αY = logit(0.05),
                        # Simplest case: Zero coefficients all over.
                        αtL1 = fill(0, 6),
                        αtL2 = fill(0, 6),
                        αtA = fill(0, 6),
                        αtY = fill(0, 7),
                        βL1 = fill(0, 2), 
                        βL2 = fill(0, 2), 
                        βA = fill(0, 4), 
                        βY = fill(0, 7),
                )
        )

simdata = fixedmodel()

# Fit the model on the simulated dataset (this step was missing;
# NUTS is used here as a default sampler).
posts_given_sim = sample(
	pooled_logit_hazard(100, 7, simdata...),
	NUTS(),
	1000
)

preds_given_sim = predict(
	pooled_logit_hazard(100, 7, deepcopy(unobserved_data)...),
	posts_given_sim
)

Here's the whole posterior distribution of risk functions from the model fitted on the simulated dataset.

At inference time, we would simulate the fitted/learned model given initial values of the covariates (A, L) and either the treatment strategy \(g_1\) or \(g_0\), and then compute the effect measure as a difference of the two posterior predictions. Given all the assumptions, this estimate would then have the causal interpretation we have been targeting. (Note that only this estimate has a causal interpretation – not the model coefficients or anything else.)
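Given predicted outcome draws under each strategy, the contrast is just a per-draw risk difference summarized over the posterior. A sketch with small hypothetical arrays standing in for the posterior-predictive output:

```julia
using Statistics

# Hypothetical posterior-predictive outcomes at time k, one matrix per
# strategy (rows = posterior draws, columns = units). Made-up values.
Y_g1 = [1 0 1 1; 0 1 1 0; 1 1 0 1]   # under strategy g1
Y_g0 = [0 0 1 0; 0 1 0 0; 1 0 0 0]   # under strategy g0

# Per-draw risks and their difference.
rd_draws = vec(mean(Y_g1, dims = 2) .- mean(Y_g0, dims = 2))

# Posterior summary: mean and 2.5th/97.5th percentiles.
rd_mean = mean(rd_draws)
rd_lo, rd_hi = quantile(rd_draws, [0.025, 0.975])
```

Computing the difference within each draw (rather than differencing posterior means) is what propagates the full posterior uncertainty into the effect estimate.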

The generated_quantities or predict functions do the job in Julia's Turing library.

Then we would perform the planned robustness checks.

I might make this toy example more complete over time. This is it for now.

CC BY-SA 4.0 Eero Teppo. Last modified: March 23, 2025. Website built with Franklin.jl and the Julia programming language.