Understanding the Parametric G-Formula Through Source Code Analysis
G-formula is a causal inference method which, as a general equation, looks like this (the standard form for the mean of an outcome $Y$ under a treatment strategy $\bar{a} = (a_0, \ldots, a_K)$ with time-varying confounders $\bar{l}$):

$$E[Y_{\bar{a}}] = \sum_{\bar{l}} E[Y \mid \bar{A} = \bar{a}, \bar{L} = \bar{l}] \prod_{k=0}^{K} f(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1})$$
Now if you’re like me, you don’t really know what this equation asks you to do in practice. Fortunately others know, and you can learn it directly from their implementations.
The summary is: you just need to run simulations of the causal graph!
- Draw your assumptions and select confounders that give exchangeability for the intervention at each time point (a directed acyclic graph with the generalized backdoor criterion, or a single world intervention graph with the ordinary backdoor criterion).
- Fit a regression model that predicts observed outcomes from observed interventions and confounders.
- Fit a regression model that predicts the observed intervention from observed past interventions and confounders.
- Fit regression models that predict observed confounders from observed past confounders and past interventions.
- Fit a regression model that predicts competing events (like censoring) from the relevant observed confounders.
- Start from time zero with the observed confounder values: set the intervention, predict competing events, and finally predict outcomes (unless only the outcome at the end is relevant). Go to the next time step: predict confounder values, set the intervention, predict competing events, and predict the outcome. Repeat until the simulation is finished.
- Compute the mean predicted outcome for each intervention and compare the means to get an effect estimate. (A self-contained toy version of this recipe follows below.)
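Here is that toy version in R. Everything in it is my own illustration (two time points, one time-varying confounder, no censoring or competing events), not the package's code:

# Toy parametric g-formula: L0 -> A0 -> L1 -> A1 -> Y, binary outcome.
set.seed(1)
n <- 5000
L0 <- rbinom(n, 1, 0.5)
A0 <- rbinom(n, 1, plogis(-0.5 + L0))
L1 <- rbinom(n, 1, plogis(-0.5 + L0 + A0))
A1 <- rbinom(n, 1, plogis(-0.5 + L1 + A0))
Y  <- rbinom(n, 1, plogis(-1 + 0.5 * L0 + L1 - 0.7 * A0 - 0.7 * A1))
obs <- data.frame(L0, A0, L1, A1, Y)
# Fit models for the time-varying confounder and the outcome.
fitL1 <- glm(L1 ~ L0 + A0, family = binomial(), data = obs)
fitY  <- glm(Y ~ L0 + A0 + L1 + A1, family = binomial(), data = obs)
# Simulate forward under a static intervention (a0, a1) and average.
gf_mean <- function(a0, a1) {
  pL1   <- predict(fitL1, newdata = data.frame(L0 = obs$L0, A0 = a0),
                   type = "response")
  simL1 <- rbinom(n, 1, pL1)  # simulate the confounder at time 1
  pY    <- predict(fitY, newdata = data.frame(L0 = obs$L0, A0 = a0,
                                              L1 = simL1, A1 = a1),
                   type = "response")
  mean(pY)                    # mean outcome under the intervention
}
# "Always treat" vs "never treat" risk difference.
gf_mean(1, 1) - gf_mean(0, 0)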
I'm familiar with the R language, so I downloaded a g-formula implementation from this GitHub repository. The authors have explained the package in this article, but let's just look at the source code itself since it's openly available.
First we can look at the R folder, which contains the following files:
$ tree
├── bootstrap.R
├── comparisons.R
├── data.R
├── gformula.R
├── globals.R
├── helper.R
├── histories.R
├── interventions.R
├── pred.R
├── restrictions.R
├── s3methods.R
└── simulate.R
We start from the main gformula.R file, its main function, and its arguments.
gformula <- function(
obs_data,
id,
time_points = NULL,
time_name,
# Time-varying covariates
covnames,
covtypes, # Normal, categorical, etc.
covparams, # Model formula etc. for each covariate
covfits_custom = NA,
covpredict_custom = NA,
# Fixed covariates
basecovs = NA,
# History functions for lagged terms etc.
histvars = NULL,
histories = NA,
baselags = FALSE,
# Outcome model
outcome_name,
outcome_type, # Continuous end-of-follow-up, survival, etc.
ymodel, # Model formula for outcome
# Competing events
compevent_name = NULL,
compevent_model = NA,
compevent_cens = FALSE,
# Censoring (also a competing event)
censor_name = NULL,
censor_model = NA,
# Interventions
intvars = NULL,
interventions = NULL,
int_times = NULL,
int_descript = NULL,
ref_int = 0, # Reference intervention.
int_visit_type = NULL, # For carry forward restriction
intcomp = NA,
# Visit process (explained later)
visitprocess = NA,
# Restrictions on distributions
restrictions = NA,
yrestrictions = NA,
compevent_restrictions = NA,
# Bootstrapping settings
nsimul = NA,
sim_data_b = FALSE,
seed,
nsamples = 0,
parallel = FALSE,
ncores = NA,
ci_method = 'percentile',
threads,
model_fits = FALSE,
boot_diag = FALSE,
show_progress = TRUE,
# Inverse-probability weight truncation
ipw_cutoff_quantile = NULL,
ipw_cutoff_value = NULL,
...
)
Let’s first look at a usage example. It clarifies the meaning of the main arguments.
gformula(obs_data = obs_data,
id = "id",
# This is the k in the equation.
time_points = 4,
time_name = "t0",
# Survival outcome Y.
outcome_name = "Y",
outcome_type = "survival",
# One time-fixed confounder L3.
basecovs = c('L3'),
# Two time-varying confounders L1 and L2. Observed intervention A.
covnames = c('L1', 'L2', 'A'),
covtypes = c('binary', 'bounded normal', 'binary'),
# For all covariates, we need the k-1 value.
# For L1 and L2, smoothing is done with a cumulative average.
histories = c(lagged, lagavg),
histvars = list(c('A', 'L1', 'L2'), c('L1', 'L2')),
# The conditional probabilities come from these "covariate" models.
# All are conditional on all other covariates at previous time point.
covparams = list(covmodels = c(
# "lag1" means "k-1", "cumavg" means cumulative average.
# A single model is fit to the whole data, and the same model predicts
# for all time points as "t0" is set to the current "k".
L1 ~ t0 + lag1_A + lag_cumavg1_L1 + lag_cumavg1_L2 + L3,
L2 ~ t0 + lag1_A + L1 + lag_cumavg1_L1 + lag_cumavg1_L2 + L3,
A ~ t0 + lag1_A + L1 + L2 + lag_cumavg1_L1 + lag_cumavg1_L2 + L3
)),
# The expectations come from this outcome model.
# Time is a parameter along with all covariates.
# Lagged terms are added again to improve the fit, allowing delayed effects.
# Remember that L1-3 are included only because they are a sufficient
# adjustment set for the effect of interest.
ymodel = Y ~ t0 + A + L1 + L2 + L3 + lag1_A + lag1_L1 + lag1_L2,
# Interventions of interest are "A == 0" always and "A == 1" always.
intvars = list('A', 'A'),
interventions = list(
list(c(static, c(0,0,0,0))),
list(c(static, c(1,1,1,1)))
),
int_descript = c('Never treat', 'Always treat'))
This example is written without reference to the causal assumptions made in order to specify all the models. But we can try to reconstruct some monstrosity of a causal graph to make them clearer.
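For instance, the kind of DAG implied by the model formulas could be sketched with the dagitty package. The graph below is my own guess for two adjacent time points, not something stated by the package authors:

library(dagitty)
# Guessed structure for times 0 and 1 of the example: the time-fixed L3
# affects everything; within a time point the order is L1 -> L2 -> A;
# everything earlier can affect everything later.
g <- dagitty('dag {
  L3 -> { L1_0 L2_0 A_0 L1_1 L2_1 A_1 Y }
  L1_0 -> { L2_0 A_0 L1_1 L2_1 A_1 Y }
  L2_0 -> { A_0 L2_1 A_1 Y }
  A_0 -> { L1_1 L2_1 A_1 Y }
  L1_1 -> { L2_1 A_1 Y }
  L2_1 -> { A_1 Y }
  A_1 -> Y
}')
plot(graphLayout(g))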
Let’s go through the code now.
We can see that there are separate main functions for continuous end-of-follow-up, binary end-of-follow-up, and survival outcomes. Let's focus on the binary end-of-follow-up function, gformula_binary_eof, and go through it piece by piece.
First, based on the covariate, censoring, and outcome models, some indicators are created for where to use those lagged and cumulatively averaged terms in the different models. This is just housekeeping.
# Covariate model
lag_indicator <- lagavg_indicator <- cumavg_indicator <- c()
lag_indicator <- update_lag_indicator(covparams$covmodels, lag_indicator)
lagavg_indicator <- update_lagavg_indicator(covparams$covmodels, lagavg_indicator)
cumavg_indicator <- update_cumavg_indicator(covparams$covmodels, cumavg_indicator)
# Outcome model
if (!missing(ymodel)){
lag_indicator <- update_lag_indicator(ymodel, lag_indicator)
lagavg_indicator <- update_lagavg_indicator(ymodel, lagavg_indicator)
cumavg_indicator <- update_cumavg_indicator(ymodel, cumavg_indicator)
}
# Censoring model
censor <- !(length(censor_model) == 1 && is.na(censor_model))
if (censor){
lag_indicator <- update_lag_indicator(censor_model, lag_indicator)
lagavg_indicator <- update_lagavg_indicator(censor_model, lagavg_indicator)
cumavg_indicator <- update_cumavg_indicator(censor_model, cumavg_indicator)
}
histvals <- list(lag_indicator = lag_indicator,
lagavg_indicator = lagavg_indicator,
cumavg_indicator = cumavg_indicator)
Sidenote: Lagged terms are common in time-dependent models in general because it would be too much to assume that the current outcome is independent of the previous values of the cause; how effects play out over time can be quite complicated. Cumulative averages are also common, as a way of smoothing over time. A cumulative sum could be fine too if there's reason to assume that causes accumulate, so that the whole history of the cause can be summarized in a single number without all those additional terms.
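A small sketch of what these derived columns look like for one individual (base R here, while the package builds them with data.table):

# One individual's treatment over four time points.
A <- c(0, 1, 1, 0)
# "lag1" shifts the series by one step; time 0 gets a default value of 0 here
# (the package can also use the baseline value, as shown later).
lag1_A <- c(0, head(A, -1))           # 0 0 1 1
# The cumulative average smooths the whole history into one number per step.
cumavg_A <- cumsum(A) / seq_along(A)  # 0.00 0.50 0.67 0.50
# And "lag_cumavg1" is that cumulative average lagged by one step.
lag_cumavg1_A <- c(0, head(cumavg_A, -1))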
After some multithreading setup, missing argument values are defined.
# No competing events.
comprisk <- FALSE
comprisk2 <- FALSE
compevent_model <- NA
compevent2_model <- NA
compevent_name <- NULL
compevent2_name <- NULL
compevent_restrictions <- NA
# Outcome type
outcome_type <- 'binary_eof'
# No hazard ratios since only outcome at the end is of interest.
# Hazard means the risk of the event at a single time point.
# You can compute the hazards and turn them into a survival function.
hazardratio <- FALSE
intcomp <- NA
min_time <- min(obs_data[[time_name]])
below_zero_indicator <- min_time < 0
# Copy of data to mutate it.
obs_data <- copy(obs_data)
After checking that all the arguments are as expected (which we can skip), the function turns to the visit process. It literally means the visits at which all those variables are measured and interventions are given. When you add a visit process in this method, you are imputing missing values until too many time steps have passed.
max_visits <- NA
if (!is.na(visitprocess[[1]][[1]])){
for (vp in visitprocess){
# A visit process adds two restrictions on the distribution of the covariate.
# Effectively it imputes missing values from previous visits until there are too many.
restrictions <- c(
restrictions[!is.na(restrictions)],
list(
# When individual has missed too many visits, visit variable is set to 1.
c(vp[1], paste("lag1_ts_", vp[1], "!=",vp[3], sep = ""), simple_restriction, 1),
# If hasn't missed enough visits, values are carried forward from the previous visit.
c(vp[2], paste(vp[1], "==1", sep = ""), carry_forward)
)
)
if (is.na(max_visits[1])){
# Maximum number of visits one can miss before being marked as censored.
max_visits <- as.numeric(vp[3])
} else {
max_visits <- c(max_visits, as.numeric(vp[3]))
}
# This keeps track of the visit count.
if (is.na(histories[1])){
histories <- c(visit_sum)
} else {
histories <- c(visit_sum, histories)
histvars <- append(list(c(vp[1])),histvars)
}
}
}
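Reading off the indices above, each element of visitprocess appears to be a triple: the name of the binary visit indicator (vp[1]), the name of a covariate measured at those visits (vp[2]), and the maximum number of missed visits (vp[3]). So a hypothetical argument could look like this (my reading of the code, not an example from the package documentation):

# Visit indicator "visit", covariate "L2" carried forward between visits,
# censoring after 3 missed visits.
visitprocess = list(c('visit', 'L2', 3))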
Now the function continues by generating data for all those history terms in the models.
for (t in 0:max(obs_data[[time_name]])) {
make_histories(pool = obs_data,
histvars = histvars,
histvals = histvals,
histories = histories,
time_name = time_name,
t = t,
id = id,
baselags = baselags,
below_zero_indicator = below_zero_indicator)
}
sample_size <- length(unique(obs_data[[id]]))
time_points <- max(obs_data[[time_name]])+1
Some of the covariate types can be defined as absorbing, meaning that once the covariate turns 1 it stays 1 (say, a diagnosis of a chronic disease). The next part shows how that is implemented.
for (i in seq_along(covnames)){
if (covtypes[i] == 'absorbing'){
# If covariate is absorbing, add a restriction for this covariate.
restrictions <- c(
restrictions[!is.na(restrictions)],
# If covariate was not 0 previously, use that value now too.
list(c(covnames[i], paste("lag1_", covnames[i], "==0", sep = ""), carry_forward, 1))
)
# And mark the covariate as binary.
covtypes[i] <- 'binary'
}
}
I skipped some more housekeeping-like code. Next the function fits all the models. Remember that these models are used to simulate the causal graph so that the interventions are sequentially exchangeable and contrasts between simulated outcomes estimate the counterfactual effects of interest. So each model has to adjust for the right confounder values over time.
if (time_points > 1){
# This function fits all the covariate models, including observed intervention.
fitcov <- pred_fun_cov(covparams = covparams,
covnames = covnames,
covtypes = covtypes,
covfits_custom = covfits_custom,
restrictions = restrictions,
time_name = time_name,
obs_data = obs_data_geq_0,
model_fits = model_fits)
names(fitcov) <- covnames
} else {
fitcov <- NULL
}
# Similarly, this function fits the outcome model. We'll look at it next.
fitY <- pred_fun_Y(ymodel,
yrestrictions,
outcome_type,
outcome_name,
time_name,
obs_data_geq_0,
model_fits = model_fits)
if (censor){
# And third, this function fits the censoring (competing event) model.
fitC <- pred_fun_D(censor_model, NA, obs_data_geq_0, model_fits = model_fits)
} else {
fitC <- NA
}
These functions are defined in another file, pred.R, so let's go there. We start from the covariate model fitting function pred_fun_cov. I copy it below and add explanations as comments.
# This is just organizing the restriction information. Go further.
if (!is.na(restrictions[[1]][[1]])){
restrictnames <- lapply(seq_along(restrictions), FUN = function(r){
restrictions[[r]][[1]]
})
conditions <- lapply(seq_along(restrictions), FUN = function(r){
restrictions[[r]][[2]]
})
} else {
restrictnames <- conditions <- NA
}
# Time 0 is not modelled. Simulation starts from the observed values.
# Predictions are made only for time steps 1 onward.
subdata <- obs_data[obs_data[[time_name]] > 0]
fits <- lapply(seq_along(covnames), FUN = function(j){
# Restrictions are handled first for each covariate.
if (!is.na(restrictions[[1]][[1]])){
if (covnames[j] %in% restrictnames){
i <- which(restrictnames %in% covnames[j])
condition <- ""
if (length(i) > 1){
for (k in i){
condition_var <- sub("<.*$", "", restrictions[[k]][[2]])
condition_var <- sub(">.*$", "", condition_var)
condition_var <- sub("=.*$", "", condition_var)
condition_var <- sub("!.*$", "", condition_var)
condition_var <- sub("%in%.*$", "", condition_var)
condition_var <- sub(" ", "", condition_var)
if (condition_var %in% names(obs_data)){
if (condition[1] == ""){
condition <- conditions[k]
} else {
# Multiple restrictions on the same covariate is possible.
# Their conditions need to be combined with OR (||).
condition <- paste(condition, conditions[k], sep = "||")
}
}
}
} else {
condition_var <- sub("<.*$", "", restrictions[[i]][[2]])
condition_var <- sub(">.*$", "", condition_var)
condition_var <- sub("=.*$", "", condition_var)
condition_var <- sub("!.*$", "", condition_var)
condition_var <- sub("%in%.*$", "", condition_var)
condition_var <- sub(" ", "", condition_var)
if (condition_var %in% names(obs_data)){
condition <- conditions[i]
}
}
if (condition != ""){
# Only data satisfying the restriction condition are modelled.
# Otherwise the value is determined by the restriction during simulation.
subdata <- subset(subdata, eval(parse(text = condition)))
}
}
}
# Now for different covariate types, different generalized linear models are fit.
# These functions are defined in the same file and we'll look at one next.
if (covtypes[j] == 'binary'){
fit_glm(covparams = covparams,
covfam = 'binomial',
obs_data = subdata,
j = j,
model_fits = model_fits)
} else if (covtypes[j] == 'normal'){
fit_glm(covparams = covparams,
covfam = 'gaussian',
obs_data = subdata,
j = j,
model_fits = model_fits)
} else if (covtypes[j] == 'categorical'){
fit_multinomial(covparams,
obs_data = subdata,
j = j,
model_fits = model_fits)
} else if (covtypes[j] == 'zero-inflated normal'){
fit_zeroinfl_normal(covparams,
covname = covnames[j],
obs_data = subdata,
j = j,
model_fits = model_fits)
} else if (covtypes[j] == 'bounded normal'){
fit_bounded_continuous(covparams,
covname = covnames[j],
obs_data = subdata,
j = j,
model_fits = model_fits)
} else if (covtypes[j] == 'truncated normal'){
fit_trunc_normal(covparams = covparams,
obs_data = subdata,
j = j,
model_fits = model_fits)
} else if (covtypes[j] == 'custom'){
covfits_custom[[j]](covparams,
covname = covnames[j],
obs_data = subdata,
j = j)
}
})
return (fits)
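The simple fitters like fit_glm aren't shown above. Judging from the call sites, a minimal sketch could look like this (my approximation, not the package's exact code):

# Approximate shape of fit_glm: fit the j-th covariate formula as a GLM with
# the given family (passed as a string like 'binomial') on the restricted data.
fit_glm_sketch <- function(covparams, covfam, obs_data, j, model_fits = FALSE) {
  fit <- stats::glm(covparams$covmodels[[j]], family = covfam, data = obs_data)
  # RMSE is attached because the simulation later uses it as the residual SD.
  fit$rmse <- sqrt(mean(stats::residuals(fit, type = "response")^2))
  fit
}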
So next we will look more closely at one generalized linear model fit to understand what all of them do. Let's see fit_zeroinfl_normal.
# This object contains the formulas.
covmodels <- covparams$covmodels
# And this one contains optional link functions. Otherwise the default will be used.
if (!is.null(covparams$covlink)){
covlink <- covparams$covlink
}
if (is.na(covlink[j])){
famtext <- "gaussian()"
} else {
famtext <- paste("gaussian(link = ", covlink[j], ")", sep = "")
}
# Make an indicator variable that says if the covariate is 0.
obs_data[, paste("I_", covname, sep = "")] <- obs_data[[covname]]
obs_data[[paste("I_", covname, sep = "")]][obs_data[[covname]] != 0] <- 1
# Covariate is log-transformed here.
obs_data[, paste("log_", covname, sep = "")] <- 0
obs_data[obs_data[[covname]] != 0][, paste("log_", covname, sep = "")] <- log(obs_data[obs_data[[covname]] != 0][[covname]])
# "control" optionally contains settings that affect the fitting process.
if (!is.null(covparams$control) && !is.na(covparams$control[j]) && !is.null(covparams$control[j][[1]])){
# Either way, this model just predicts whether the covariate is 0 or not.
fit1 <- stats::glm(stats::as.formula(paste("I_", covmodels[j], sep = "")),
family = stats::binomial(),
control = covparams$control[j][[1]],
data = obs_data)
} else {
fit1 <- stats::glm(stats::as.formula(paste("I_", covmodels[j], sep = "")),
family = stats::binomial(),
data = obs_data)
}
# Again control is just for optional fitting settings.
if (!is.null(covparams$control) && !is.na(covparams$control[j]) && !is.null(covparams$control[j][[2]])){
# This model predicts the log of the covariate based on the supplied formula.
fit2 <- stats::glm(stats::as.formula(paste("log_", covmodels[j], sep = "")),
# With Gaussian error.
family = eval(parse(text = famtext)),
control = covparams$control[j][[2]],
# Using the non-zero data only.
data = obs_data[obs_data[[covname]] != 0])
} else {
fit2 <- stats::glm(stats::as.formula(paste("log_", covmodels[j], sep = "")),
family = eval(parse(text = famtext)),
data = obs_data[obs_data[[covname]] != 0])
}
# This is just organizing the resulting object containing both model fits.
# RMSE means the root mean squared error between observed and predicted values.
# It will be used as an estimate of the standard deviation of the Gaussian distribution
# this model predicts the means for.
fit1$rmse <- add_rmse(fit1)
fit2$rmse <- add_rmse(fit2)
fit1$stderr <- add_stderr(fit1)
fit2$stderr <- add_stderr(fit2)
fit1$vcov <- add_vcov(fit1)
fit2$vcov <- add_vcov(fit2)
if (!model_fits){
fit1 <- trim_glm(fit1)
fit2 <- trim_glm(fit2)
}
return (list(fit1, fit2))
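To see the two-part idea on its own, here is a toy draw from a fitted zero-inflated normal model (my own sketch, with made-up numbers standing in for the two model fits):

# First decide whether the value is non-zero (the binomial part, fit1),
# then, if non-zero, draw a log-normal magnitude (the Gaussian part, fit2).
p_nonzero <- 0.3   # predicted probability from fit1
mu_log    <- 1.2   # predicted mean of log(value) from fit2
sd_log    <- 0.5   # fit2's RMSE, used as the standard deviation
n <- 10
I <- rbinom(n, 1, p_nonzero)
value <- I * exp(rnorm(n, mu_log, sd_log))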
So this was the pred_fun_cov machinery. How about pred_fun_Y (outcome) and pred_fun_D (censoring)? Let's look at both of them now.
This is pred_fun_Y.
# If outcome is continuous, error is Gaussian. If binary or survival, there will be
# a default link function into a binomial distribution.
if (outcome_type == 'continuous' || outcome_type == 'continuous_eof'){
outcome_fam <- stats::gaussian()
} else if (outcome_type == 'survival' || outcome_type == 'binary_eof'){
outcome_fam <- stats::binomial()
}
# If the outcome is measured at the end of follow-up,
# only data for the last time point are used in this model fit.
if (grepl('eof', outcome_type)){
obs_data <- obs_data[obs_data[[time_name]] == max(unique(obs_data[[time_name]]))]
}
if (!is.na(yrestrictions[[1]][[1]])){
condition <- yrestrictions[[1]][1]
if (length(yrestrictions) > 1){
for (yrestriction in yrestrictions[-1]){
# Multiple restrictions are combined with OR (||).
condition <- paste(condition, yrestriction[1], sep = "||")
}
}
if (grepl('continuous', outcome_type)){
# Continuous outcome is range-normalized to 0-1.
obs_data[, paste("norm_", outcome_name, sep = "")] <-
(obs_data[[outcome_name]] - min(obs_data[[outcome_name]])) /
(max(obs_data[[outcome_name]]) - min(obs_data[[outcome_name]]))
fitY <- stats::glm(stats::as.formula(paste("norm_", model, sep = "")),
family = outcome_fam,
# Only data outside of the restriction is modelled.
# Otherwise the value is determined by restriction during simulation.
data = subset(obs_data, eval(parse(text = condition))),
# This argument adds response values to the result.
y = TRUE)
} else {
fitY <- stats::glm(model,
family = outcome_fam,
data = subset(obs_data, eval(parse(text = condition))),
y = TRUE)
}
} else {
fitY <- stats::glm(model, family = outcome_fam, data = obs_data, y = TRUE)
}
# This is just adding results to the returned object.
fitY$rmse <- add_rmse(fitY)
fitY$stderr <- add_stderr(fitY)
fitY$vcov <- add_vcov(fitY)
if (!model_fits){
fitY <- trim_glm(fitY)
}
return (fitY)
This is pred_fun_D (for competing events, including censoring). It's nearly identical.
if (!is.na(compevent_restrictions[[1]][[1]])){
# Again restrictions just tell which kind of values are not modelled.
# The values are otherwise known during simulation.
condition <- compevent_restrictions[[1]][1]
if (length(compevent_restrictions) > 1){
# Multiple conditions are combined with OR (||).
for (compevent_restriction in compevent_restrictions[-1]){
condition <- paste(condition, compevent_restriction[1], sep = "||")
}
}
# This model predicts the competing event (binary) based on the supplied formula for it.
fitD <- stats::glm(model,
family = stats::binomial(),
# Data outside restrictions are modelled only.
data = subset(obs_data, eval(parse(text = condition))),
y = TRUE)
} else {
fitD <- stats::glm(model, family = stats::binomial(), data = obs_data, y = TRUE)
}
# This is just adding results to the returned object.
fitD$rmse <- add_rmse(fitD)
fitD$stderr <- add_stderr(fitD)
fitD$vcov <- add_vcov(fitD)
if (!model_fits){
fitD <- trim_glm(fitD)
}
return (fitD)
So for each covariate, models are fitted according to the covariate types, on data satisfying the restriction conditions, for time points after zero. The same happens for the outcome and the competing events (censoring or otherwise), accounting for their restrictions and, for end-of-follow-up outcomes, using only the last time point.
Now we get back to the gformula_binary_eof function. The next step is related to bootstrapping again, so we skip it and continue to some intervention-related code.
# Natural course is included in the list of interventions by default.
# It basically asks you to imagine that all the observed values
# could be known in advance, and then used as the intervention. You can use this
# as a reference for more interesting strategies of your own making.
# Anyhow, it's defined in the function "natural" which is actually empty.
# Natural course is the default, and other kinds of intervention functions then modify it.
if (!is.null(interventions)){
comb_interventions <- c(list(list(c(natural))), interventions)
comb_intvars <- c(list('none'), intvars)
} else {
comb_interventions <- list(list(c(natural)))
comb_intvars <- list('none')
}
# "int_times" refers to the time points where the intervention is applied.
# By default, they are applied at every time point (0 to time_points - 1).
if (is.null(int_times)){
comb_int_times <- list()
for (i in seq_along(comb_interventions)){
comb_int_times[[i]] <- lapply(seq_along(comb_interventions[[i]]),
FUN = function(i) {0:(time_points - 1)})
}
} else {
comb_int_times <- c(list(list(0:(time_points - 1))), int_times)
}
if (is.null(int_visit_type)){
int_visit_type <- rep(TRUE, length(comb_interventions))
} else {
# It's possible to carry forward the natural value instead of the intervention value
# when using the "carry_forward" restriction on intervention variables.
int_visit_type <- c(TRUE, int_visit_type)
}
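The intervention functions themselves are small. Matching the call signature intfunc(newdf, pool, intervention, intvar, int_time, time_name, t) seen later in simulate, a static-style intervention could look roughly like this sketch (not the package's exact code):

# At each intervention time, overwrite the treatment column in the simulated
# data with the value supplied for that time (e.g. intervention = c(1,1,1,1)).
static_sketch <- function(newdf, pool, intervention, intvar, int_time,
                          time_name, t) {
  if (t %in% int_time) {
    data.table::set(newdf, j = intvar, value = intervention[t + 1])
  }
}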
Before moving on, we should check the history functions. Take lagged as an example:
for (i in histvals){
# Baseline value.
if (t < i & baselags & !below_zero_indicator){
current_ids <- unique(pool[pool[[time_name]] == t][[id_name]])
lapply(histvars, FUN = function(histvar){
classtmp <- class(pool[pool[[time_name]] == t][[histvar]])
myclass <- paste('as.', classtmp, sep = "")
pool[pool[[time_name]] == t, (paste("lag", i, "_", histvar, sep = "")) :=
get(myclass)(pool[pool[[time_name]] == 0 & get(id_name) %in% current_ids][[histvar]])]
})
# Reference value or zero.
} else if (t < i & !below_zero_indicator){
lapply(histvars, FUN = function(histvar){
classtmp <- class(pool[pool[[time_name]] == t][[histvar]])
myclass <- paste('as.', classtmp, sep = "")
if (is.factor(pool[pool[[time_name]] == t][[histvar]])){
reflevel <- levels(pool[pool[[time_name]] == t][[histvar]])[1]
pool[pool[[time_name]] == t, (paste("lag", i, "_", histvar, sep = "")) := reflevel]
} else {
pool[pool[[time_name]] == t, (paste("lag", i, "_", histvar, sep = "")) := get(myclass)(0)]
}
})
# Previous value.
} else {
current_ids <- unique(pool[pool[[time_name]] == t][[id_name]])
lapply(histvars, FUN = function(histvar){
classtmp <- class(pool[pool[[time_name]] == t][[histvar]])
myclass <- paste('as.', classtmp, sep = "")
pool[pool[[time_name]] == t, (paste("lag", i, "_", histvar, sep = "")) :=
get(myclass)(pool[pool[[time_name]] == t - i & get(id_name) %in% current_ids][[histvar]])]
})
}
}
Okay, we are now clearly getting to the meat of this function: the simulation. This will also help us see how all the small pieces fit together.
if (parallel){
# We skip the parallel version.
} else {
# There will be a simulation for each supplied intervention.
pools <- lapply(seq_along(comb_interventions), FUN = function(i){
simulate(fitcov = fitcov,
fitY = fitY,
fitD = NA,
yrestrictions = yrestrictions,
compevent_restrictions = compevent_restrictions,
restrictions = restrictions,
outcome_name = outcome_name,
compevent_name = compevent_name,
time_name = time_name,
intvars = comb_intvars[[i]],
interventions = comb_interventions[[i]],
int_times = comb_int_times[[i]],
histvars = histvars,
histvals = histvals,
histories = histories,
covparams = covparams,
covnames = covnames,
covtypes = covtypes,
covpredict_custom = covpredict_custom,
basecovs = basecovs,
comprisk = comprisk,
ranges = ranges,
outcome_type = outcome_type,
subseed = subseed,
time_points = time_points,
obs_data = obs_data,
parallel = parallel,
baselags = baselags,
below_zero_indicator = below_zero_indicator,
min_time = min_time,
show_progress = FALSE,
int_visit_type = int_visit_type[i],
...)
})
}
So next we will look closely at this simulate function in the file simulate.R.
# These are just setting up variables. Go to the start of the loop.
if (!is.null(fitcov)){
rmses <- lapply(seq_along(fitcov),
FUN = rmse_calculate,
fits = fitcov,
covnames = covnames,
covtypes = covtypes)
}
ids_unique <- unique(obs_data$newid)
data_len <- length(ids_unique)
restrict_ids <- rep(0, data_len)
restrict_counts <- rep(list(rep(0, data_len)), length(restrictions))
if (!is.na(restrictions[[1]][[1]])){
restrict_covs <- lapply(restrictions, FUN = function(restriction){restriction[[1]]})
}
# TRUE when simulating the natural course (no intervention).
nat_course <- length(intvar) == 1 && intvar == 'none'
# History functions for intervention variables.
if (!nat_course) {
intvar_vec <- unique(unlist(intvar))
histvars_int <- histories_int <- rep(list(NA), length(histvars))
for (l in seq_along(histvars)){
histvars_temp <- histvars[[l]][histvars[[l]] %in% intvar_vec]
if (length(histvars_temp) > 0){
histvars_int[[l]] <- histvars_temp
histories_int[[l]] <- histories[[l]]
}
}
histvars_int <- histvars_int[!is.na(histvars_int)]
histories_int <- histories_int[!is.na(histories_int)]
}
# Simulation goes from time 0 to the last time point.
for (t in ((1:time_points) - 1)){
# The initial time point is different from the rest.
if (t == 0){
# Covariates (confounders and observed intervention) take their observed values at time 0 and before.
# Note that time can be negative. All these are baseline data.
if (!is.na(basecovs[[1]])){
pool <- obs_data[obs_data[[time_name]] <= t, ][, .SD, .SDcols = c(covnames, basecovs, time_name)]
} else {
pool <- obs_data[obs_data[[time_name]] <= t, ][, .SD, .SDcols = c(covnames, time_name)]
}
set(pool, j = 'id', value = rep(ids_unique, each = 1 - min_time))
# Eligibility criteria must be met at time 0.
set(pool, j = 'eligible_pt', value = TRUE)
# This just reorders the columns.
if (!is.na(basecovs[[1]])){
setcolorder(pool, c('id', time_name, covnames, basecovs))
} else {
setcolorder(pool, c('id', time_name, covnames))
}
newdf <- pool[pool[[time_name]] == 0]
if (!nat_course){
mycols <- match(intvar, names(newdf))
temp_intvar <- newdf[, ..mycols]
# For interventions whose natural values will carry forward,
# a new "_natural" suffixed variable is created here.
if (!int_visit_type){
for (var in intvar){
newdf[, eval(paste0(var, '_natural')) := newdf[[var]]]
}
}
}
# This runs the intervention function and adds results to simulated data.
# E.g. "static" just sets the value given in the "intervention" object.
# But "threshold" is dynamic, thresholding the observed value within a range.
intfunc(newdf, pool = pool, intervention, intvar, unlist(int_time), time_name, t)
# New variable "intervened" is added. This is true when the observed value
# is the same as the intervention value that was just added above.
# For natural course it's false (0).
intervened <- rep(0, times = nrow(newdf))
if (!nat_course){
for (var in intvar){
intervened <- intervened + (abs(temp_intvar[[var]] - newdf[[var]]) > 1e-6)
}
intervened <- ifelse(newdf$eligible_pt, intervened >= 1, NA)
}
set(newdf, j = 'intervened', value = intervened)
# "pool" recorded the observed covariate values up to this time point.
# Here the intervention values are added to the simulation.
if (ncol(newdf) > ncol(pool)){
pool <- rbind(pool[pool[[time_name]] < t], newdf, fill = TRUE)
pool <- pool[order(id, get(time_name))]
} else {
# Only this branch can run.
pool[pool[[time_name]] == t] <- newdf
}
# Values for lagged etc. history terms are generated here.
# All the models expect them.
make_histories(pool = pool,
histvars = histvars,
histvals = histvals,
histories = histories,
time_name = time_name,
t = t,
id = 'id',
max_visits = max_visits,
baselags = baselags,
below_zero_indicator = below_zero_indicator)
newdf <- pool[pool[[time_name]] == t]
# If the outcome is survival, the outcome will be predicted for every time point.
# Otherwise the outcome will be predicted only at the last time point (end of follow-up).
# Note we are still in the t==0 block, so for end-of-follow-up outcomes nothing is predicted yet (unless there's only one time point).
if (outcome_type == 'survival'){
set(newdf, j = 'Py', value = stats::predict(fitY, type = 'response', newdata = newdf))
} else if (outcome_type == 'continuous_eof'){
if (t < (time_points - 1)){
set(newdf, j = 'Ey', value = as.double(NA))
} else if (t == (time_points - 1)){
set(newdf, j = 'Ey', value = stats::predict(fitY, type = 'response', newdata = newdf))
}
} else if (outcome_type == 'binary_eof'){
if (t < (time_points - 1)){
set(newdf, j = 'Py', value = as.double(NA))
} else if (t == (time_points - 1)){
set(newdf, j = 'Py', value = stats::predict(fitY, type = 'response', newdata = newdf))
}
}
# For survival outcomes that already have predictions at time 0, restrictions are applied.
# On rows that don't satisfy the given condition, predicted values are replaced
# with supplied values.
if (!is.na(yrestrictions[[1]][[1]])){
for (yrestriction in yrestrictions){
if (outcome_type == 'survival'){
newdf[
!eval(parse(text = paste("newdf$", yrestriction[1]))),
"Py" := as.double(yrestriction[2])
]
}
}
}
# Predictions for binary variables are probabilities, so the actual value
# needs to be sampled randomly from the corresponding distribution.
# In other words, it has to be simulated.
if (outcome_type == 'survival'){
set(newdf, j = 'Y', value = stats::rbinom(data_len, 1, newdf$Py))
}
# Next, competing events are handled for survival outcomes that have predictions.
# Simply, the probability of a competing event is predicted with the model and
# a value is simulated from the distribution. For those who got the event, the outcome is set to missing.
if (outcome_type == 'survival')
{
if (comprisk){
set(newdf, j = 'Pd', value = stats::predict(fitD, type = 'response', newdata = newdf))
# Restrictions are handled as before. Again, they mean that
# when the given condition doesn't hold, predicted probability is replaced with
# the user supplied value. The user must have some deterministic knowledge.
if (!is.na(compevent_restrictions[[1]][[1]])){
for (compevent_restriction in compevent_restrictions){
newdf[
!eval(parse(text = compevent_restriction[1])),
"Pd" := as.double(compevent_restriction[2])
]
}
}
set(newdf, j = 'D', value = stats::rbinom(data_len, 1, newdf$Pd))
set(newdf, j = 'prodp1', value = newdf$Py * (1 - newdf$Pd))
set(newdf, j = 'prodd0', value = 1 - newdf$Pd)
} else {
# No competing events (D is for death).
set(newdf, j = 'D', value = 0)
set(newdf, j = 'prodp1', value = newdf$Py)
}
# If a competing event occurred in the simulation, set the outcome to missing (NA).
set(newdf[newdf$D == 1], j = 'Y', value = NA)
set(newdf, j = 'prodp0', value = 1 - newdf$Py)
set(newdf, j = 'poprisk', value = newdf$prodp1)
}
# Outcome predictions are again added to the "pool" that records the simulation.
pool <- rbind(pool[pool[[time_name]] < t], newdf, fill = TRUE)
pool <- pool[order(id, get(time_name))]
col_types <- sapply(pool, class)
# Now we have simulated data for time zero.
# This block is for all t > 0.
} else {
# Start from an initial state for the simulation from the previous time step.
newdf <- pool[pool[[time_name]] == t - 1]
set(newdf, j = time_name, value = rep(t, data_len))
if ('categorical time' %in% covtypes){
time_name_f <- paste(time_name, "_f", sep = "")
newdf[, (time_name_f) := obs_data[get(time_name) == t, get(time_name_f)][1]]
}
# "pool" keeps track of all time steps.
pool <- rbind(newdf, pool)
# Again generate values for lagged etc. history terms in the models.
make_histories(pool = pool,
histvars = histvars,
histvals = histvals,
histories = histories,
time_name = time_name,
t = t,
id = 'id',
max_visits = max_visits,
baselags = baselags,
below_zero_indicator = below_zero_indicator)
# Now each covariate value (including observed intervention) will be simulated (predicted).
# Note that only data for the current time point is used but it contains the specified histories.
newdf <- pool[pool[[time_name]] == t]
for (i in seq_along(covnames)){
cast <- get(paste0('as.',unname(col_types[covnames[i]])))
if (covtypes[i] == 'binary'){
set(
newdf,
j = covnames[i],
value = cast(
# Prediction works just like before. Model predicts the probability and
# then the value is sampled from the distribution.
predict_binomial(data_len, 1, stats::predict(fitcov[[i]], type = 'response', newdata = newdf))
)
)
} else if (covtypes[i] == 'normal'){
set(
newdf,
j = covnames[i],
value = cast(
predict_normal(data_len, stats::predict(fitcov[[i]], type = 'response', newdata = newdf), est_sd = rmses[[i]])
)
)
} else if (covtypes[i] == 'categorical'){
# Skipped.
} else if (covtypes[i] == 'zero-inflated normal'){
# Zero-inflated normal is just a combination (mixture) of a binary zero-indicator and a Gaussian.
set(
newdf,
j = paste("I_", covnames[i], sep = ""),
value = predict_binomial(data_len, 1, stats::predict(fitcov[[i]][[1]], type = 'response', newdata = newdf))
)
set(
newdf,
j = covnames[i],
value = cast(
# For non-zeros, the model predicts a logged value which needs to be exponentiated.
exp(predict_normal(
data_len,
stats::predict(fitcov[[i]][[2]], type = 'response', newdata = newdf),
# Sampling from a Gaussian requires a mean (predicted value) but also
# a standard deviation. The root mean square errors are used for this.
est_sd = rmses[[i]]
))
)
)
set(newdf, j = covnames[i], value = cast(newdf[[paste("I_", covnames[i], sep = "")]] * newdf[[covnames[i]]]))
set(newdf, j = paste("I_", covnames[i], sep = ""), value = NULL)
} else if (covtypes[i] == 'bounded normal'){
# Very long because of restrictions. Not relevant now.
} else if (covtypes[i] == 'truncated normal'){
if (covparams$direction[i] == 'left'){
set(newdf, j = covnames[i],
value = cast(predict_trunc_normal(data_len, stats::predict(fitcov[[i]], type = 'response', newdata = newdf), est_sd = rmses[[i]], a = covparams$point[i], b = Inf)))
} else if (covparams$direction[i] == 'right'){
set(newdf, j = covnames[i],
value = cast(predict_trunc_normal(data_len, stats::predict(fitcov[[i]], type = 'response', newdata = newdf), est_sd = rmses[[i]], a = - Inf, b = covparams$point[i])))
}
} else if (covtypes[i] == 'custom'){
# Deleted custom here.
}
# Restrictions are applied to the simulated covariate values now.
# Predicted values are replaced by the restriction-implied values.
if (!is.na(restrictions[[1]][[1]])){
lapply(seq_along(restrictions), FUN = function(r){
# First piece tells the covariate name.
if (restrictions[[r]][[1]] == covnames[i]){
# Second piece tells the condition under which restrictions apply.
restrict_ids <- newdf[!eval(parse(text = restrictions[[r]][[2]]))]$id
if (length(restrict_ids) != 0){
# Third piece gives the function to apply.
restrictions[[r]][[3]](
newdf,
pool[pool[[time_name]] < t & pool[[time_name]] >= 0],
restrictions[[r]],
time_name,
t,
int_visit_type,
intvar
)
}
}
})
}
# Covariate results are added to the record.
pool[pool[[time_name]] == t] <- newdf
# Again, lagged etc. history terms are added to the data for the models.
if (covnames[i] %in% unlist(histvars)){
ind <- unlist(lapply(histvars, FUN = function(x) {
covnames[i] %in% x
}))
make_histories(pool = pool,
histvars = rep(list(covnames[i]),
sum(ind)),
histvals = histvals,
histories = histories[ind],
time_name = time_name,
t = t,
id = 'id',
max_visits = max_visits,
baselags = baselags,
below_zero_indicator = below_zero_indicator)
newdf <- pool[pool[[time_name]] == t]
}
}
# Intervention is again the same as at time 0.
newdf <- pool[pool[[time_name]] == t]
if (!nat_course){
mycols <- match(intvar, names(newdf))
temp_intvar <- newdf[, ..mycols]
if (!int_visit_type){
for (var in intvar){
newdf[, eval(paste0(var, '_natural')) := newdf[[var]]]
}
}
}
# This adds the intervention values to the simulation data.
intfunc(newdf, pool, intervention, intvar, unlist(int_time), time_name, t)
# "intervened" and "make_histories" skipped.
# We have covariates and interventions and their terms so next is outcome modelling.
# Again only data for this time point is used but histories carry the previous "lag1" values.
newdf <- pool[pool[[time_name]] == t]
if (outcome_type == 'survival'){
if (comprisk){
# Probability of a competing event is predicted at this time point.
set(newdf, j = 'Pd', value = stats::predict(fitD, type = 'response', newdata = newdf))
# Restrictions are applied to the predictions of competing events.
if (!is.na(compevent_restrictions[[1]][[1]])){
for (compevent_restriction in compevent_restrictions){
newdf[!eval(parse(text = compevent_restriction[1])), "Pd" := as.double(compevent_restriction[2])]
}
}
# Competing event is predicted.
set(newdf, j = 'D', value = stats::rbinom(data_len, 1, newdf$Pd))
} else {
set(newdf, j = 'D', value = 0)
}
# Probability of outcome is predicted for this time point.
set(newdf, j = 'Py', value = stats::predict(fitY, type = 'response', newdata = newdf))
} else if (outcome_type == 'continuous_eof'){
if (t < (time_points - 1)){
set(newdf, j = 'Ey', value = as.double(NA))
} else if (t == (time_points - 1)){
# Only when it's the last time point, end-of-follow-up outcome mean is predicted.
set(newdf, j = 'Ey', value = stats::predict(fitY, type = 'response', newdata = newdf))
}
} else if (outcome_type == 'binary_eof'){
if (t < (time_points - 1)){
set(newdf, j = 'Py', value = as.double(NA))
} else if (t == (time_points - 1)){
# Only when it's the last time point, end-of-follow-up outcome probability is predicted.
set(newdf, j = 'Py', value = stats::predict(fitY, type = 'response', newdata = newdf))
}
}
# Predicted outcome values are replaced with supplied values if they don't satisfy
# the condition supplied with outcome restriction.
if (!is.na(yrestrictions[[1]][[1]])){
for (yrestriction in yrestrictions){
if (outcome_type == 'survival'){
newdf[!eval(parse(text = paste("newdf$", yrestriction[1]))), "Py" := as.double(yrestriction[2])]
} else if (outcome_type == 'continuous_eof' && t == (time_points - 1)){
newdf[!eval(parse(text = paste("newdf$", yrestriction[1]))), "Ey" := as.double(yrestriction[2])]
} else if (outcome_type == 'binary_eof' && t == (time_points - 1)){
newdf[!eval(parse(text = paste("newdf$", yrestriction[1]))), "Py" := as.double(yrestriction[2])]
}
}
}
# Finally for survival outcomes, the predictions are turned into realizations by sampling and censoring.
if (outcome_type == 'survival'){
set(newdf, j = 'Y', value = stats::rbinom(data_len, 1, newdf$Py))
newdf[newdf$D == 1, 'Y' := NA]
if (comprisk){
# Note that cumulative risk under competing events has to include the competing event
# probability into the calculation. It follows basic probability theory: AND is multiplication.
set(newdf,
j = 'prodp1',
# Outcome at t AND no competing event at t AND no outcome up to t-1 AND no competing event up to t-1
value = newdf$Py * (1 - newdf$Pd) * pool[pool[[time_name]] == t - 1, ]$prodp0 * pool[pool[[time_name]] == t - 1, ]$prodd0)
set(newdf, j = 'prodd0', value = (1 - newdf$Pd) * pool[pool[[time_name]] == t - 1,]$prodd0)
} else {
set(newdf,
j = 'prodp1',
value = newdf$Py * pool[pool[[time_name]] == t - 1,]$prodp0)
}
set(newdf, j = 'prodp0', value = (1 - newdf$Py) * pool[pool[[time_name]] == t - 1,]$prodp0)
set(newdf, j = 'poprisk', value = pool[pool[[time_name]] == t - 1, ]$poprisk + newdf$prodp1)
}
# New values are again added to the record for this time point.
pool[pool[[time_name]] == t] <- newdf
}
}
colnames(pool)[colnames(pool) == time_name] <- 't0'
setorder(pool, id, t0)
colnames(pool)[colnames(pool) == 't0'] <- time_name
pool <- pool[pool[[time_name]] >= 0]
# Survival is one minus the "poprisk", which is a cumulative sum over time of
# each time point's hazard multiplied by the probability of having remained
# event-free until then. It is calculated above from the predicted
# probabilities of events at each time point.
if (outcome_type == 'survival'){
pool[, 'survival' := 1 - pool$poprisk]
}
pool2 <- copy(pool)
if (show_progress){
pb$tick()
}
return (pool2)
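Written out, the survival bookkeeping above is the standard discrete-time risk calculation. With $h_t$ the simulated outcome hazard (Py) and $h^D_t$ the competing-event hazard (Pd) at time $t$:

$$\text{risk}(K) = \sum_{t=0}^{K} h_t \,(1 - h^D_t) \prod_{s=0}^{t-1} (1 - h_s)(1 - h^D_s), \qquad \text{survival}(K) = 1 - \text{risk}(K)$$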
Now, the rest of the main function gformula_binary_eof is for calculating the means and their comparisons, and for organizing all the results.
# "pools" is returned by "simulate".
nat_pool <- pools[[1]]
pools <- pools[-1]
result_ratio <- result_diff <- int_result <- rep(NA, length(pools) + 1)
# Mean outcome probability for natural course intervention.
nat_result <- mean(nat_pool$Py, na.rm = TRUE)
if (ref_int == 0){
ref_mean <- nat_result
} else {
# Mean outcome probability of the reference intervention.
ref_mean <- mean(pools[[ref_int]]$Py, na.rm = TRUE)
}
int_result[1] <- nat_result
# Mean outcome probability for all other interventions.
int_result[-1] <- sapply(pools, FUN = function(pool){mean(pool$Py, na.rm = TRUE)})
# Ratio and difference of all intervention means compared to reference.
# These are finally the effect estimates.
result_ratio <- int_result / ref_mean
result_diff <- int_result - ref_mean
percent_intervened_res <- get_percent_intervened(pools = pools)
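For example, with hypothetical numbers:

# Say the simulated mean outcome probability is 0.30 under the reference
# ('Never treat') and 0.24 under 'Always treat'.
0.24 / 0.30  # ratio:      0.8
0.24 - 0.30  # difference: -0.06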
Now we have a single effect estimate for each intervention compared to the reference. To estimate the uncertainty of these estimates, bootstrapping is used; I skip the code related to it. Result packaging is the only thing left now.
# Bootstrapping skipped.
# Confidence intervals are computed based on the bootstrap results.
if (ci_method == 'normal'){
ci_lb_result <- t(int_result) - stats::qnorm(0.975)*se_result[,-c('t0')]
ci_lb_MR <- t(int_result) - stats::qnorm(0.975)*se_MR[,-c('t0')]
ci_lb_MD <- t(int_result) - stats::qnorm(0.975)*se_MD[,-c('t0')]
ci_ub_result <- t(int_result) + stats::qnorm(0.975)*se_result[,-c('t0')]
ci_ub_MR <- t(int_result) + stats::qnorm(0.975)*se_MR[,-c('t0')]
ci_ub_MD <- t(int_result) + stats::qnorm(0.975)*se_MD[,-c('t0')]
}
if (ci_method == 'percentile') {
ci_lb_result <- comb_result[, lapply(.SD, stats::quantile, probs = 0.025), by = t0]
ci_lb_MR <- comb_MR[, lapply(.SD, stats::quantile, probs = 0.025), by = t0]
ci_lb_MD <- comb_MD[, lapply(.SD, stats::quantile, probs = 0.025), by = t0]
ci_ub_result <- comb_result[, lapply(.SD, stats::quantile, probs = 0.975), by = t0]
ci_ub_MR <- comb_MR[, lapply(.SD, stats::quantile, probs = 0.975), by = t0]
ci_ub_MD <- comb_MD[, lapply(.SD, stats::quantile, probs = 0.975), by = t0]
}
}
# Tabulating and packaging results skipped here.
# Finally, an object like this will be returned which contains all the results.
res <- list(
result = resultdf,
coeffs = coeffs,
stderrs = stderrs,
vcovs = vcovs,
rmses = rmses,
fits = fits,
sim_data = sim_data,
IP_weights = obs_results$w,
bootests = bootests,
bootcoeffs = bootcoeffs,
bootstderrs = bootstderrs,
bootvcovs = bootvcovs,
time_name = time_name,
time_points = time_points,
covnames = covnames,
covtypes = covtypes,
# Results include visualizations.
dt_cov_plot = plot_info$dt_cov_plot,
dt_out_plot = plot_info$dt_out_plot,
header = header
)
class(res) <- c("gformula_binary_eof", "gformula")
return (res)
}