
Learning Probabilistic Programming in Stan From 10 Examples

Published: 2024-02-03
Updated: 2024-02-03

Many of the examples are from Stan’s User’s Guide (version 2.34).

0. Installation

The easiest way to install Stan might be from CmdStan's GitHub Releases page. Just select the correct pre-built tarball (.tar.gz file), unpack it, and run make build. If you're lucky, everything is then ready to compile Stan programs to C++, C++ to executables – and to run the executables.
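For example, on Linux or macOS this looks roughly like the following (the version number is just a placeholder – use whatever the current release is):

### Unpack the pre-built release tarball (the file name depends on the version).
tar -xzf cmdstan-2.34.1.tar.gz
cd cmdstan-2.34.1
### Build the CmdStan tools (the stanc compiler, stansummary, and so on).
make build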

Direct interfaces to different programming languages exist, but you can just use shell scripting to orchestrate the full workflow. Simplified interfaces also exist (like brms in R), but they hide pretty important aspects of the workflow.

I think it’s best to just learn Stan and keep it simple on the command line – any scientist or analyst can do it if they just make the decision.

1. Bernoulli distribution

The command-line interface to Stan – CmdStan – includes the following example.

The data contains 10 observations of a single variable with two possible values 0 and 1. CmdStan accepts data in the JSON format.

{
    "N": 10,
    "y": [0,1,0,0,0,0,0,0,0,1]
}

The model represents our assumptions and knowledge about the data generating mechanism. Here we assume that the data are independent random samples from the same 0/1-valued distribution with some probability between 0 and 1 for getting the value 1. In advance, we assume that any probability of getting 1 is equally likely.

// file for example: /path/to/bernoulli.stan

// Stan code consists of specific curly-braced blocks: data, transformed data, 
// parameters, transformed parameters, model, generated quantities, functions...

// In data, we need to declare types of the data variables.
data {
  // N is an integer from 0 to infinity.
  int<lower=0> N;

  // y is an array of length N containing integers between 0 and 1.
  array[N] int<lower=0,upper=1> y;
}

// In parameters, we declare types of the parameter variables.
parameters {
  // 'prob' is a real number between 0 and 1.
  real<lower=0,upper=1> prob;
}

// In model, we describe how data and parameters are generated from distributions.
model {
  // I assume that you know something about distributions or just ignore them.
  // 'prob' comes from a Beta distribution with both parameters set to 1 (uniform on [0, 1]).
  prob ~ beta(1, 1);

  // y comes from a Bernoulli distribution with chance of 'prob' to get 1.
  y ~ bernoulli(prob);
}

And using CmdStan, this Stan model in /path/to/bernoulli.stan can be compiled and run with the following commands:

### Compile the Stan file to an executable binary.
### Note that the 'make' command works only in the root of the CmdStan installation (see its makefile).
make /path/to/bernoulli
### Check all the subcommands and arguments that these executables accept.
/path/to/bernoulli help-all
### Get the posterior distribution.
/path/to/bernoulli sample \
                   data file=/path/to/bernoulli.data.json \
                   output file=/path/to/output.csv

First, look at the terminal output given by CmdStan.

method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.05 (Default)
      delta = 0.8 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
      save_metric = 0 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
    num_chains = 1 (Default)
id = 1 (Default)
data
  file = /home/e/stan-examples/bernoulli.data.json
init = 2 (Default)
random
  seed = 94865813 (Default)
output
  file = /home/e/stan-examples/bernoulli.output.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = profile.csv (Default)
  save_cmdstan_config = 0 (Default)
num_threads = 1 (Default)


Gradient evaluation took 5e-06 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  100 / 2000 [  5%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  300 / 2000 [ 15%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  500 / 2000 [ 25%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  700 / 2000 [ 35%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration:  900 / 2000 [ 45%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1100 / 2000 [ 55%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1300 / 2000 [ 65%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1500 / 2000 [ 75%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1700 / 2000 [ 85%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 1900 / 2000 [ 95%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.009 seconds (Warm-up)
               0.015 seconds (Sampling)
               0.024 seconds (Total)

The output file contains the posterior distribution of prob as 1000 samples (called draws), coupled with some metrics that one can use to evaluate the sampling process. I'll show the top of the CSV data below.

lp__,accept_stat__,stepsize__,treedepth__,n_leapfrog__,divergent__,energy__,prob
### Adaptation terminated
### Step size = 1.05164
### Diagonal elements of inverse mass matrix:
### 0.523751
-6.84899,1,1.05164,1,1,0,7.0244,0.196819
-6.75482,0.993712,1.05164,1,3,0,6.85091,0.264766
-7.24197,0.828289,1.05164,2,3,0,7.61544,0.141269
-7.2419,1,1.05164,1,1,0,7.36519,0.141275
-6.75917,0.946023,1.05164,2,3,0,7.39212,0.268964
-7.45293,0.823243,1.05164,1,3,0,7.45876,0.413672
-6.86534,0.978925,1.05164,1,3,0,7.53973,0.192941
-6.86534,0.862008,1.05164,1,3,0,7.54765,0.192941
-6.84792,1,1.05164,2,3,0,6.8861,0.197085
-6.74805,0.983857,1.05164,2,3,0,6.93031,0.250899
-6.7486,0.999858,1.05164,1,3,0,6.74861,0.254279
-6.83931,0.928131,1.05164,2,3,0,7.04533,0.305763
...

CmdStan also comes with a tool for summarizing the output. This is good for a quick look, but normally you should read the results into some other data science tool and continue from there. Visualization is key.

bin/stansummary /path/to/bernoulli.output.csv

Running that we get…

Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took 0.0090 seconds
Sampling took 0.015 seconds

                Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s  R_hat

lp__            -7.3  3.3e-02     0.73   -8.6  -7.0  -6.7    471    31408   1.00
accept_stat__   0.90  6.4e-03  1.6e-01   0.54  0.96   1.0    609    40595   1.00
stepsize__       1.1      nan  8.9e-16    1.1   1.1   1.1    nan      nan    nan
treedepth__      1.4  1.7e-02  4.9e-01    1.0   1.0   2.0    853    56856   1.00
n_leapfrog__     2.3  3.6e-02  9.6e-01    1.0   3.0   3.0    711    47397    1.0
divergent__     0.00      nan  0.0e+00   0.00  0.00  0.00    nan      nan    nan
energy__         7.8  5.3e-02  1.1e+00    6.8   7.5   9.9    402    26793   1.00

prob            0.24  6.7e-03     0.12  0.084  0.22  0.46    298    19869   1.00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

I don’t cover any evaluation methods (or model criticism, as it's sometimes called in Bayesian analysis) in this post, so let's just assume that everything checks out. In general, you should know that Bayesian modelling is a very iterative process where you need to use different kinds of evaluation methods all the time. The evaluation methods help you make more and more accurate models. In fact, the MCMC sampling itself is a good evaluation method – the sampling is often slow and weird when your model is not good at all.

The result? In short, given our assumptions and observations, the probability of getting a 1 is about 24% (posterior mean), somewhere between 8% and 46% (90% credible interval).
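As a sanity check, this particular prior and likelihood happen to be conjugate, so the exact posterior is available: with a Beta(1,1) prior and 2 ones out of 10 observations, the posterior is Beta(1 + 2, 1 + 8) = Beta(3, 9), whose mean is 3 / (3 + 9) = 0.25. The sampler's 0.24 is right where it should be.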

2. Logistic regression

I showed with model 1 how to represent data and how to compile and run Stan via the CmdStan interface. For the rest of the examples, I'll focus more on the models themselves.

Let’s consider simple data from two variables like this:

{
  "N": 100,
  "x": [0.2,0.6,1,-1,0.2,0.6,...],
  "y": [1,1,0,0,1,0,0,0,1,0,1,...]
}

For the model, let’s again assume that the y values are generated independently from the same Bernoulli distribution with some probability. However, this time the probability is somehow based on (or caused by) the value of x associated with the same observation. Say we assume the usual linear transformation of x, a + b*x, representing a linear relationship between y and x. But we also need to make sure that the Bernoulli probability is between 0 and 1. A common way is to treat a + b*x as a log-odds value and apply the inverse-logit (logistic) function to turn it into a probability. With these transformations, the final relationship between x and the probability of y is an S-curve – close to the linear pattern we originally wanted to assume.
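In symbols, prob = inv_logit(a + b*x) = 1 / (1 + exp(-(a + b*x))), which maps any real-valued log-odds to a probability between 0 and 1.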

Finally, we can turn these assumptions into a Stan model.

data {
  // N is an integer between 0 and infinity.
  int<lower=0> N;
  // x is a vector of length N.
  // Vector-type is needed for linear algebra operations. An array of reals wouldn't work.
  vector[N] x;
  // y is an array of integers between 0 and 1.
  array[N] int<lower=0, upper=1> y;
}

parameters {
  // a and b are real numbers.
  real a, b;
}

model {
  // a and b are sampled (~) from normal distributions with very wide priors.
  a ~ normal(0, 10);
  b ~ normal(0, 10);

  // Note that this is not a fully generative model.
  // We are not modelling how x is generated. It's just fixed input data.

  // Declare types of all variables used.
  vector[N] log_odds;
  vector[N] prob;

  // log_odds is defined as (=) the linear transformation.
  log_odds = a + b * x;
  // Taking the inverse gives a real number between 0 and 1.
  prob = inv_logit(log_odds);
  // inv_logit was vectorized in version 2.13. Documentation helps to avoid issues.

  // y is sampled from (~) independent Bernoulli distributions.
  y ~ bernoulli(prob);

  // Stan often implements common models as a single, more efficient function.
  // In this case: y ~ bernoulli_logit(a + b * x), which fuses the inverse-logit
  // and the Bernoulli (bernoulli_logit_glm does the same for matrix-valued x).
}

3. Exponential survival regression

In the next dataset we have followed 300 units over time and recorded the time when they experienced an event (100 did) or when the follow-up ended (200 were censored). All units were followed only up to a single maximum duration (for simplicity) – time point 100. In addition, five variables were measured about the units in order to study their relationship with time-to-event.

{
  "N": 100,
  "N_cens: 200,
  "K": 5,
  "t": [34.3,12.12,26.9,14.4,78.2,54.5,96.8,...],
  "t_cens": 99.0,
  "x": [[1.0,2.0,3.0,4.0,5.0],
        [6.0,2.0,3.0,8.0,5.0],
        [6.0,2.0,4.0,5.0,6.0], 
         ...],
  "x_cens": [[1.0,2.0,3.0,4.0,5.0],
             [6.0,2.0,3.0,8.0,5.0],
             [6.0,2.0,4.0,5.0,6.0], 
               ...]
}

Let’s assume that the risk of an event is constant over time for each unit. In this case, time-to-event has an exponential distribution (in other words, it models the time between events in a Poisson process). We also assume that the event rate (or its inverse, called the scale) depends on the five given explanatory variables in a linear fashion, b * x; since the rate has to be positive, we pass it through the natural exponential function, exp(b * x).

Censoring means that we don’t know the exact time-to-event value for a particular observation. But we do know that this value is larger than the censoring time. Hence, instead of using the density at the observed event time like for the uncensored observations, we use the "larger than" probability for the censored ones.
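Concretely, if the event rate is lambda = exp(x * b), then an observed time t contributes the density lambda * exp(-lambda * t), while a censored observation contributes the exceedance probability P(T > t_cens) = exp(-lambda * t_cens). On the log scale, these are exactly what exponential_lpdf and exponential_lccdf add to the target below.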

data {
  // K, N, N_cens are integers from 0 to infinity
  int<lower=0> K, N, N_cens;

  // t is a vector of length N.
  vector[N] t;
  // t_cens is a real from 0 to infinity.
  real<lower=0> t_cens;

  // x is a matrix with N rows and K columns.
  matrix[N, K] x;
  // x_cens is a matrix with N_cens rows and K columns.
  matrix[N_cens, K] x_cens;
}

parameters {
  // b is a vector of length K. 
  // One real number coefficient for each variable in x and x_cens.
  vector[K] b;
}

model {
  // Every real number in b has an independent Gaussian prior.
  b ~ normal(0, 2);

  // Observed times-to-event are sampled from an exponential distribution with rate exp(x * b).
  // Note! This is equivalent to writing "target += exponential_lpdf(t | exp(x * b));"
  // The probability density function (pdf) gives the density of "this exact value".
  t ~ exponential(exp(x * b));

  // Censored observations must contribute to the joint density too (otherwise the estimates are biased).
  // The complementary cumulative distribution function (ccdf) gives the "higher than" probability.
  target += exponential_lccdf(t_cens | exp(x_cens * b));

  // "target" is a special Stan keyword. It refers to the joint density of parameters
  // given the data (posterior). It's literally incremented (+=) during MCMC.
}
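If you later want to check this model against the data (the evaluation step I'm skipping in this post), one option – just a sketch, not part of the original example – is to append a generated quantities block that draws posterior predictive event times for the uncensored units and lets you compare them to the observed t:

generated quantities {
  // Posterior predictive draws of time-to-event for the uncensored units.
  array[N] real t_pred;
  for (n in 1:N) {
    // x[n] is the n-th row of the design matrix; exp(x[n] * b) is the same rate as in the model block.
    t_pred[n] = exponential_rng(exp(x[n] * b));
  }
}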

4. Gaussian process regression

The next data set will be just a bunch of observations of two continuous variables – with a pretty complex twist.

{
  "N": 100,
  "x": [0.1,-0.5,0.4,...],
  "y": [0.1,-0.5,0.4,...]
}

Let’s assume that y depends on x in a smooth, flexible, local manner. Instead of selecting a specific functional form, we can think of a distribution over smooth functions – specifically, functions whose values at the input points have a multivariate Gaussian distribution. This is roughly what Gaussian processes assume.
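The covariance function used below is the exponentiated quadratic kernel, k(x_i, x_j) = disp^2 * exp(-(x_i - x_j)^2 / (2 * len^2)): inputs that are close in x get highly correlated function values, and the correlation decays with distance at a rate set by len.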

data {
  // N is an integer from 1 to infinity.
  int<lower=1> N;
  // x is an array of length N of real numbers.
  array[N] real x;
  // y is a vector of length N.
  vector[N] y;
}

// In transformed data blocks you define data that doesn't come as input.
transformed data {
  // delta is a very small value added to the covariance diagonal to keep it numerically positive definite.
  real delta = 1e-9;
}

parameters {
  // This model doesn't have many easily interpretable parameters (like linear coefficients).
  // You need to compare some predictions of interest.
  real<lower=0> len;
  real<lower=0> disp;
  real<lower=0> noise;
  // eta is a vector of length N used for generating the random functions.
  vector[N] eta;
}

model {
  // K is a symmetric matrix depending on exponentiated squared 
  // distances between all pairs of values in x.
  // This gives the local shape of the functions. Close inputs, close outputs.
  matrix[N, N] K = gp_exp_quad_cov(x, disp, len);

  // Two parameters further control the shape of the functions.
  len ~ inv_gamma(5, 5); // Roughly the scale of distances between points.
  disp ~ std_normal(); // Roughly the scale of output values.

  // For numerical stability, a tiny jitter is added to the diagonal (otherwise the Cholesky decomposition below can fail).
  for (n in 1:N) {
    K[n, n] = K[n, n] + delta;
  }

  // Also for technical reasons, it is better to use a different 
  // representation of K called its Cholesky decomposition labelled L.
  matrix[N, N] L_K;
  L_K = cholesky_decompose(K);

  // With L representing K, we can generate a function as points.
  // So f is a vector of length N (the random function).
  vector[N] f;
  // eta is a vector of length N from Normal(0,1).
  eta ~ std_normal();
  // Equivalent to "f ~ multi_normal(0, K)".
  f = L_K * eta;

  // And finally y is a random sample from a Gaussian around the generated function.
  noise ~ std_normal();
  y ~ normal(f, noise);
}

5. Decision analysis

Probabilistic models are great for describing systems with uncertainty. But sometimes you actually need to make a decision based on the probabilities. In this case you can use a utility function.

Data consists of observations of the cost and duration of trips made by different modes of transportation.

{
  "N": 100,
  "mode": [1,2,4,3,2,1,3,2,...],
  "cost": [2,14,55,2,12,34,5,...],
  "time": [12,45,23,56,12,...]
}

We assume that different modes of transport have independent cost and time distributions. Both cost and time must be larger than 0 and tend to be around an average but with some rare high values.

// In the functions block, you define any custom function used elsewhere.
functions {
  // Type of the output value first. And don't forget to type the arguments.
  real utility(real cost, real time) {
    // Time is "priced" at 25 units per hour.
    return -(cost + 25 * time);
  }
}

data {
  int<lower=0> N;
  // Mode of transport is categorical; 1-4 are just labels and used as indices.
  array[N] int<lower=1, upper=4> mode;
  array[N] real cost;
  array[N] real<lower=0> time;
}

parameters {
  // Time.
  vector[4] mean_time;
  vector<lower=0>[4] var_time;
  // Cost.
  array[4] real mean_cost;
  array[4] real<lower=0> var_cost;
}

model {
  // Time is generated from a transport-specific Log Gaussian distribution.
  mean_time ~ normal(0, 1);
  // Remember that mean_time is vector[4] so 4 values are sampled.
  var_time ~ lognormal(0, 0.25);
  // Indexing with the array 'mode' picks the element for each observation (multi-indexing).
  time ~ lognormal(mean_time[mode], var_time[mode]);

  // Cost is generated from a mode-specific Log Gaussian distribution.
  mean_cost ~ normal(0, 20);
  var_cost ~ lognormal(0, 0.25);
  cost ~ lognormal(mean_cost[mode], var_cost[mode]); 
}

// In a generated quantities block, you produce outputs that are not parameters
// and not part of the model – using the parameter posteriors.
generated quantities {
  // Calculate utility of each mode of transport.
  array[4] real utilities;

  for (k in 1:4) {

    utilities[k] = utility(
      // Generate costs and times from the posterior
      // to get the posteriors of utilities.
      lognormal_rng(mean_cost[k], var_cost[k]),
      lognormal_rng(mean_time[k], var_time[k])
    );

    // To make a decision, you compare the utility distributions in some way.
  }
}
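The comparison itself isn't part of the example, but as a sketch you could also record, inside the same generated quantities block, which mode maximizes utility in each draw. The posterior frequency of each value then approximates the probability that that mode is the best choice.

  // Hypothetical addition inside generated quantities, after the loop above.
  int<lower=1, upper=4> best_mode = 1;
  for (k in 2:4) {
    if (utilities[k] > utilities[best_mode]) {
      best_mode = k;
    }
  }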

6. Multilevel regression and poststratification

An efficient way to collect data is to take random samples within subgroups that have less variation than the whole population of interest. The problem is that this kind of data is not representative of the population. And even if one doesn't collect data like this on purpose, the data might already be unrepresentative of some target population of interest. So how do we “transport” such results back to the population level?
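The poststratification part is just a weighted average: if p_g is the model's estimated probability of y=1 in subgroup g and N_g is that subgroup's size in the target population, the population-level estimate is the sum over g of N_g * p_g, divided by the sum of the N_g. That's exactly what the generated quantities block at the end of this model computes.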

The next dataset has 0/1-values within subgroups by three variables – area, age, and income – and is not representative. But the data also includes the number of people in each of those subgroups in the target population.

{
  "N": 10000,
  "y": [1,1,1,0,0,1,0,1,...],
  "area": [1,1,1,1,1,...2,2,2,2,...],
  "age": [1,2,3,1,4,2,3,1,...],
  "income": [5,3,2,4,1,5,2,3,5,1,...],
  "P": [[[2,3,4,2,6,4,6,4,6,2,4,6,...],
         [2,3,4,...],
         [2,4,3,...],
         [5,4,3,...],
         [3,4,5,...]], 
        [...], 
        [...], 
        [...]]
}

data {
  int<lower=0> N;
  array[N] int<lower=1, upper=50> area;
  array[N] int<lower=1, upper=4> age;
  array[N] int<lower=1, upper=5> income;
  array[N] int<lower=0> y;
  // P is a 3-dimensional array of non-negative integers (population counts per subgroup).
  array[4, 5, 50] int<lower=0> P;
}

parameters {
  real alpha;

  real<lower=0> sigma_b;
  // "multiplier" and "offset" can be used to declare 
  // a non-centered parametrization for the distribution.
  // This is good in hierarchical models for technical computational reasons.
  vector<multiplier=sigma_b>[4] b;

  real<lower=0> sigma_gamma;
  vector<multiplier=sigma_gamma>[5] gamma;

  real<lower=0> sigma_delta;
  vector<multiplier=sigma_delta>[50] delta;
}

// Note that the order of sampling statements doesn't matter.
model {
  // Each level of each variable adds its own weight, adjusting a global log-odds (alpha).
  y ~ bernoulli_logit(alpha + b[age] + gamma[income] + delta[area]);
  // Note that "bernoulli(inv_logit(...))" is equivalent but less performant.

  alpha ~ normal(0, 2);
  // But the weights of each variable are similar, coming from the same Gaussian.
  b ~ normal(0, sigma_b);
  gamma ~ normal(0, sigma_gamma);
  delta ~ normal(0, sigma_delta);

  // { ... } allows compact assignment and sampling statements
  // when the right-hand side is the same for all variables.
  { sigma_b, sigma_gamma, sigma_delta } ~ normal(0, 1);

  // It's possible to add a soft sum-to-zero constraint to the weights, like
  // "sum(b) ~ normal(0, 0.001)". This can further ease non-identifiability,
  // in which wildly different parameters lead to exactly the same likelihood.
}

// The model above can predict the probability of y=1 for all subgroups.
// To get the probability for the whole population, you need to
// count the expected number of y=1 in every target-population subgroup and divide this
// by the total target-population count.
generated quantities {
  real expected_y = 0;
  int total = 0;
  int<lower=0> N_P;

  // Loop over all combinations of values of age, income, and area.
  for (i in 1:4) {
    for (j in 1:5) {
      for (k in 1:50) {

        // Access the size of the subgroup in the target population from the data.
        N_P = P[i, j, k];
        // The logistic function returns the probability like in the model.
        expected_y += N_P * inv_logit(alpha + b[i] + gamma[j] + delta[k]);
        // Note that this doesn't include bernoulli_rng like in posterior
        // predictive distributions. We just care about the mean/expectation.
        total += N_P;

      }
    }
  }

  // Finally, the expected population probability.
  real<lower=0, upper=1> pop_prob = expected_y / total;
}

7. G-formula

The g-formula is a valid method for causal inference from time-varying data. Many causal methods are taught and developed using frequentist models, and sometimes making them fully probabilistic is a bit questionable (like inverse probability weighting). But the g-formula is straightforward.

Everything here is based on a paper by Oganisian and Roy and their code on GitHub.

The data consists of a cause A over time (five time-points), a confounder L over time, and an outcome Y at the end. (We’ll ignore the confounding details for now; just know that this method and example require specific kinds of causal assumptions.)
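For a single time point, the g-formula is plain standardization: the mean of Y under treatment a is the sum over confounder values l of E[Y | A = a, L = l] * P(L = l). With time-varying treatments and confounders it generalizes by averaging over the confounder distributions sequentially, which is what the generated quantities block below does by simulating L forward in time under each treatment plan.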

{
  "N": 1234,
  "Nt": 5,
  "A": [[1,2,3,4,5], ...],
  "L": [[1,2,3,4,5], ...],
  "Y" [1,2,3,4,5,6,6,7,8,...],
  "n_coeffs": [],
  // Number of coefficients needed for all confounder models.
  // All of them will be stored in the same vector.
  "n_coeffs": ...,
  // Start and end indeces indicating which coefficient belongs
  // to which model and time-point. We'll ignore the specifics of this.
  "st": ...,
  "ed": ...
}

I’ll describe the model in the comments now.

data {
  int N;
  int Nt;
  
  matrix[N, Nt] A;
  matrix[N, Nt] L;
  vector[N] Y;

  int n_coeffs;
  // Coefficients for many models will be stored in a single vector.
  // To set and get the right ones, you need the right indices.
  // In this code these indices have been calculated in advance.
  // (This is definitely a slight con for Stan.)
  int st[Nt-1];
  int ed[Nt-1];
}

parameters {
  // Confounder is a linear function of treatment-confounder history.
  // We want the parameters to predict counterfactuals. 
  vector[Nt] int_l; // Intercepts.
  vector[n_coeffs] b_Ll; // Confounder history coefficients.
  vector[n_coeffs] b_Al; // Treatment history coefficients.
  
  // Outcome is a linear function of treatment-confounder history.
  // We want the parameters to predict counterfactuals.
  real int_y;
  vector[Nt] b_Ly;
  vector[Nt] b_Ay;
  
  // Note that the standard deviations are declared positive here.
  // In the model, they are given priors that would otherwise allow negative values.
  real<lower=0> var_L1; 
  real<lower=0> var_Lt; 
  real<lower=0> var_y; 
}

model {
  
  // Outcome Y is generated by a simple linear model where
  // confounder and treatment value at every time-point
  // are included as independent variables.
  Y ~ normal(int_y + L*b_Ly + A*b_Ay, var_y);
 
  int_y ~ normal(0, 0.5);
  // Note that all time points get the same prior.
  b_Ly ~ normal(0, 0.5);
  b_Ay ~ normal(0, 0.5);
  var_y  ~ cauchy(0, 2);

  // Exchangeability for the treatment needs to hold at every time-point.
  // So we need to model the confounders at every time point.

  // The first confounder value in column 1 is also modelled (simply as Gaussian).
  L[,1] ~ normal(int_l[1], var_L1);
  int_l ~ normal(0, 1);
  // Cauchy has a taller peak and fatter tails than the Gaussian.
  // Fun fact: Cauchy doesn't obey the law of large numbers.
  var_L1 ~ cauchy(0, 2);

  // The future confounder values are modelled similarly but adjusting for histories.
  vector[N] mean_Lt; // One expected value per unit, so this must be a vector.
  var_Lt ~ cauchy(0, 2);

  for (t in 2:Nt) {

    mean_Lt = int_l[t] + 
              // Indices are used to access the correct data and parameters.
              L[, 1:(t-1)] * b_Ll[st[t-1]:ed[t-1]] + 
              A[,1:(t-1)] * b_Al[st[t-1]:ed[t-1]];
    L[ ,t] ~ normal(mean_Lt, var_Lt); 

  }


  // But now the weights are expected to be smaller further in the past.
  // This reasonable assumption automatically reduces computational issues.
  // This is common in Bayesian modelling. Bad assumptions lead to computational problems.
  for (t in 2:Nt) {

    // b loops from start-index to end-index of the parameter vector
    // at each time point, setting the prior to the right coefficient.
    for (b in st[t-1]:ed[t-1] ) {

     b_Ll[b] ~ normal(0, 1.5^(b - st[t-1]));
     b_Al[b] ~  normal(0, 1.5^(b - st[t-1]));

    }
  }
  
}

generated quantities {
  // Linear algebra has the notion of column and row vectors.
  // "vector" in Stan is a column vector.
  row_vector[Nt] L_pred1;
  row_vector[Nt] L_pred0;
  
  // Our causal effect estimate of interest is the absolute difference of means.
  real mean1;
  real mean0;

  // We need to simulate the whole model from time point 1 under alternative treatments.

  // So get the first confounder values from the Gaussian.
  L_pred1[1] = normal_rng(int_l[1], var_L1);

  // And the rest of the confounder values under treatment 1.
  for(t in 2:Nt){
    L_pred1[t] = normal_rng(
      // Again accessing the correct coefficients from the vectors.
      // (Treatment is all 1 here, so the treatment term reduces to a sum of coefficients.)
      int_l[t] + L_pred1[1:(t-1)] * b_Ll[st[t-1]:ed[t-1]] + sum(b_Al[st[t-1]:ed[t-1]]),
      var_Lt 
    ); 
  }
  // And finally the outcome model.
  // A is just all 1 here, so the treatment term is the sum of the coefficients.
  // (int_y is left out because it cancels in the difference mean1 - mean0.)
  mean1 = L_pred1 * b_Ly + sum(b_Ay);

  // The same for treatment set to 0.
  L_pred0[1] = normal_rng(int_l[1], var_L1);
  for(t in 2:Nt){
    L_pred0[t] = normal_rng(
      int_l[t] + L_pred0[1:(t-1)] * b_Ll[st[t-1]:ed[t-1]], 
      var_Lt
    ); 
  }
  // Treatment is 0 so multiplying with treatment coefficients just goes to 0.
  mean0 = L_pred0 * b_Ly;

}

8. State-based differential epidemics

The next dataset contains the number of officially reported cases over time (acknowledging underreporting) in an epidemic and the date at which control measures were put in place, as well as the number of former infections in a random sample of the population (based on antibodies in asymptomatic people).

Everything is based on code (and a paper) by Grinsztajn et al. (A really nice paper about iterative model development in Stan!)

{
  "N": 10000000,
  "n_days": 200,
  "cases": [12,45,135,345,789,1000,...],
  "t0": 1,
  "tswitch": 67,
  "ts": [1,2,3,4,5,...],
  "t_survey_start": 23,
  "t_survey_end": 34,
  "n_infected_survey": 126,
  "n_tested_survey": 500
}

The modelling assumptions are described within the program.

data {
  int N;
  int<lower=1> n_days;
  int cases[n_days];

  real t0;
  real tswitch;
  real ts[n_days];

  int t_survey_start;
  int t_survey_end;
  int n_infected_survey;
  int n_tested_survey;
}

// In the transformed data block you define fixed quantities that don't come as input data.
transformed data {
  // { thing } is an array of length 1.
  int x_i[1] = { N };
  real x_r[1] = { tswitch };
}

parameters {
  // Incubation, transmission, recovery rates.
  real<lower=0> a;
  real<lower=0> beta;
  real<lower=0> gamma;

  // Control measures.
  real<lower=0,upper=1> eta;
  real<lower=0> nu;
  real<lower=0,upper=1> xi_raw;

  // Initial state.
  real<lower=0> i0;
  real<lower=0> e0;

  // Underreporting.
  real<lower=0, upper=1> p_reported;

  // Overdispersion around predicted case count.
  real<lower=0> phi_inv; 
}

model {
  // Transmission, incubation, recovery.
  beta ~ normal(2, 1);
  a ~ normal(0.4, 0.5);
  gamma ~ normal(0.4, 0.5);

  // Overdispersion.
  phi_inv ~ exponential(5);

  // Initial state (exposed, infected).
  e0 ~ normal(0, 10);
  i0 ~ normal(0, 10);

  // Control measures parameters.
  eta ~ beta(2.5, 4);
  nu ~ exponential(1./5);
  xi_raw ~ beta(1, 1);

  // Underreporting of infected.
  p_reported ~ beta(1, 2);
  
  // Randomness around the expected new case count is modelled with
  // a negative binomial distribution. neg_binomial_2(mu, phi) is parametrized
  // by the mean mu and an overdispersion phi: the variance is mu + mu^2 / phi,
  // so it behaves like a Poisson with extra variance.
  cases[1:(n_days-1)] ~ neg_binomial_2(incidence, phi);

  // The substudy case count is basic binomial from a smaller number of tests.
  n_infected_survey ~ binomial(n_tested_survey, p_infected_survey);
}

// A good way to organize code in Stan is to do the sampling (~) in the model block
// and the deterministic model structure (=) in the transformed parameters block.
// (In the actual .stan file the blocks must appear in Stan's fixed order: functions,
// data, transformed data, parameters, transformed parameters, model, generated
// quantities. Here I present them in the order that is easiest to explain.)
transformed parameters{
  real xi = xi_raw + 0.5;

  // SEIR differential equation parameters.
  real theta[8];
  theta = {beta, gamma, a, eta, nu, xi, i0, e0};

  // y holds the expected SEIR-state counts of the population.
  // How many people 1. Susceptible, 2. Exposed, 3. Infected, 4. Recovered.
  real y[n_days, 4];
  // "sir" is the differential equation used to model the counts in SEIR-states.
  // Stan comes with programs that solve these equations – i.e. inputs in, outputs out.
  y = integrate_ode_rk45(sir, rep_array(0.0, 4), t0, ts, theta, x_r, x_i);

  // Expected number of reported new cases at each time point (the flow into the infected state, scaled by reporting).
  real incidence[n_days - 1];
  for (i in 1:n_days-1){
    incidence[i] = -(y[i+1, 2] - y[i, 2] + y[i+1, 1] - y[i, 1]) * p_reported; 
  }

  real<lower=0, upper=1> p_infected_survey;
  // The mean number of recovered people during the survey period, divided by N,
  // should match the probability of having antibodies in the antibody survey.
  p_infected_survey = mean(to_vector(y[t_survey_start:t_survey_end, 4])) / N;
  
  // Negative binomial dispersion parameter.
  real phi = 1.0 / phi_inv;
}

// The "sir" differential equation is a custom function, defined in the functions block.
functions {

  real[] sir(real t, real[] y, real[] theta, real[] x_r, int[] x_i) {

      // Transmission rate.
      real beta = theta[1];

      // Assume transmission rate depends on the control measures following a
      // logistic form over time with three parameters representing...
      // the effect when measures are 100% effective,
      real eta = theta[4];
      // the slope of the decrease,
      real nu = theta[5];
      // and the time delay until they are 50% effective.
      real xi = theta[6];
      // Also need the time point when the measures start.
      real tswitch = x_r[1];

      // And so the control measures change the transmission rate.
      real forcing_function = switch_eta(t, tswitch, eta, nu, xi);
      real beta_eff = beta * forcing_function;

      // For technical reasons, this differential function is set up 
      // for deviations from the baseline.
      int N = x_i[1];
      real i0 = theta[7];
      real e0 = theta[8];
      real init[4] = {N - i0 - e0, e0, i0, 0};
      // The predicted deviations are added to the fixed baseline. 
      real S = y[1] + init[1];
      real E = y[2] + init[2];
      real I = y[3] + init[3];
      real R = y[4] + init[4];

      // Incubation rate.
      real a = theta[3];
      // Recovery rate.
      real gamma = theta[2];
      
      // The SEIR model is very simple.
      // Change in...
      // susceptibles depends on population and transmission from infectious.
      real dS_dt = -beta_eff * I * S / N;
      // exposed depends on transmission from infectious and incubation to infectious.
      real dE_dt =  beta_eff * I * S / N - a * E;
      // infectious depends on incubation from exposed and recovery to recovered.
      real dI_dt = a * E - gamma * I;
      // recovered depends on recovery from infectious.
      real dR_dt = gamma * I;
      
      // { a, b } is an array expression which returns an array[4] value.
      return {dS_dt, dE_dt, dI_dt, dR_dt};
  }

  real switch_eta(real t, real t1, real eta, real nu, real xi) {
    return (eta + (1 - eta) / (1 + exp(xi * (t - t1 - nu))));
  }

}

// Many other interesting quantities can be estimated based on the model.
generated quantities {
  // The basic reproduction number is the transmission rate relative to the recovery rate.
  real R0 = beta / gamma;

  // Rates (1/time) as time.
  real recovery_time = 1 / gamma;
  real incubation_time = 1 / a;

  // Model-predicted number of cases over time.
  real pred_cases[n_days-1];
  pred_cases = neg_binomial_2_rng(incidence, phi);

  // Reproduction number over time under the control measures.
  real Reff[n_days]; 
  for (i in 1:n_days) {
    Reff[i] = switch_eta(i, tswitch, eta, nu, xi) * beta / gamma;
  }
}

9. Missing data

Bayesian modelling makes all kinds of tricky problems in data analysis easy. One of them is missing data, because in fully probabilistic models missing values are treated just like parameters.

Say we have data for a time series where some of the time points have missing values. For Stan, we need to input the observed values with their indices (time points), and to make things easier, we also input the indices that have missing data.

{
  "N_obs": 100,
  "N_mis": 50,
  "y_obs": [10,5,13,10,11,15,8,5,6,9,7,...],
  "i_obs": [1,5,8,12,13,14,15,23,24,25,...],
  "i_mis": [2,3,4,6,7,9,...]
}

We simply assume that the time series is a Gaussian random walk: each value is normally distributed around the previous value.

data {
  // Number of observed and missing values.
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  // Indices of observed and missing values.
  array[N_obs] int<lower=1, upper=N_obs + N_mis> ii_obs;
  array[N_mis] int<lower=1, upper=N_obs + N_mis> ii_mis;
  // Observed values.
  array[N_obs] real y_obs;
}

transformed data {
  // Number of time-points or potential values.
  int<lower=0> N = N_obs + N_mis;
}

parameters {
  // Missing data is just a parameter in Bayesian models.
  array[N_mis] real y_mis;

  real<lower=0> sigma;
}

transformed parameters {
  // The full y is just a mix of observed values and parameters.
  array[N] real y;
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
}

model {
  // The simple model is unaffected by missing values.
  // So missingness is assumed completely random.
  sigma ~ gamma(1, 1);
  y[1] ~ normal(0, 100);
  y[2:N] ~ normal(y[1:(N - 1)], sigma);
}

Just to make this clearer, let’s take another model written by Vincent Arel-Bundock based on the fantastic book Statistical Rethinking (2nd edition) by Richard McElreath.

data {
    int n;
    vector[n] K;
    vector[n] M;
    // In this case the missing values come in with the observed ones
    // encoded as positive infinities. Stan accepts this too. 
    vector[n] B;
}

parameters {
    // Model parameters.
    real mean_B;
    real a;
    real bB;
    real bM;
    real<lower=0> sd_B;
    real<lower=0> sd_K;

    // But again we handle the missing values as parameters.
    vector[n] B_miss;
}

transformed parameters {
    // Again combine observed values and parameters back into a single vector.
    vector[n] B_merge;
    for (i in 1:n) {
        // If encoded as missing, take the parameter (placeholder) from B_miss[i].
        if (B[i] == positive_infinity()) {
            B_merge[i] = B_miss[i];
        // If not, take the observed value from B[i].
        } else {
            B_merge[i] = B[i];
        }
    }
}

model {
    // Sampling statements like this are vectorized too. DRY.
    [sd_K, sd_B] ~ exponential(1);
    [a, mean_B, bB, bM] ~ normal(0, 0.5);

    // B contains parameters and observed values so it's modelled.
    B_merge ~ normal(mean_B, sd_B);

    vector[n] mean_K;
    // But the M variable is not modelled which is common in regression.
    mean_K = a + bB * B_merge + bM * M;

    K ~ normal(mean_K, sd_K);
}

10. Compositional outcome

Compositional variables are quite special. You need to take into account that they completely depend on one another: if one proportion increases, the others need to decrease. They can also have an unknown scale, meaning that their absolute values sum to some unknown number (for a principled take on the scale issue, see Scale Reliant Inference).

It seems that a fairly sound path is to model the compositions as a multivariate distribution with some covariation and, even more importantly, to transform the raw proportions into log-odds ratios against a reference.
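Concretely, if p_1, ..., p_K, p_ref are the vote shares with one party chosen as the reference, the transformed outcomes are y_k = log(p_k / p_ref) for k = 1, ..., K. That's why the outcome below has K components for K + 1 parties, and why each component is an unconstrained real number that we can model with a multivariate distribution.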

The next imaginary dataset consists of proportions of votes that different parties got in a bunch of subregions of a country, and some variables that may or may not predict the vote composition of that region.

The code is based on Simon Jackman’s Stan translation of an earlier BUGS example.

data {
  // Number of regions.
  int N;
  // Number of parties minus one (after the log-odds ratio transformation).
  int K;
  // Multivariate outcome variable (log-odds ratios).
  // y is an array (length N) of vectors (length K).
  vector[K] y[N];

  // Number of predictors.
  int P;
  // X is an array (length N) of vectors (length P).
  vector[P] X[N];
}

// You can also read the code top-down; here we go from the outcome model to the details.
// (Note that in the actual .stan file, Stan requires the blocks in its fixed order and
// variables to be declared before use, so the parameters block at the bottom would come
// first, and mu and Sigma would be defined before the sampling statement that uses them.)
model {
  // Vote composition comes from multivariate Student's T distribution with covariance.
  // Log-odds ratios can have weird fat-tailed distribution. This is a flexible assumption.
  y ~ multi_student_t(nu, mu, Sigma);

  // Degrees of freedom controls tail-fatness.
  nu ~ gamma(2, 0.1);

  // mu, the expected value of each log-odds ratio, depends linearly on predictors.
  vector[K] mu[N];
  for (i in 1:N) {
    for (k in 1:K) {

      mu[i, k] = alpha[k] + dot_product(X[i], beta[k]);

    }
  }

  // And the linear model parameters get some priors (vectorized).
  alpha ~ normal(0, 10);
  beta ~ normal(0, 2.5);

  // Finally, the scale of the multivariate t-distribution is a covariance matrix.
  matrix[K, K] Sigma;
  // Built up from a Cholesky factor (Sigma = L L') – better for computational performance.
  Sigma = multiply_lower_tri_self_transpose(diag_pre_multiply(Sigma_scale, Sigma_corr_L));
  // Prior on the diagonals (variance) and Cholesky factor of Sigma.
  Sigma_corr_L ~ lkj_corr_cholesky(5);
  Sigma_scale ~ cauchy(0.0, 2);
}

// We can also output the correlation matrix by reconstruction.
generated quantities {
   corr_matrix[K] S;
   // Correlation matrix S is LL' where L is its Cholesky factor.
   S = multiply_lower_tri_self_transpose(Sigma_corr_L);
}

// And finally, all the model parameters whose posteriors we want to estimate.
parameters {
  vector[K] alpha;
  vector[P] beta[K];
  cholesky_factor_corr[K] Sigma_corr_L;
  vector<lower=0>[K] Sigma_scale;
  real<lower=0> nu;
}