Statistical Notation in English in 23 Examples

Published: 2024-07-18

Mathematical notation can vary across fields, authors, and papers. Here I'm using various sources on causal inference and Bayesian methods as well as basic Wikipedia pages.

Understanding statistics papers undoubtably requires years of mathematical training. I have effectively zero but I'd still want to be able to read some of the mathematical language used to describe statistical (probabilistic) methods for data analysis. So here's a list of examples that can teach you the basics.

1

\[ Y, A, L \]

Unknown values that are generated in an unpredictable way (random variables are uppercase – yes the actual mathematical definition is not like this but this should be good enough). Random variables are sampled to get new values.

2

\[ Y^{a=1} = 1 \]

Y would generate 1, if the value of A was set to be 1. Lowercase letters refer to known, already generated, realized, observed, fixed values. Roman letters are usually for data and Greek for parameters in models. The superscript notation here is counterfactual and not standard statistics. But counterfactual language is so important in a lot of science so I include it in this post.

3

\[ A_{i} = a, Y^{a}_{i} = Y^{A}_{i} = Y_{i} \]

For the observational unit identified with index i, A generates some fixed value a (like 1 or 0). In this case, Y generates the same value for the unit i in three different conditions:

when A is counterfactually set to be the generated value a
when A is counterfactually set to be the observed value from A (the same)
when Y is merely observed without doing anything to A

This is a definition of consistency in causal inference. In short, the factual observations reveal the outcome in one hypothetical (counterfactual) situation. But once the one is revealed, the other ones are forever missing – you can't treat the same person with all alternative treatments at the same time.

4

\[ E[Y^{a=1}] \ne E[Y^{a=0}] \]

The expected value of the values generated by Y when the values of A are set to 1, is not equal to the expected value when it's set to 0. The expected value is the mean, or center of mass, of the probability distribution of Y. Probability distributions tell the probability of all possible values (mass distribution) or intervals of values (density distribution) of the variable. Expected value does not mean the most probable value (the mode) nor the middle value (median).

5

\[ \frac{P(Y^{a=1}=1)}{P(Y^{a=0}=1)} \]

The ratio of 1) the probability that Y generates 1 given A is set to 1 and 2) the probability that Y generates 1 given A is set to 0. This is the causal risk ratio.

6

\[ \sum_{x}{p_{X}(x)} = 1 \]

The sum of the probabilities of all possible values of x is 1. The probability mass function (pmf) is sometimes shortened as just p(x), which gives the probability that X generates any specific value x, that is, P(X = x).

7

\[ \Pr [a \le X \le b] = \int_a^b f_X(x) \, dx \]

The probability that X generates a value between a and b (note the notation Pr is the same as P) is the sum (integral) of probabilities that x falls in any of the infinitely small intervals between a and b. The probability density function (pdf) is sometimes shortened as just f(x).

8

\[ F_X(x) = \int_{-\infty}^x f_X(u) \, du = \operatorname{P}(X\leq x) \]

The cumulative distribution function (cdf) of values generated by X is the cumulative sum (integral) of probability density from minus infinity to any x, or in other words a sum of probabilities that values fall on any of the infinitely small intervals between minus infinity and x. This is in other words the probability that X generates values that are equal or smaller than any given x of interest.

9

\[ f(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 } \]

This is the probability density function (pdf) of a Gaussian distribution. Semicolon means that mu and sigma are parameters of this function while x is the input. What is the difference? Not much. To get a pdf for x you first need to plug in some values to the parameters. Mu is the expected value or mean of the distribution and sigma is the standard deviation of the distribution.

Again, Roman letters are usually used for observed quantities (in the dataset) and Greek letters are used for parameters (in the model). In Bayesian statistics, the difference is not big since all of them are random variables and "data variables" can be even completely unobserved like parameters.

10

\[ \sum_{l} P(Y=1|L=l,A=1) \times P(L=l) \]

You have three variables Y, A, and L. For each possible value of L, get the probability that Y generated 1 given that (or conditional on that) L generated l and A generated 1, and multiple this probability with the overall probability that L generated l. Then sum these products together. This is a weighted average of conditional probabilities: the conditional probabilities with unlikely values of L get small weights and vice versa.

This is also called standardization of L (the observed distribution of L being the standard here), or marginalization over L (think of the margins of table containing sums). The resulting average is the conditional probability that would be observed if the distribution of L in the subgroup A=1 was the same as the distribution of L in the whole population (with any value of A).

Conditional probabilities that have been standardized to the same distribution are hence comparable with respect to that variable. Note that this is not counterfactual in any way: the vertical line | is used for factual conditions referring to the subset of observations where the condition is true, while the superscript notation is used for counterfactual conditions (interventions) referring to hypothetical observations where the condition is true for everyone (no subsets).

11

\[ \frac{P(Y=1) - P(Y^{a=0}=1)}{P(Y=1)} \]

Take the difference of 1) the probability that Y generated 1 in the dataset and 2) the probability that Y would have generated 1 if A was set to 0 for everyone counterfactually. This leaves you the probability that Y generated 1 because A was 1. Then take the ratio with the original, overall probability to see the proportion of probability of Y caused by A in this dataset. This is called the attributable fraction.

12

\[ E[Y^{a} - Y^{a=0}|A=a, L] = \beta_1 a + \beta_2 a L \]

The expected value of the differences of values from 1) Y if A was set to a value of interest a and 2) Y when A is set to the value 0 – given that A is observed a and L is its observed value (in a subset) – is equal to this particular linear combination of the a of interest and the observed value from L. This is called a structural nested mean model and it can be estimated (fitted, learned) using a procedure called G-estimation.

13

\[ P(Y_k = 0) = \prod_{m = 1}^{k} P(Y_m = 0 | Y_{m-1} = 0) \]

The probability that Y generates 0 up to the time point k is equal to the product of probabilities (from 1 up to k) that it's 0 at each time point m – given that it was also 0 in the previous time point m - 1. This happens when Y can be 1 only once over time. Only the subset of units that didn't have the event before can be used to calculate the probability at any particular time point. The risk at a specific time point is called the hazard. This expression for all time points k is called the survival function (since the event is often death in health science).

14

\[ logit Pr[Y_{k+1} = 1|Y_{k}=0, A] = \beta_0 + \beta_1 k + \beta_3 A + \beta_4 A k \]

The logit of the probability that Y generated 1 at the next time point k + 1 – given that Y was 0 at the this time point k and A had its observed value – is equal to the given linear combination of the previous time point k and the observed value of A. This is a logistic regression model of the hazard.

It can be fit to tidy data with one row for each unit at each time point (so called person-time format or long format). The condition Yk=0 is fulfilled by the fact that the data contains observations only for the surviving units at time point k. Finally, the survival function could be predicted using the fitted model for each value of A (0 and 1) and compared in any way of interest.

15

\[ Pr[Y^a_{k+1} = 0] = \sum_l \prod_{m=0}^k Pr[Y_{m+1}|Y_m = 0, L = l, A = a] \times Pr[L = l] \]

The probability of survival without an event up to the time point k + 1, if A was counterfactually set to some value a for everyone, can be estimated by standardizing a survival curve (given some fixed a) to the distribution of L (as a weighted average) and the survival curve can be estimated by multiplying the hazards up to that time point. Of course, L needs to be a valid confounder. The logistic model above can be used to get the hazard estimates when L is added as a predictor.

16

\[ E[Y|A] = \theta_0 + \theta_1 A \] \[ E[Y^a] = \alpha_0 + \alpha_1 a \] \[ Y^a \perp\!\!\!\perp A \] \[ \hat{\theta_1} = \hat{\alpha_1} \]

Let's say that the expected value of values from Y given the observed values from A (in subsets) are correctly modelled with the given linear function of A. Similarly, the expected value of Y if A was set to a for everyone counterfactually is modelled by a similar function (so-called marginal structural model). In this case, if we assume that the counterfactual outcome Y are independent of the factual/observed A (exchangeability), an estimate of the parameter theta also estimates the parameter alpha.

In addition to exchangeability, one also needs to assume consistency, positivity, and no other sources of biases.

16

\[ E[Y^{n=0, o=1}] = \sum_m E[Y|A=1, M=m] \times Pr[M=m|A=0] \]

For this expression you need to know that N and O are separable components of a treatment A so that M mediates the effect of N and O mediates a direct effect on Y. (In addition, assume positivity, consistency, exchangeability, and so on.)

In such a case, read: the expected value of Y – if the component N was set to 0 for all and the component O was set to 1 for all – can be identified by standardizing the expected value of Y in the observed subset A=1 to the observed distribution of M in the observed subset A=0 (as a weighted average).

17

\[ P(A|B) \propto P(B|A) \times P(A) = P(A,B) \] \[ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} \] \[ P(B) = P(B|A) P(A) + P(B|\not A) P(\not A) = \sum_j P(B|A_j) P(A_j) = E[P(B|A)] \]

Here A and B are events (like binary variables so that A means A=1). The probability of A conditional on B is proportional to the probability of B conditional on A multiplied by the probability of A which is just the (joint) probability both of them. If you count all the ways that A can happen, and then for all of those one-by-one how B can happen after, that's multiplication, and covers the same events as A and B both happening.

This is understandable since we're interested in counting all the ways that A can happen in the subset (conditional on) where B happens: first you could count all the ways that B can happen, and then see what's the proportion of ways where both B and A happened. The second expression is the Bayes' theorem.

You don't actually need to count B alone if you know the joint probability expression above the line. You can count all the ways that B can happen by adding ways that B can happen with A and without A – this covers all the cases so A can be dropped out. In other words, you can take a weighted average of the probability of B, or in general the expectation of the conditional probability of B. This is also called marginalizing over A.

18

\[ p(\theta|y) \propto p(y|\theta) p(\theta) = p(\theta, y) \] \[ p(\theta|y) = \frac{p(y|\theta) p(\theta)}{p(y)} \] \[ p(y) = \int p(y|\theta) p(\theta) \, d\theta = E[p(y|\theta] \]

These are the same expressions as above but in different notation. Here y refers to data and theta refers to parameters. Again, integrals are like sums of infinitely small intervals of values rather than the values itself. Bayes' theorem can be used to learn the probability of parameters conditional on data (posterior). The only ingredients needed are the probability of parameters (priors) and the probability of data given the parameters (likelihood).

19

\[ Y \sim Binomial(N,p) \] \[ p \sim Uniform(0,1) \] \[ N = 100 \] \[ Binomial(k,n,p) = \Pr(X = k) = \binom{n}{k}p^k(1-p)^{n-k} \] \[ Uniform(\alpha,\beta) = \frac{1}{\beta-\alpha} \] \[ Pr(p|Y,N) = \frac{Binomial(Y|N,p) \times Uniform(p|0,1)}{\int Binomial(Y|N,p) \times Uniform(p|0,1) \, dp} \]

Y is a count (of successes) generated randomly from a binomial distribution with N total attempts and p probability of adding one to the count (likelihood). p is generated randomly from a uniform distribution between 0 and 1 (prior). N is a fixed count. Binomial is a probability distribution for a process where the probability of success is p and you try N times and count the number of successes k – it can tell the probability of getting exactly k successes and so it can tell the probability of getting any value of k given some p and N. From the probability distribution for k, it's possible to sample a random value of k. According to the model, this is how the observed count Y was generated.

The four lines define a Bayesian model for the count data Y with one parameter p. The fifth line shows again the Bayes' theorem – now including the model above.

A final note on this: how is the posterior probability actually calculated? In practice, you never work with the posterior distribution function. It's far easier to work with samples from the posterior distribution. The most popular method to get samples from the posterior is to use Hamiltonion Monte Carlo algorithms. They work like this:

Imagine all possible combinations of values of data and parameters as a landscape with hills and valleys. The elevation of this landscape is the joint probability of data and parameters from Bayes' theorem, except inverted. The lowest points in the landscape represent the most probable combinations of data and parameters given the model. Since the data is fixed, we can only move in directions of the parameters.
We want to jump around in the parameter landscape in a way where we write down a record of our coordinates so that the frequencies are proportional to the joint probability: most records (samples) in our book correspond to the most probable values and only a few records are from the least probable areas.
So we can imagine taking a ball, setting it to some reasonable first-guess, and then pushing it to some direction. The ball will start rolling with some kinetic and potential energy in the landscape. (Infact Hamiltonian refers to the formulation of classical mechanics in physics describing the laws of motion from a perspective of momentum, trajectories, and the total energy of a system.)
We let the ball roll for a certain time or we can stop it given some condition. The current best algorithms avoid paths that just lead to the point where you started, so-called NUTS, no-U-turn samplers. When the journey is stopped, we can either accept or reject the new coordinates. Then, starting from the last recorded coordinate, the ball can be pushed again to fetch the next proposed coordinates.
If everything goes as planned, the sample coordinates recorded are independent samples from the posterior, mapping the correct density of the posterior. The series of samples is called a chain, and when the series is independent, the chains are said to be mixing well. You can check this visually by plotting the samples in order for each parameter.

20

\[ D_{KL}(p,q) = \sum_i p_i (log(p_i) - log(q_i)) = E[log(p)] - E[log(q)] \]

The K-L divergence of two probability distributions describes an average difference of their log-probabilities. Average log-probability is also called entropy which is a measure of uncertainty in the distribution – it grows as the distribution becomes wider and flatter. Divergence can be also seen as the additional entropy (uncertainty) introduced by using the candidate distribution q in addition to using the true distribution p.

21

\[ lppd(y, \Theta) = \sum_i log \times \frac{1}{S} \times \sum_s p(y_i|\theta_s) \]

The log-pointwise-predictive-density is the sum of the log-probabilities of each observation, using the average probability over the posterior samples. It is the main piece in different measures of predictive accuracy – high log-probability means that the model is saying that the observations were highly likely, or in other words, the model would be likely to predict values at the same area as was observed. The difference of lppd-based accuracies of two models approximates the K-L divergence of their predictive probability distributions, that is, the additional uncertainty (about the predicted value) introduced by using the less accurate model.

One mandatory warning about prediction accuracy: you should never take predictive accuracy on the training data too seriously – after all, the model already saw the data so it could just memorize them one-by-one in the extreme case. Methods like PSIS can approximate the lppd that one would get using cross-validation (training on a random part of the data and predicting on the remaining unseen part, test data).

22

\[ Y_i \sim Poisson(\lambda_i) \] \[ \lambda_i = exp(\kappa_{L_i}) \times \alpha A_i^\beta / \gamma \] \[ \mathbf{\kappa} \sim MVNormal(\mathbf{0}, \mathbf{\Kappa}) \] \[ \mathbf{\Kappa}_{ij} = \eta^2 exp(-\rho^2D^2_{ij} + \delta_{ij}(0.01) \] \[ \alpha, \beta, \gamma \sim Exponential(1) \] \[ \eta^2 \sim Exponential(2) \] \[ \rho^2 \sim Exponential(0.5) \]

We have observed a set of counts, each of which is generated from a Poisson distribution with its own event rate lambda. These event rates depend on the variables L and A in the described way (as well as on D deeper in the model). L is special here; it is a class with as many unique values as there are observations (so basically an ID). The model for lambda (event rate) has a separate intercept kappa for each observation. These intercepts kappa need to be modelled as well; here they are assumed to be generated from a zero-centered multivariate Gaussian distribution so that they can be correlated. The covariance of each intercept pair depends on the distance D between their corresponding observed units (this could be in any continuous variable, like literal distance in kilometers in this particular example from the Statistical Rethinking book).

The multivariate Gaussian including a so-called kernel function makes this model a Gaussian process. The effect of using a Gaussian process like this can be thought of as allowing the model to fit a distribution of smooth functions. It can be also thought of as allowing clustering by a continuous variable.

23

\[ \operatorname{E}(\mathbf{Y}\mid\mathbf{X}) = g^{-1}(\mathbf{X}\boldsymbol{\beta}) \]

The expected value of values generated by a random matrix Y – given values generated by a random matrix X – is equal to a matrix multiplication of a matrix of coefficients beta and the matrix X, transformed by an inverse function of some link function g. This is a generalized linear model in general matrix form.

24

To be continued...?

CC BY-SA 4.0 Eero Teppo. Last modified: March 23, 2025. Website built with Franklin.jl and the Julia programming language.