
Matrices To The Rescue

Published: 2024-07-22
Updated: 2024-07-22

When deep learning started to become popular, the number of people scratching their heads over dot products and tensors exploded. At first it feels more natural to think of numbers one by one and operate in loops, but matrix math is just so efficient that you want to use it.

In data analysis, you usually want to think of your tidy table as being generated by random variables. You want to build models of those variables — a model of the whole process that could’ve generated the values that you have observed. You use the model to predict, and you compare predictions across real or hypothetical observations of interest. This is analysis, “the act of studying or examining something in detail, in order to discover or understand more about it, or your opinion and judgment after doing this”.

So, thinking in loops, you might take your first row of data and say that it was generated in some simple way looking like

Y_{i=1} = \beta_0 + \beta_1 \times X_{1_{i=1}} + ... + \beta_p \times X_{p_{i=1}} + \epsilon_{i=1}

The second row was generated…

Y_{i=2} = \beta_0 + \beta_1 \times X_{1_{i=2}} + ... + \beta_p \times X_{p_{i=2}} + \epsilon_{i=2}

And so on.

Matrix representation

But we should back up and look at what we have. First, Y is a list of numbers, so think of it as a single thing: a matrix generated by a matrix of random variables (a random matrix)

\mathbf{Y} = \begin{bmatrix} Y_{i=1} \\ Y_{i=2} \\ \vdots \\ Y_{i=n} \end{bmatrix}

Next, the variables from X_1 to X_p are all similar matrices, and together form

\mathbf{X} = \begin{bmatrix} X_{1_{i=1}} & \cdots & X_{p_{i=1}} \\ X_{1_{i=2}} & \cdots & X_{p_{i=2}} \\ \vdots & \ddots & \vdots \\ X_{1_{i=n}} & \cdots & X_{p_{i=n}} \end{bmatrix}

In the model above, another thing we were listing was the beta coefficients, from \beta_0 up to \beta_p.

\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}

And epsilon is another matrix \boldsymbol{\epsilon} with the same dimensions as Y above (one value in each equation).

Okay. In the model we can see that each beta coefficient multiplies all of the values in one column of the X matrix. We can also include \beta_0 if we think of it as always multiplying 1. This is where we find our first operation on matrices: matrix multiplication.

Matrix multiplication

Matrix multiplication means that you loop through the first matrix row-by-row and the second matrix column-by-column. For each pair of row and column, you multiply the paired numbers, sum the products into a single number, and collect the results into the output matrix row-by-row.
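
To make the loop picture concrete, here is a minimal NumPy sketch (the helper name matmul_loops and the example numbers are mine, just for illustration) that writes the loops out and checks them against NumPy’s built-in @ operator.

```python
import numpy as np

def matmul_loops(A, B):
    """Matrix product as explicit loops: for each row of A and column of B,
    multiply the paired numbers and sum the products into one output cell."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i in range(n):          # rows of the first matrix
        for j in range(m):      # columns of the second matrix
            out[i, j] = np.sum(A[i, :] * B[:, j])
    return out

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3-by-2
B = np.array([[10.0], [100.0]])                     # 2-by-1
print(matmul_loops(A, B))                           # same as the next line
print(A @ B)
```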

Let’s first add the 1 to X to go with \beta_0.

\mathbf{X} = \begin{bmatrix} 1 & X_{1_{i=1}} & \cdots & X_{p_{i=1}} \\ 1 & X_{1_{i=2}} & \cdots & X_{p_{i=2}} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1_{i=n}} & \cdots & X_{p_{i=n}} \end{bmatrix}

Now check what happens when we multiply

\mathbf{Y} = \mathbf{X} \boldsymbol{\beta}

The first row of X is paired with the values in the first (and only) column of \boldsymbol{\beta}. The pairs are multiplied and the results are summed together. This is exactly what needs to happen to produce the equation for the first row of data. The second row gives the second equation, and so on. And since \boldsymbol{\beta} has only one column, the result will have only one column: when you multiply an n-by-p matrix with a p-by-1 matrix, the result is an n-by-1 matrix.

This is just like Y, which has only one column and n rows. Matrix addition, on the other hand, just adds the corresponding cells of two matrices to produce a matrix with the exact same dimensions, so we can finally write the whole set of equations above as

\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
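
To see that the matrix form and the row-by-row equations agree, here is a small simulation sketch in NumPy; the sample size, the coefficient values, and the noise scale are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 100, 2                        # made-up sizes
X = np.column_stack([np.ones(n),     # column of 1s to go with beta_0
                     rng.normal(size=(n, p))])
beta = np.array([0.5, 2.0, -1.0])    # made-up coefficients beta_0, beta_1, beta_2
eps = rng.normal(size=n)             # one epsilon per row/equation

Y = X @ beta + eps                   # all n equations at once

# The same thing written row by row, as in the per-observation equations above:
Y_loop = np.array([beta[0] + beta[1] * X[i, 1] + beta[2] * X[i, 2] + eps[i]
                   for i in range(n)])
print(np.allclose(Y, Y_loop))        # True
```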

Applying some one-to-one scalar function to a matrix means applying it to every value in the matrix, say, the inverse of a link function (in generalized linear models)…

E[\mathbf{Y}|\mathbf{X}] = g^{-1}(\mathbf{X} \boldsymbol{\beta})
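
For instance, with a log link (as in Poisson regression) the inverse link is the exponential function, and applying it to \mathbf{X}\boldsymbol{\beta} just means exponentiating every element. A minimal sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.1, 0.3])     # made-up coefficients

eta = X @ beta                  # the linear predictor, one value per row
mu = np.exp(eta)                # inverse of the log link, applied elementwise
print(eta.shape, mu.shape)      # both (5,): same shape, every entry transformed
```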

When matrices are multiplied, their rows and columns are paired, the paired numbers are multiplied, and the results are summed into single values. For a pair of vectors, this operation is called a dot product. In the matrix version of the dot product, you can use the transpose operation, which flips a matrix around its diagonal so that rows become columns.

\mathbf{Y}^T \boldsymbol{\beta} = \boldsymbol{\beta}^T \mathbf{Y} = s

s is a single number, a scalar, which in this case doesn’t make any sense — just a side-note about the dot product which is sometimes needed in writing statistical models programmatically.
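
As a quick aside in code (with made-up numbers, and with the two column matrices given equal length so the product is defined), transposing one of them turns the matrix product into an ordinary dot product:

```python
import numpy as np

y = np.array([[1.0], [2.0], [3.0]])   # 3-by-1 column matrix
b = np.array([[4.0], [5.0], [6.0]])   # 3-by-1 column matrix

s1 = (y.T @ b).item()                 # 1-by-3 times 3-by-1 gives a 1-by-1 matrix
s2 = (b.T @ y).item()
s3 = np.dot(y.ravel(), b.ravel())     # the plain vector dot product
print(s1, s2, s3)                     # all 32.0
```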

Matrix calculus

By the way, when we use such linear models, a better way to think about what they mean might be to consider the corresponding differential equation. Where does a model like this come from?

\boldsymbol{\mu} = \mathbf{X} \boldsymbol{\beta}

\mathbf{Y} \sim \text{Gaussian}(\boldsymbol{\mu}, \boldsymbol{\sigma})

\boldsymbol{\sigma} \sim \text{Exponential}(1)

For the linear model, we are effectively making the assumption that the rate of change of Y is constant when each of X_1, ..., X_p changes a tiny bit while the other variables are held constant. It doesn’t depend on Y or anything else.

\frac{\partial Y}{\partial X} = \boldsymbol{\beta}^T \otimes \mathbf{I}_{n}

What is this weird right-hand thing, apart from the beta coefficients? The weird operator is the Kronecker “direct” product, which means that you take each value in beta and multiply it with the identity matrix (1 on the diagonal and 0 elsewhere). The resulting matrix has blocks as large as I, arranged in the shape of the left multiplier. Integrating the right-hand side over X nevertheless gives you the linear model \mathbf{X}\boldsymbol{\beta}.
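
A minimal sketch of what the Kronecker product produces here; the coefficient values and sizes are made up. Each coefficient multiplies a copy of the identity matrix, and the copies are laid out in the shape of \boldsymbol{\beta}^T.

```python
import numpy as np

beta = np.array([[2.0], [3.0]])   # p = 2 made-up coefficients
n = 3
I_n = np.eye(n)

J = np.kron(beta.T, I_n)          # beta^T (x) I_n: blocks [2*I_3, 3*I_3] side by side
print(J.shape)                    # (3, 6): n rows, n*p columns
print(J)
```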

This is not super relevant, but let me make it clearer. The derivative of a matrix function of a matrix can be defined as

\frac{\partial\, \text{vec}(f(X))}{\partial\, \text{vec}(X)^T} = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_{pq}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_{mn}(x)}{\partial x_1} & \cdots & \frac{\partial f_{mn}(x)}{\partial x_{pq}} \end{bmatrix}

vec is an operation that stacks the columns of a matrix into a single column. The output here is an m-by-n matrix and the input X is a p-by-q matrix. So we just turn both into vectors and then do what you would do for a vector function of a vector.

All in all, “the derivative” is a matrix of partial derivatives (a so-called Jacobian) — its transpose is called a gradient and the second derivative is called the Hessian. Remembering that the matrix function represents many individual functions, we can see that the full partial-derivative matrix should contain all of the partial derivatives of the individual functions. The partial derivatives should reflect the dependencies described by the linear system of functions (for example, observation 10 shouldn’t affect observation 1).

Let’s just check how this should work for the linear model.

We have an n-by-1 matrix function (the output is Y) of an n-by-p matrix (the input is X), so the derivative matrix should be n-by-np. The first block created by the Kronecker product is for the first coefficient, which multiplies the first column of X. Each element of Y has a nonzero partial derivative only with respect to the X values in the same row; the partial derivatives off the diagonal of each block are zero, so that the Y in observation 5 doesn’t depend on the X in observation 10.

Note how the off-diagonal zeros come from the Kronecker product with the identity matrix (which has 1 on the diagonal and 0 off the diagonal).
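
To check this block structure numerically, here is a small sketch (sizes and values made up) that compares a finite-difference Jacobian of \mathbf{X}\boldsymbol{\beta} with respect to vec(\mathbf{X}) against \boldsymbol{\beta}^T \otimes \mathbf{I}_n:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 3, 2
X = rng.normal(size=(n, p))
beta = rng.normal(size=(p, 1))

def f(vec_x):
    """Y = X beta as a function of vec(X) (the columns of X stacked into one vector)."""
    X_mat = vec_x.reshape((n, p), order="F")   # undo the column stacking
    return (X_mat @ beta).ravel()

# Finite-difference Jacobian of f with respect to vec(X)
v0 = X.ravel(order="F")
h = 1e-6
J_num = np.zeros((n, n * p))
for j in range(n * p):
    v = v0.copy()
    v[j] += h
    J_num[:, j] = (f(v) - f(v0)) / h

J_kron = np.kron(beta.T, np.eye(n))            # the claimed analytic Jacobian
print(np.allclose(J_num, J_kron, atol=1e-4))   # True
```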

Okay. Our little introduction to matrix calculus started from the point that a linear regression such as ours assumes constant partial derivatives. Similar “derivations” from differential equations can be made for other models. For example, logistic regression models, written in logit form like…

\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i]) = \operatorname{logit}(p_i)=\ln\left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i

or in logistic form (logit and logistic are inverses) like

\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = \frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}

as the logistic function is like

p(x) = \frac{1}{1+e^{-x}}

… can be seen as a consequence of assuming that the derivative (rate of change) depends on the value itself, multiplied by its distance from a maximum like…

f'(x) = r \times \left(1- \frac{f(x)}{K}\right) \times f(x)

r is a parameter for growth and K is the maximum of the function. In logistic regression, we are modelling a probability, which has a natural maximum value of 1. This assumption implies that growth is roughly exponential at first and then slows down towards 0 as the function approaches its maximum. This creates the well-known S-shape.
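
As a small sanity check, here is a sketch that Euler-integrates this differential equation; with r = 1, K = 1, and the made-up starting value f(0) = 0.5, the exact solution is the standard logistic function, and the numerical curve approximately recovers it.

```python
import numpy as np

r, K = 1.0, 1.0                      # unit growth rate and maximum
x = np.linspace(0.0, 6.0, 601)
dx = x[1] - x[0]

# Euler steps for f'(x) = r * (1 - f/K) * f, starting from f(0) = 0.5
f = np.empty_like(x)
f[0] = 0.5
for t in range(len(x) - 1):
    f[t + 1] = f[t] + dx * r * (1.0 - f[t] / K) * f[t]

logistic = 1.0 / (1.0 + np.exp(-x))  # the closed-form S-curve
print(np.max(np.abs(f - logistic)))  # small: the Euler curve tracks the logistic function
```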

Matrix interpretation

A great thing about matrices is that all of them can be thought of as linear transformations of space (functions mapping points in one space to points in another space) in a simple way. The j columns of a matrix tell you the points where the j perpendicular unit vectors (the coordinate system!) of the j-dimensional input space land during the linear transformation. For example, a matrix with one column and p rows sends the single number 1 in 1D space to a specific point in p-dimensional space.

Moreover, matrix multiplication means that those transformations are composed right to left. Xβ means applying β first and then applying X, and the resulting matrix is the net transformation of applying them both, which in our case should be close to Y alone.
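
A small sketch of both ideas with made-up matrices: the columns of a matrix are where the input unit vectors land, and multiplying matrices composes the transformations right to left.

```python
import numpy as np

M = np.array([[2.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])               # maps 2D points into 3D space

e1 = np.array([1.0, 0.0])                # the unit vectors of the input space
e2 = np.array([0.0, 1.0])
print(M @ e1)                            # the first column of M
print(M @ e2)                            # the second column of M

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])              # a 90-degree rotation of the plane
v = np.array([1.0, 2.0])
print(M @ (A @ v), (M @ A) @ v)          # composing first gives the same point
```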

Many more matrix operations make intuitive sense:

  • The inverse of a matrix M^{-1} is the reverse of the transformation.
  • The determinant of a matrix \text{det}(M) measures what the transformation does to areas in the input space.
  • The eigenvectors of a matrix are the points in the coordinate system that get transformed only by scaling (they move in the direction that they point — their spans). They are like the axes of rotation if the transformation is a rotation. The eigenvalues then measure how much the eigenvectors are scaled.
  • A transformation can be represented in coordinate systems other than the unit vectors by applying the right transformation and its inverse before and after the transformation of interest, like A^{-1}MA. (See the sketch after this list.)
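
Here is a short NumPy sketch of these operations on a made-up 2-by-2 transformation; the matrices M and A are chosen only for illustration.

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [0.0, 3.0]])               # a made-up transformation of the plane

M_inv = np.linalg.inv(M)                 # reverses the transformation
print(M_inv @ M)                         # identity (up to rounding)

print(np.linalg.det(M))                  # 6.0: areas get scaled by this factor

vals, vecs = np.linalg.eig(M)            # eigenvalues and eigenvectors (columns)
v = vecs[:, 0]
print(M @ v, vals[0] * v)                # the eigenvector is only scaled

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])               # a made-up change of basis
print(np.linalg.inv(A) @ M @ A)          # the same transformation in the new basis
```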