
Linear Algebra Foundations

UCSD Psychology

Linear algebra is the language of statistical modeling. Every regression model, every coefficient estimate, every standard error ultimately comes from matrix operations. When you call model("y ~ x", data).fit(), bossanova is decomposing matrices, solving linear systems, and computing determinants behind the scenes.

This page builds intuition for the key decompositions bossanova uses internally, so you can understand why the math works rather than just trusting that it does. We start with concrete examples, show you the operations in plain NumPy, and then connect them to the formal theorems that bossanova’s property-based tests verify.

What is a design matrix?

In regression, we arrange our data into a matrix $\mathbf{X}$ where each row is an observation and each column is a predictor. The first column is typically all 1s, representing the intercept -- the baseline value of the outcome when all predictors are zero. The remaining columns hold the actual predictor values: continuous measurements, binary indicators, or any other numeric encoding of your variables.

import numpy as np

# A small design matrix: 5 observations, 3 predictors
# Column 0: intercept (all 1s)
# Column 1: a continuous predictor (e.g., dosage)
# Column 2: a binary predictor (e.g., treatment vs control)
X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])
print("Design matrix X:")
print(X)
print(f"\nShape: {X.shape}  (rows = observations, columns = predictors)")
Design matrix X:
[[1.  0.5 0. ]
 [1.  1.2 1. ]
 [1.  2.3 0. ]
 [1.  3.1 1. ]
 [1.  4.  0. ]]

Shape: (5, 3)  (rows = observations, columns = predictors)

Each row is one observation. Each column is one predictor. The model’s job is to find coefficients $\boldsymbol{\beta}$ such that $\mathbf{X}\boldsymbol{\beta} \approx \mathbf{y}$, where $\mathbf{y}$ is the vector of outcomes.
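Before any decompositions, NumPy's general-purpose least-squares routine shows what "solving" $\mathbf{X}\boldsymbol{\beta} \approx \mathbf{y}$ means in practice. A minimal sketch (plain NumPy, not bossanova code), with the small design matrix redefined so the snippet stands alone:

```python
import numpy as np

# The same 5x3 design matrix as above, plus a response vector
X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])
y = np.array([1.1, 2.5, 1.8, 4.2, 3.0])

# X has more rows than columns, so X @ beta = y generally has no exact
# solution; lstsq returns the beta minimizing ||X @ beta - y||
beta, ss_resid, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
print("beta:", beta)
print("rank of X:", rank)
```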


QR Decomposition

The problem

We need to solve $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$, but $\mathbf{X}$ is not square -- it has more rows than columns (more observations than predictors). We cannot simply invert it. The classical approach is the normal equations $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$, but forming $\mathbf{X}^\top\mathbf{X}$ squares the condition number and loses numerical precision. We need a better path.
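The "squares the condition number" claim can be checked directly. A quick numerical sketch with plain NumPy (not bossanova code), reusing the small design matrix:

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])

cond_X = np.linalg.cond(X)          # largest / smallest singular value
cond_XtX = np.linalg.cond(X.T @ X)  # conditioning of the normal equations

print(f"cond(X)    = {cond_X:.2f}")
print(f"cond(X'X)  = {cond_XtX:.2f}")
print(f"cond(X)**2 = {cond_X**2:.2f}")  # same as cond(X'X)
```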

Building intuition

QR decomposition factors $\mathbf{X}$ into two matrices: $\mathbf{Q}$ (orthogonal) and $\mathbf{R}$ (upper triangular). “Orthogonal” means $\mathbf{Q}$'s columns are perpendicular unit vectors, so $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$. “Upper triangular” means $\mathbf{R}$ has zeros below the diagonal, which makes systems involving $\mathbf{R}$ trivial to solve by back-substitution.

# QR decomposition of our design matrix
Q, R = np.linalg.qr(X)

print("Q (orthogonal matrix):")
print(Q)
print(f"\nR (upper triangular matrix):")
print(R)

# Key property: Q'Q = I
print(f"\nQ'Q (should be identity):")
print(Q.T @ Q)

# Reconstruction: Q @ R = X
print(f"\nQ @ R (should equal X):")
print(Q @ R)
Q (orthogonal matrix):
[[-0.4472 -0.6101 -0.3932]
 [-0.4472 -0.3618  0.5319]
 [-0.4472  0.0284 -0.3642]
 [-0.4472  0.3121  0.5625]
 [-0.4472  0.6314 -0.3369]]

R (upper triangular matrix):
[[-2.2361 -4.9641 -0.8944]
 [ 0.      2.8192 -0.0497]
 [ 0.      0.      1.0943]]

Q'Q (should be identity):
[[ 1. -0. -0.]
 [-0.  1. -0.]
 [-0. -0.  1.]]

Q @ R (should equal X):
[[ 1.   0.5 -0. ]
 [ 1.   1.2  1. ]
 [ 1.   2.3 -0. ]
 [ 1.   3.1  1. ]
 [ 1.   4.  -0. ]]

Why QR makes solving easy

Since $\mathbf{X} = \mathbf{QR}$, the normal equations simplify dramatically. Starting from $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$:

$$\mathbf{R}^\top\mathbf{Q}^\top\mathbf{Q}\mathbf{R}\boldsymbol{\beta} = \mathbf{R}^\top\mathbf{Q}^\top\mathbf{y}$$

Since $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ and $\mathbf{R}^\top$ is invertible (so it cancels from both sides), this collapses to $\mathbf{R}\boldsymbol{\beta} = \mathbf{Q}^\top\mathbf{y}$, which we solve by back-substitution.

# Create a response variable
y = np.array([1.1, 2.5, 1.8, 4.2, 3.0])

# Solve via QR: R * beta = Q'y
Qty = Q.T @ y
beta = np.linalg.solve(R, Qty)  # a triangular solver (e.g. scipy.linalg.solve_triangular) would exploit R's structure

print("Q'y:", Qty)
print("Coefficients (beta):", beta)
print("Fitted values (X @ beta):", X @ beta)
print("Residuals:", y - X @ beta)
Q'y: [-5.6349  1.6806  1.5932]
Coefficients (beta): [0.5573 0.6218 1.4559]
Fitted values (X @ beta): [0.8682 2.7593 1.9874 3.9407 3.0444]
Residuals: [ 0.2318 -0.2593 -0.1874  0.2593 -0.0444]

The formal statement

Theorem LA.1--3: For any matrix $\mathbf{X}$ with full column rank, the QR decomposition $\mathbf{X} = \mathbf{QR}$ exists, where $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ and $\mathbf{R}$ is upper triangular and invertible.

Why does this work?

The orthogonality of $\mathbf{Q}$ means that $\mathbf{Q}^\top\mathbf{y}$ gives the coordinates of $\mathbf{y}$'s projection onto the column space of $\mathbf{X}$. The triangular structure of $\mathbf{R}$ means we can find the coefficients by back-substitution -- no matrix inversion needed. This avoids forming $\mathbf{X}^\top\mathbf{X}$, which preserves numerical precision.
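To make the projection claim concrete, here is a small check in plain NumPy (the design matrix and response from above, redefined so the snippet stands alone): the fitted values from the QR solve equal $\mathbf{Q}\mathbf{Q}^\top\mathbf{y}$, the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$.

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])
y = np.array([1.1, 2.5, 1.8, 4.2, 3.0])

Q, R = np.linalg.qr(X)
beta = np.linalg.solve(R, Q.T @ y)

fitted = X @ beta           # X beta = Q R R^{-1} Q'y = Q Q'y
projection = Q @ (Q.T @ y)  # project y onto the column space of X
print(np.allclose(fitted, projection))  # True: the same vector
```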

Seeing it in bossanova

from bossanova.internal.maths.linalg import qr_solve

result = qr_solve(X, y)
print("Coefficients:", result.coef)
print("Residuals:", result.residuals[:5])
Coefficients: [0.5573 0.6218 1.4559]
Residuals: [ 0.2318 -0.2593 -0.1874  0.2593 -0.0444]

Singular Value Decomposition (SVD)

The problem

QR works well when $\mathbf{X}$ has full column rank. But what if columns are nearly collinear? For example, if two predictors are almost identical, the system is ill-conditioned and small changes in the data can produce wildly different coefficient estimates. We need a decomposition that reveals these problems.

Building intuition

SVD decomposes $\mathbf{X}$ into three matrices: $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$. The singular values in $\boldsymbol{\Sigma}$ tell you how “important” each direction in the predictor space is. Large singular values correspond to directions with lots of variation in the data; near-zero singular values signal collinearity problems.

# SVD of our design matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)

print(f"U shape: {U.shape}  (observations x rank)")
print(f"S (singular values): {S}")
print(f"Vt shape: {Vt.shape}  (rank x predictors)")

# Reconstruction: U @ diag(S) @ Vt = X
X_reconstructed = U @ np.diag(S) @ Vt
print(f"\nReconstruction error: {np.max(np.abs(X - X_reconstructed)):.2e}")
U shape: (5, 3)  (observations x rank)
S (singular values): [6.0939 1.3043 0.8679]
Vt shape: (3, 3)  (rank x predictors)

Reconstruction error: 2.66e-15

The formal statement

Theorem LA.4: For any matrix $\mathbf{X}$, the SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ exists and satisfies $\|\mathbf{X} - \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top\| = 0$.

Why does this work?

The SVD always exists, even for rank-deficient matrices. $\mathbf{U}$ and $\mathbf{V}$ are orthogonal, and $\boldsymbol{\Sigma}$ is diagonal with non-negative entries. The singular values are the square roots of the eigenvalues of $\mathbf{X}^\top\mathbf{X}$. This means the SVD simultaneously diagonalizes $\mathbf{X}^\top\mathbf{X}$ and $\mathbf{X}\mathbf{X}^\top$, giving you the most informative decomposition of any matrix.
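The eigenvalue connection is easy to verify numerically. A quick sketch in plain NumPy, with the design matrix redefined for self-containment:

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])

_, S, _ = np.linalg.svd(X, full_matrices=False)
eig = np.linalg.eigvalsh(X.T @ X)  # eigenvalues in ascending order

print("singular values:          ", S)
print("sqrt(eigenvalues of X'X): ", np.sqrt(eig)[::-1])  # reversed to match
```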

Condition numbers

The condition number is the ratio of the largest to the smallest singular value. It measures how sensitive the solution is to perturbations in the data. A condition number near 1 means the system is well-conditioned; a condition number above 1000 is a warning sign; above $10^{10}$ you are likely losing most of your floating-point precision.

# Condition number of our well-behaved design matrix
cond_good = S[0] / S[-1]
print(f"Condition number (good X): {cond_good:.1f}")

# Now create a nearly collinear design matrix
X_collinear = np.column_stack([
    np.ones(5),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([1.001, 2.002, 3.003, 4.004, 5.005]),  # almost identical to column 1
])
_, S_bad, _ = np.linalg.svd(X_collinear, full_matrices=False)
cond_bad = S_bad[0] / S_bad[-1]

print(f"Singular values (collinear X): {S_bad}")
print(f"Condition number (collinear X): {cond_bad:.1f}")
print(f"\nThe collinear matrix is {cond_bad/cond_good:.0f}x harder to solve stably.")
Condition number (good X): 7.0
Singular values (collinear X): [10.688   0.9361  0.    ]
Condition number (collinear X): 14790687026641976.0

The collinear matrix is 2106610230376675x harder to solve stably.

The Pseudoinverse

The problem

When $\mathbf{X}$ is rank-deficient (or nearly so), the normal equations do not have a unique solution. The standard inverse $\mathbf{X}^{-1}$ does not exist for non-square matrices, and even $(\mathbf{X}^\top\mathbf{X})^{-1}$ fails when columns are linearly dependent.
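A small sketch of the failure mode in plain NumPy (the nearly collinear matrix mirrors the one from the condition-number example; the response vector is made up for illustration): numerically the third column is a multiple of the second, the rank drops, yet the pseudoinverse still produces a well-defined least-squares fit.

```python
import numpy as np

X_collinear = np.column_stack([
    np.ones(5),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([1.001, 2.002, 3.003, 4.004, 5.005]),  # = 1.001 * previous column
])

rank = np.linalg.matrix_rank(X_collinear)
print("rank:", rank)  # 2, not 3 -- one column is numerically redundant

# (X'X)^{-1} is numerically meaningless here, but the pseudoinverse still
# gives the minimum-norm least-squares solution
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
beta = np.linalg.pinv(X_collinear) @ y
print("pinv solution:", beta)
print("residual norm:", np.linalg.norm(X_collinear @ beta - y))
```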

Building intuition

The Moore-Penrose pseudoinverse $\mathbf{X}^+$ generalizes matrix inversion. It gives the “best” least-squares solution even when a true inverse does not exist. The key property is that applying $\mathbf{X}$, then $\mathbf{X}^+$, then $\mathbf{X}$ again gives you back the original: $\mathbf{X}\mathbf{X}^+\mathbf{X} = \mathbf{X}$.

# Pseudoinverse of our design matrix
X_pinv = np.linalg.pinv(X)
print(f"X shape:    {X.shape}")
print(f"X+ shape:   {X_pinv.shape}")

# Key property: X @ X+ @ X = X
reconstructed = X @ X_pinv @ X
print(f"\nX @ X+ @ X (should equal X):")
print(reconstructed)
print(f"Max error: {np.max(np.abs(X - reconstructed)):.2e}")
X shape:    (5, 3)
X+ shape:   (3, 5)

X @ X+ @ X (should equal X):
[[ 1.   0.5  0. ]
 [ 1.   1.2  1. ]
 [ 1.   2.3  0. ]
 [ 1.   3.1  1. ]
 [ 1.   4.  -0. ]]
Max error: 2.66e-15

The formal statement

Theorem LA.5: For any matrix $\mathbf{X}$, the pseudoinverse $\mathbf{X}^+$ satisfies $\mathbf{X}\mathbf{X}^+\mathbf{X} = \mathbf{X}$.

Why does this work?

$\mathbf{X}^+$ inverts $\mathbf{X}$ on its column space and projects to zero on the null space. Applying $\mathbf{X}$ recovers the original because $\mathbf{X}\mathbf{X}^+$ is itself a projection onto the column space of $\mathbf{X}$. Anything in the column space passes through unchanged, and anything outside it was never representable by $\mathbf{X}$ in the first place.
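These claims can be checked numerically. A sketch in plain NumPy (design matrix redefined for self-containment): $\mathbf{X}\mathbf{X}^+$ is idempotent and symmetric, the two defining properties of an orthogonal projection, and it leaves vectors already in the column space untouched.

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])

P = X @ np.linalg.pinv(X)  # projector onto the column space of X

print("idempotent:", np.allclose(P @ P, P))
print("symmetric: ", np.allclose(P, P.T))

# A vector already in the column space passes through unchanged
v = X @ np.array([2.0, -1.0, 0.5])  # arbitrary point in the column space
print("unchanged: ", np.allclose(P @ v, v))
```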


Log-Determinant via Cholesky

The problem

Mixed models need the log-determinant of large covariance matrices for likelihood evaluation. Computing the determinant directly is numerically dangerous (it can overflow or underflow for large matrices) and expensive. We need a stable, efficient alternative.

Building intuition

For a positive-definite matrix $\mathbf{A}$, the Cholesky factorization $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$ gives us the determinant cheaply. Since $\det(\mathbf{A}) = \det(\mathbf{L})^2$ and $\det(\mathbf{L})$ is just the product of its diagonal entries (because $\mathbf{L}$ is triangular), the log-determinant becomes a simple sum of logarithms.

# X'X is positive definite (when X has full column rank)
A = X.T @ X
L = np.linalg.cholesky(A)

logdet_direct = np.log(np.linalg.det(A))
logdet_cholesky = 2 * np.sum(np.log(np.diag(L)))

print(f"Direct:   {logdet_direct:.6f}")
print(f"Cholesky: {logdet_cholesky:.6f}")
print(f"Match:    {np.isclose(logdet_direct, logdet_cholesky)}")
Direct:   3.862623
Cholesky: 3.862623
Match:    True

The formal statement

Theorem LA.7: For positive definite $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$, $\log\lvert\mathbf{A}\rvert = 2\sum\log(\operatorname{diag}(\mathbf{L}))$.

Why does this work?

$\det(\mathbf{A}) = \det(\mathbf{L})\det(\mathbf{L}^\top) = \det(\mathbf{L})^2 = (\prod_i L_{ii})^2$. Taking logs converts the product to a sum: $\log\lvert\mathbf{A}\rvert = 2\sum_i \log(L_{ii})$. This reduces an $O(n^3)$ determinant computation to $O(n)$ after the Cholesky factorization. Working in log space also avoids the overflow and underflow problems that plague direct determinant computation for large matrices.
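To see the overflow problem directly, here is a sketch in plain NumPy (the 500x500 diagonal matrix is a made-up example): the determinant itself overflows float64, while the Cholesky log-determinant is perfectly stable. NumPy's own np.linalg.slogdet exists for the same reason.

```python
import numpy as np

# A 500x500 positive-definite matrix whose determinant is 10**500,
# far beyond the float64 range (~1.8e308)
A = np.diag(np.full(500, 10.0))

print("direct determinant:", np.linalg.det(A))  # inf -- overflows float64

L = np.linalg.cholesky(A)
logdet = 2 * np.sum(np.log(np.diag(L)))  # = 500 * log(10), no overflow
print("Cholesky log-det:  ", logdet)

sign, logdet_np = np.linalg.slogdet(A)  # NumPy's stable log-determinant
print("np.linalg.slogdet: ", logdet_np)  # matches the Cholesky result
```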


Summary

| Theorem | Statement | What it means |
| --- | --- | --- |
| LA.1--3 | QR decomposition exists with $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ | Efficient, stable linear system solving |
| LA.4 | SVD reconstruction $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ | Universal decomposition, reveals rank |
| LA.5 | $\mathbf{X}\mathbf{X}^+\mathbf{X} = \mathbf{X}$ | Generalized inverse for rank-deficient systems |
| LA.7 | $\log\lvert\mathbf{A}\rvert = 2\sum\log(\operatorname{diag}(\mathbf{L}))$ | Fast, stable log-determinants |