Linear algebra is the language of statistical modeling. Every regression model, every
coefficient estimate, every standard error ultimately comes from matrix operations. When
you call model("y ~ x", data).fit(), bossanova is decomposing matrices, solving linear
systems, and computing determinants behind the scenes.
This page builds intuition for the key decompositions bossanova uses internally, so you can understand why the math works rather than just trusting that it does. We start with concrete examples, show you the operations in plain NumPy, and then connect them to the formal theorems that bossanova’s property-based tests verify.
What is a design matrix?¶
In regression, we arrange our data into a matrix where each row is an observation and each column is a predictor. The first column is typically all 1s, representing the intercept -- the baseline value of the outcome when all predictors are zero. The remaining columns hold the actual predictor values: continuous measurements, binary indicators, or any other numeric encoding of your variables.
```python
import numpy as np

# A small design matrix: 5 observations, 3 predictors
# Column 0: intercept (all 1s)
# Column 1: a continuous predictor (e.g., dosage)
# Column 2: a binary predictor (e.g., treatment vs control)
X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])
print("Design matrix X:")
print(X)
print(f"\nShape: {X.shape} (rows = observations, columns = predictors)")
```

```
Design matrix X:
[[1.  0.5 0. ]
 [1.  1.2 1. ]
 [1.  2.3 0. ]
 [1.  3.1 1. ]
 [1.  4.  0. ]]

Shape: (5, 3) (rows = observations, columns = predictors)
```
Each row is one observation. Each column is one predictor. The model’s job is to find coefficients $\boldsymbol{\beta}$ such that $\mathbf{X}\boldsymbol{\beta} \approx \mathbf{y}$, where $\mathbf{y}$ is the vector of outcomes.
QR Decomposition¶
The problem¶
We need to solve $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$, but $\mathbf{X}$ is not square -- it has more rows than columns (more observations than predictors). We cannot simply invert it. The classical approach is the normal equations $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$, but forming $\mathbf{X}^\top\mathbf{X}$ squares the condition number and loses numerical precision. We need a better path.
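The precision loss is easy to see numerically: the condition number of $\mathbf{X}^\top\mathbf{X}$ is the square of the condition number of $\mathbf{X}$. A quick sketch in plain NumPy, reusing the example design matrix from this page:

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])

# cond(X'X) = cond(X)**2: forming the normal equations doubles
# the number of decimal digits lost to rounding
cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)
print(f"cond(X)    = {cond_X:.2f}")
print(f"cond(X'X)  = {cond_XtX:.2f}")
print(f"cond(X)**2 = {cond_X**2:.2f}")
```

For this well-behaved matrix the squaring is harmless, but for an ill-conditioned one it can be the difference between a usable answer and noise.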
Building intuition¶
QR decomposition factors $\mathbf{X}$ into two matrices, $\mathbf{X} = \mathbf{Q}\mathbf{R}$, with $\mathbf{Q}$ (orthogonal) and $\mathbf{R}$ (upper triangular). “Orthogonal” means $\mathbf{Q}$’s columns are perpendicular unit vectors, so $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$. “Upper triangular” means $\mathbf{R}$ has zeros below the diagonal, which makes $\mathbf{R}\boldsymbol{\beta} = \mathbf{Q}^\top\mathbf{y}$ trivial to solve by back-substitution.
```python
# QR decomposition of our design matrix
Q, R = np.linalg.qr(X)
print("Q (orthogonal matrix):")
print(Q)
print(f"\nR (upper triangular matrix):")
print(R)

# Key property: Q'Q = I
print(f"\nQ'Q (should be identity):")
print(Q.T @ Q)

# Reconstruction: Q @ R = X
print(f"\nQ @ R (should equal X):")
print(Q @ R)
```

```
Q (orthogonal matrix):
[[-0.4472 -0.6101 -0.3932]
 [-0.4472 -0.3618  0.5319]
 [-0.4472  0.0284 -0.3642]
 [-0.4472  0.3121  0.5625]
 [-0.4472  0.6314 -0.3369]]

R (upper triangular matrix):
[[-2.2361 -4.9641 -0.8944]
 [ 0.      2.8192 -0.0497]
 [ 0.      0.      1.0943]]

Q'Q (should be identity):
[[ 1. -0. -0.]
 [-0.  1. -0.]
 [-0. -0.  1.]]

Q @ R (should equal X):
[[ 1.   0.5 -0. ]
 [ 1.   1.2  1. ]
 [ 1.   2.3 -0. ]
 [ 1.   3.1  1. ]
 [ 1.   4.  -0. ]]
```
Why QR makes solving easy¶
Since $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$, the normal equations simplify dramatically. Starting from $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$ and substituting $\mathbf{X} = \mathbf{Q}\mathbf{R}$:

$$\mathbf{R}^\top\mathbf{Q}^\top\mathbf{Q}\mathbf{R}\boldsymbol{\beta} = \mathbf{R}^\top\mathbf{Q}^\top\mathbf{y}$$

Since $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ and $\mathbf{R}^\top$ is invertible, this collapses to $\mathbf{R}\boldsymbol{\beta} = \mathbf{Q}^\top\mathbf{y}$, which we solve by back-substitution.
```python
# Create a response variable
y = np.array([1.1, 2.5, 1.8, 4.2, 3.0])

# Solve via QR: R @ beta = Q'y
Qty = Q.T @ y
beta = np.linalg.solve(R, Qty)
print("Q'y:", Qty)
print("Coefficients (beta):", beta)
print("Fitted values (X @ beta):", X @ beta)
print("Residuals:", y - X @ beta)
```

```
Q'y: [-5.6349  1.6806  1.5932]
Coefficients (beta): [0.5573 0.6218 1.4559]
Fitted values (X @ beta): [0.8682 2.7593 1.9874 3.9407 3.0444]
Residuals: [ 0.2318 -0.2593 -0.1874  0.2593 -0.0444]
```
The formal statement¶
Theorem LA.1--3: For any matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ with full column rank, the QR decomposition $\mathbf{X} = \mathbf{Q}\mathbf{R}$ exists, where $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ and $\mathbf{R}$ is upper triangular and invertible.
Why does this work?
The orthogonality of $\mathbf{Q}$ means we can project $\mathbf{y}$ onto the column space of $\mathbf{X}$ by simply computing $\mathbf{Q}\mathbf{Q}^\top\mathbf{y}$. The triangular structure of $\mathbf{R}$ means we can find the coefficients by back-substitution -- no matrix inversion needed. This avoids forming $\mathbf{X}^\top\mathbf{X}$, which preserves numerical precision.
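The projection claim can be checked directly. A short sketch, reusing the example X and y from this page: $\mathbf{Q}\mathbf{Q}^\top\mathbf{y}$ (the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$) coincides with the fitted values from the QR solve.

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])
y = np.array([1.1, 2.5, 1.8, 4.2, 3.0])

Q, R = np.linalg.qr(X)
beta = np.linalg.solve(R, Q.T @ y)

# Q Q'y projects y onto the column space of X -- exactly the fitted values
fitted_via_projection = Q @ (Q.T @ y)
fitted_via_beta = X @ beta
print("Q Q'y:   ", fitted_via_projection)
print("X @ beta:", fitted_via_beta)
```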
Seeing it in bossanova¶
```python
from bossanova.internal.maths.linalg import qr_solve

result = qr_solve(X, y)
print("Coefficients:", result.coef)
print("Residuals:", result.residuals[:5])
```

```
Coefficients: [0.5573 0.6218 1.4559]
Residuals: [ 0.2318 -0.2593 -0.1874  0.2593 -0.0444]
```
Singular Value Decomposition (SVD)¶
The problem¶
QR works well when has full column rank. But what if columns are nearly collinear? For example, if two predictors are almost identical, the system is ill-conditioned and small changes in the data can produce wildly different coefficient estimates. We need a decomposition that reveals these problems.
Building intuition¶
SVD decomposes $\mathbf{X}$ into three matrices: $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$. The singular values in $\boldsymbol{\Sigma}$ tell you how “important” each direction in the predictor space is. Large singular values correspond to directions with lots of variation in the data; near-zero singular values signal collinearity problems.
```python
# SVD of our design matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(f"U shape: {U.shape} (observations x rank)")
print(f"S (singular values): {S}")
print(f"Vt shape: {Vt.shape} (rank x predictors)")

# Reconstruction: U @ diag(S) @ Vt = X
X_reconstructed = U @ np.diag(S) @ Vt
print(f"\nReconstruction error: {np.max(np.abs(X - X_reconstructed)):.2e}")
```

```
U shape: (5, 3) (observations x rank)
S (singular values): [6.0939 1.3043 0.8679]
Vt shape: (3, 3) (rank x predictors)

Reconstruction error: 2.66e-15
```
The formal statement¶
Theorem LA.4: For any matrix $\mathbf{X}$, the SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ exists and satisfies $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{X}$.
Why does this work?
The SVD always exists, even for rank-deficient matrices. $\mathbf{U}$ and $\mathbf{V}$ are orthogonal, and $\boldsymbol{\Sigma}$ is diagonal with non-negative entries. The singular values are the square roots of the eigenvalues of $\mathbf{X}^\top\mathbf{X}$. This means the SVD simultaneously diagonalizes $\mathbf{X}^\top\mathbf{X}$ and $\mathbf{X}\mathbf{X}^\top$, giving you the most informative decomposition of any matrix.
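The eigenvalue connection is easy to verify numerically. A sketch, reusing the example design matrix: the singular values of $\mathbf{X}$ match the square roots of the eigenvalues of $\mathbf{X}^\top\mathbf{X}$.

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])

_, S, _ = np.linalg.svd(X, full_matrices=False)

# eigvalsh returns eigenvalues in ascending order; reverse to match
# the descending order of the singular values
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]
print("singular values:    ", S)
print("sqrt of eigenvalues:", np.sqrt(eigvals))
```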
Condition numbers¶
The condition number is the ratio of the largest to the smallest singular value. It measures how sensitive the solution is to perturbations in the data. A condition number near 1 means the system is well-conditioned; a condition number above 1000 is a warning sign; above $10^{10}$ means you are likely losing most of your floating-point precision.
```python
# Condition number of our well-behaved design matrix
cond_good = S[0] / S[-1]
print(f"Condition number (good X): {cond_good:.1f}")

# Now create a nearly collinear design matrix
X_collinear = np.column_stack([
    np.ones(5),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([1.001, 2.002, 3.003, 4.004, 5.005]),  # almost identical to column 1
])
_, S_bad, _ = np.linalg.svd(X_collinear, full_matrices=False)
cond_bad = S_bad[0] / S_bad[-1]
print(f"Singular values (collinear X): {S_bad}")
print(f"Condition number (collinear X): {cond_bad:.1f}")
print(f"\nThe collinear matrix is {cond_bad/cond_good:.0f}x harder to solve stably.")
```

```
Condition number (good X): 7.0
Singular values (collinear X): [10.688   0.9361  0.    ]
Condition number (collinear X): 14790687026641976.0

The collinear matrix is 2106610230376675x harder to solve stably.
```
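A useful rule of thumb (an approximation, not part of bossanova’s API): solving a linear system loses roughly $\log_{10}(\kappa)$ decimal digits of precision, where $\kappa$ is the condition number. For the collinear matrix above, that is essentially all of float64’s ~16 digits:

```python
import numpy as np

X_collinear = np.column_stack([
    np.ones(5),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([1.001, 2.002, 3.003, 4.004, 5.005]),
])
_, S_bad, _ = np.linalg.svd(X_collinear, full_matrices=False)
cond = S_bad[0] / S_bad[-1]

# Roughly log10(condition number) decimal digits are lost to rounding
digits_lost = np.log10(cond)
print(f"condition number: {cond:.2e}")
print(f"~{digits_lost:.0f} of float64's ~16 decimal digits lost")
```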
The Pseudoinverse¶
The problem¶
When $\mathbf{X}$ is rank-deficient (or nearly so), the normal equations do not have a unique solution. The standard inverse $\mathbf{X}^{-1}$ does not exist for non-square matrices, and even $(\mathbf{X}^\top\mathbf{X})^{-1}$ fails when columns are linearly dependent.
Building intuition¶
The Moore-Penrose pseudoinverse $\mathbf{X}^{+}$ generalizes matrix inversion. It gives the “best” least-squares solution even when a true inverse does not exist. The key property is that applying $\mathbf{X}$, then $\mathbf{X}^{+}$, then $\mathbf{X}$ again gives you back the original: $\mathbf{X}\mathbf{X}^{+}\mathbf{X} = \mathbf{X}$.
```python
# Pseudoinverse of our design matrix
X_pinv = np.linalg.pinv(X)
print(f"X shape: {X.shape}")
print(f"X+ shape: {X_pinv.shape}")

# Key property: X @ X+ @ X = X
reconstructed = X @ X_pinv @ X
print(f"\nX @ X+ @ X (should equal X):")
print(reconstructed)
print(f"Max error: {np.max(np.abs(X - reconstructed)):.2e}")
```

```
X shape: (5, 3)
X+ shape: (3, 5)

X @ X+ @ X (should equal X):
[[ 1.   0.5  0. ]
 [ 1.   1.2  1. ]
 [ 1.   2.3  0. ]
 [ 1.   3.1  1. ]
 [ 1.   4.  -0. ]]
Max error: 2.66e-15
```
The formal statement¶
Theorem LA.5: For any matrix $\mathbf{X}$, the pseudoinverse $\mathbf{X}^{+}$ satisfies $\mathbf{X}\mathbf{X}^{+}\mathbf{X} = \mathbf{X}$.
Why does this work?
$\mathbf{X}^{+}$ inverts $\mathbf{X}$ on its column space and projects to zero on the null space. Applying $\mathbf{X}\mathbf{X}^{+}\mathbf{X}$ recovers the original because $\mathbf{X}\mathbf{X}^{+}$ is itself a projection onto the column space of $\mathbf{X}$. Anything in the column space passes through unchanged, and anything outside it was never representable by $\mathbf{X}$ in the first place.
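To make this concrete, here is a sketch applying `np.linalg.pinv` to the collinear matrix from the condition-number example, where $(\mathbf{X}^\top\mathbf{X})^{-1}$ would be numerically meaningless. The pseudoinverse truncates the near-zero singular value and returns the minimum-norm least-squares solution; the residuals stay orthogonal to the column space, as least-squares residuals should.

```python
import numpy as np

X_collinear = np.column_stack([
    np.ones(5),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([1.001, 2.002, 3.003, 4.004, 5.005]),  # collinear with column 1
])
y = np.array([1.1, 2.5, 1.8, 4.2, 3.0])

# pinv drops near-zero singular values, yielding the minimum-norm
# least-squares solution instead of an exploding "exact" inverse
beta = np.linalg.pinv(X_collinear) @ y
fitted = X_collinear @ beta
print("beta:", beta)
print("residual norm:", np.linalg.norm(y - fitted))
```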
Log-Determinant via Cholesky¶
The problem¶
Mixed models need the log-determinant of large covariance matrices for likelihood evaluation. Computing the determinant directly is numerically dangerous (it can overflow or underflow for large matrices) and expensive. We need a stable, efficient alternative.
Building intuition¶
For a positive-definite matrix $\mathbf{A}$, the Cholesky factorization $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$ gives us the determinant cheaply. Since $\det(\mathbf{A}) = \det(\mathbf{L})^2$ and $\det(\mathbf{L})$ is just the product of its diagonal entries (because $\mathbf{L}$ is triangular), the log-determinant becomes a simple sum of logarithms.
```python
# X'X is positive definite (when X has full column rank)
A = X.T @ X
L = np.linalg.cholesky(A)
logdet_direct = np.log(np.linalg.det(A))
logdet_cholesky = 2 * np.sum(np.log(np.diag(L)))
print(f"Direct: {logdet_direct:.6f}")
print(f"Cholesky: {logdet_cholesky:.6f}")
print(f"Match: {np.isclose(logdet_direct, logdet_cholesky)}")
```

```
Direct: 3.862623
Cholesky: 3.862623
Match: True
```
The formal statement¶
Theorem LA.7: For positive definite $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$, $\log\det(\mathbf{A}) = 2\sum_{i}\log L_{ii}$.
Why does this work?
$\det(\mathbf{A}) = \det(\mathbf{L})\det(\mathbf{L}^\top) = \prod_i L_{ii}^2$. Taking logs converts the product to a sum: $\log\det(\mathbf{A}) = 2\sum_i \log L_{ii}$. Once the Cholesky factorization is done, this reduces the determinant computation to an $O(n)$ sum. Working in log space also avoids the overflow and underflow problems that plague direct determinant computation for large matrices.
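The overflow problem is easy to demonstrate. A sketch with a large, deliberately simple positive-definite matrix (illustrative, not drawn from bossanova): its determinant exceeds float64’s range, but the log-determinant is perfectly representable.

```python
import numpy as np

# 1200 x 1200 positive definite matrix with det = 2**1200, which
# overflows float64 (max ~1.8e308)
A = 2.0 * np.eye(1200)

print("direct det:", np.linalg.det(A))    # overflows to inf

L = np.linalg.cholesky(A)                 # L = sqrt(2) * I
logdet = 2 * np.sum(np.log(np.diag(L)))   # = 1200 * log(2)
print("log-determinant via Cholesky:", logdet)
```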
Summary¶
| Theorem | Statement | What it means |
|---|---|---|
| LA.1--3 | QR decomposition $\mathbf{X} = \mathbf{Q}\mathbf{R}$ exists with $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ | Efficient, stable linear system solving |
| LA.4 | SVD reconstruction $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{X}$ | Universal decomposition, reveals rank |
| LA.5 | Pseudoinverse $\mathbf{X}\mathbf{X}^{+}\mathbf{X} = \mathbf{X}$ | Generalized inverse for rank-deficient systems |
| LA.7 | $\log\det(\mathbf{A}) = 2\sum_i \log L_{ii}$ | Stable log-determinants for mixed models |