Linear algebra is the language of statistical modeling. Every regression model, every
coefficient estimate, every standard error ultimately comes from matrix operations. When
you call model("y ~ x", data).fit(), bossanova is decomposing matrices, solving linear
systems, and computing determinants behind the scenes.
This page builds intuition for the key decompositions bossanova uses internally, so you can understand why the math works rather than just trusting that it does. We start with concrete examples, show you the operations in plain NumPy, and then connect them to the formal theorems that bossanova’s property-based tests verify.
What is a design matrix?¶
In regression, we arrange our data into a matrix where each row is an observation and each column is a predictor. The first column is typically all 1s, representing the intercept -- the baseline value of the outcome when all predictors are zero. The remaining columns hold the actual predictor values: continuous measurements, binary indicators, or any other numeric encoding of your variables.
```python
import numpy as np

# A small design matrix: 5 observations, 3 predictors
# Column 0: intercept (all 1s)
# Column 1: a continuous predictor (e.g., dosage)
# Column 2: a binary predictor (e.g., treatment vs control)
X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])
print("Design matrix X:")
print(X)
print(f"\nShape: {X.shape} (rows = observations, columns = predictors)")
```

```
Design matrix X:
[[1.  0.5 0. ]
 [1.  1.2 1. ]
 [1.  2.3 0. ]
 [1.  3.1 1. ]
 [1.  4.  0. ]]

Shape: (5, 3) (rows = observations, columns = predictors)
```
Each row is one observation. Each column is one predictor. The model’s job is to find coefficients $\boldsymbol{\beta}$ such that $\mathbf{X}\boldsymbol{\beta} \approx \mathbf{y}$, where $\mathbf{y}$ is the vector of outcomes.
QR Decomposition¶
The problem¶
We need to solve $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$, but $\mathbf{X}$ is not square -- it has more rows than columns (more observations than predictors). We cannot simply invert it. The classical approach is the normal equations $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$, but forming $\mathbf{X}^\top\mathbf{X}$ squares the condition number and loses numerical precision. We need a better path.
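The precision loss is easy to see numerically: the condition number of $\mathbf{X}^\top\mathbf{X}$ is the square of the condition number of $\mathbf{X}$. A quick sketch in plain NumPy, reusing the example design matrix from this page:

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])

# cond(X'X) = cond(X)**2: forming the normal equations doubles
# the number of decimal digits lost to rounding
cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)
print(f"cond(X)    = {cond_X:.2f}")
print(f"cond(X'X)  = {cond_XtX:.2f}")
print(f"cond(X)**2 = {cond_X**2:.2f}")
```

For this well-behaved matrix the squaring is harmless, but for an ill-conditioned one it can be the difference between a usable answer and noise.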
Building intuition¶
QR decomposition factors $\mathbf{X}$ into two matrices, $\mathbf{X} = \mathbf{Q}\mathbf{R}$, with $\mathbf{Q}$ (orthogonal) and $\mathbf{R}$ (upper triangular). “Orthogonal” means $\mathbf{Q}$’s columns are perpendicular unit vectors, so $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$. “Upper triangular” means $\mathbf{R}$ has zeros below the diagonal, which makes $\mathbf{R}\boldsymbol{\beta} = \mathbf{Q}^\top\mathbf{y}$ trivial to solve by back-substitution.
```python
# QR decomposition of our design matrix
Q, R = np.linalg.qr(X)
print("Q (orthogonal matrix):")
print(Q)
print(f"\nR (upper triangular matrix):")
print(R)

# Key property: Q'Q = I
print(f"\nQ'Q (should be identity):")
print(Q.T @ Q)

# Reconstruction: Q @ R = X
print(f"\nQ @ R (should equal X):")
print(Q @ R)
```

```
Q (orthogonal matrix):
[[-0.4472 -0.6101 -0.3932]
 [-0.4472 -0.3618  0.5319]
 [-0.4472  0.0284 -0.3642]
 [-0.4472  0.3121  0.5625]
 [-0.4472  0.6314 -0.3369]]

R (upper triangular matrix):
[[-2.2361 -4.9641 -0.8944]
 [ 0.      2.8192 -0.0497]
 [ 0.      0.      1.0943]]

Q'Q (should be identity):
[[ 1. -0. -0.]
 [-0.  1. -0.]
 [-0. -0.  1.]]

Q @ R (should equal X):
[[ 1.   0.5 -0. ]
 [ 1.   1.2  1. ]
 [ 1.   2.3 -0. ]
 [ 1.   3.1  1. ]
 [ 1.   4.  -0. ]]
```
Why QR makes solving easy¶
Since $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$, the normal equations simplify dramatically. Starting from $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$ and substituting $\mathbf{X} = \mathbf{Q}\mathbf{R}$:

$$\mathbf{R}^\top\mathbf{Q}^\top\mathbf{Q}\mathbf{R}\boldsymbol{\beta} = \mathbf{R}^\top\mathbf{Q}^\top\mathbf{y}$$

Since $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ and $\mathbf{R}^\top$ is invertible, this collapses to $\mathbf{R}\boldsymbol{\beta} = \mathbf{Q}^\top\mathbf{y}$, which we solve by back-substitution.
```python
# Create a response variable
y = np.array([1.1, 2.5, 1.8, 4.2, 3.0])

# Solve via QR: R @ beta = Q'y
Qty = Q.T @ y
beta = np.linalg.solve(R, Qty)
print("Q'y:", Qty)
print("Coefficients (beta):", beta)
print("Fitted values (X @ beta):", X @ beta)
print("Residuals:", y - X @ beta)
```

```
Q'y: [-5.6349  1.6806  1.5932]
Coefficients (beta): [0.5573 0.6218 1.4559]
Fitted values (X @ beta): [0.8682 2.7593 1.9874 3.9407 3.0444]
Residuals: [ 0.2318 -0.2593 -0.1874  0.2593 -0.0444]
```
The formal statement¶
Theorem LA.1--3: For any matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ with full column rank, the QR decomposition $\mathbf{X} = \mathbf{Q}\mathbf{R}$ exists, where $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ and $\mathbf{R}$ is upper triangular and invertible.
Why does this work?
The orthogonality of $\mathbf{Q}$ means we can project $\mathbf{y}$ onto the column space of $\mathbf{X}$ by simply computing $\mathbf{Q}\mathbf{Q}^\top\mathbf{y}$. The triangular structure of $\mathbf{R}$ means we can find the coefficients by back-substitution -- no matrix inversion needed. This avoids forming $\mathbf{X}^\top\mathbf{X}$, which preserves numerical precision.
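The projection claim can be checked directly. A short sketch, reusing the example X and y from this page: $\mathbf{Q}\mathbf{Q}^\top\mathbf{y}$ (the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$) coincides with the fitted values from the QR solve.

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])
y = np.array([1.1, 2.5, 1.8, 4.2, 3.0])

Q, R = np.linalg.qr(X)
beta = np.linalg.solve(R, Q.T @ y)

# Q Q'y projects y onto the column space of X -- exactly the fitted values
fitted_via_projection = Q @ (Q.T @ y)
fitted_via_beta = X @ beta
print("Q Q'y:   ", fitted_via_projection)
print("X @ beta:", fitted_via_beta)
```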
Seeing it in bossanova¶
```python
from bossanova.internal.maths.linalg import qr_solve

result = qr_solve(X, y)
print("Coefficients:", result.coef)
print("Residuals:", result.residuals[:5])
```

```
Coefficients: [0.5573 0.6218 1.4559]
Residuals: [ 0.2318 -0.2593 -0.1874  0.2593 -0.0444]
```
Singular Value Decomposition (SVD)¶
The problem¶
QR works well when has full column rank. But what if columns are nearly collinear? For example, if two predictors are almost identical, the system is ill-conditioned and small changes in the data can produce wildly different coefficient estimates. We need a decomposition that reveals these problems.
Building intuition¶
SVD decomposes $\mathbf{X}$ into three matrices: $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$. The singular values in $\boldsymbol{\Sigma}$ tell you how “important” each direction in the predictor space is. Large singular values correspond to directions with lots of variation in the data; near-zero singular values signal collinearity problems.
```python
# SVD of our design matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(f"U shape: {U.shape} (observations x rank)")
print(f"S (singular values): {S}")
print(f"Vt shape: {Vt.shape} (rank x predictors)")

# Reconstruction: U @ diag(S) @ Vt = X
X_reconstructed = U @ np.diag(S) @ Vt
print(f"\nReconstruction error: {np.max(np.abs(X - X_reconstructed)):.2e}")
```

```
U shape: (5, 3) (observations x rank)
S (singular values): [6.0939 1.3043 0.8679]
Vt shape: (3, 3) (rank x predictors)

Reconstruction error: 2.66e-15
```
The formal statement¶
Theorem LA.4: For any matrix $\mathbf{X}$, the SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ exists and satisfies $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{X}$.
Why does this work?
The SVD always exists, even for rank-deficient matrices. $\mathbf{U}$ and $\mathbf{V}$ are orthogonal, and $\boldsymbol{\Sigma}$ is diagonal with non-negative entries. The singular values are the square roots of the eigenvalues of $\mathbf{X}^\top\mathbf{X}$. This means the SVD simultaneously diagonalizes $\mathbf{X}^\top\mathbf{X}$ and $\mathbf{X}\mathbf{X}^\top$, giving you the most informative decomposition of any matrix.
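The eigenvalue connection is easy to verify numerically. A sketch, reusing the example design matrix: the singular values of $\mathbf{X}$ match the square roots of the eigenvalues of $\mathbf{X}^\top\mathbf{X}$.

```python
import numpy as np

X = np.array([
    [1, 0.5, 0],
    [1, 1.2, 1],
    [1, 2.3, 0],
    [1, 3.1, 1],
    [1, 4.0, 0],
])

_, S, _ = np.linalg.svd(X, full_matrices=False)

# eigvalsh returns eigenvalues in ascending order; reverse to match
# the descending order of the singular values
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]
print("singular values:    ", S)
print("sqrt of eigenvalues:", np.sqrt(eigvals))
```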
Condition numbers¶
The condition number is the ratio of the largest to the smallest singular value. It measures how sensitive the solution is to perturbations in the data. A condition number near 1 means the system is well-conditioned; a condition number above 1000 is a warning sign; above $10^{10}$ means you are likely losing most of your floating-point precision.
```python
# Condition number of our well-behaved design matrix
cond_good = S[0] / S[-1]
print(f"Condition number (good X): {cond_good:.1f}")

# Now create a nearly collinear design matrix
X_collinear = np.column_stack([
    np.ones(5),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([1.001, 2.002, 3.003, 4.004, 5.005]),  # almost identical to column 1
])
_, S_bad, _ = np.linalg.svd(X_collinear, full_matrices=False)
cond_bad = S_bad[0] / S_bad[-1]
print(f"Singular values (collinear X): {S_bad}")
print(f"Condition number (collinear X): {cond_bad:.1f}")
print(f"\nThe collinear matrix is {cond_bad/cond_good:.0f}x harder to solve stably.")
```

```
Condition number (good X): 7.0
Singular values (collinear X): [10.688   0.9361  0.    ]
Condition number (collinear X): 14790687026641976.0

The collinear matrix is 2106610230376675x harder to solve stably.
```
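A useful rule of thumb (an approximation, not part of bossanova’s API): solving a linear system loses roughly $\log_{10}(\kappa)$ decimal digits of precision, where $\kappa$ is the condition number. For the collinear matrix above, that is essentially all of float64’s ~16 digits:

```python
import numpy as np

X_collinear = np.column_stack([
    np.ones(5),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([1.001, 2.002, 3.003, 4.004, 5.005]),
])
_, S_bad, _ = np.linalg.svd(X_collinear, full_matrices=False)
cond = S_bad[0] / S_bad[-1]

# Roughly log10(condition number) decimal digits are lost to rounding
digits_lost = np.log10(cond)
print(f"condition number: {cond:.2e}")
print(f"~{digits_lost:.0f} of float64's ~16 decimal digits lost")
```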
The Pseudoinverse¶
The problem¶
When $\mathbf{X}$ is rank-deficient (or nearly so), the normal equations do not have a unique solution. The standard inverse $\mathbf{X}^{-1}$ does not exist for non-square matrices, and even $(\mathbf{X}^\top\mathbf{X})^{-1}$ fails when columns are linearly dependent.
Building intuition¶
The Moore-Penrose pseudoinverse $\mathbf{X}^{+}$ generalizes matrix inversion. It gives the “best” least-squares solution even when a true inverse does not exist. The key property is that applying $\mathbf{X}$, then $\mathbf{X}^{+}$, then $\mathbf{X}$ again gives you back the original: $\mathbf{X}\mathbf{X}^{+}\mathbf{X} = \mathbf{X}$.
```python
# Pseudoinverse of our design matrix
X_pinv = np.linalg.pinv(X)
print(f"X shape: {X.shape}")
print(f"X+ shape: {X_pinv.shape}")

# Key property: X @ X+ @ X = X
reconstructed = X @ X_pinv @ X
print(f"\nX @ X+ @ X (should equal X):")
print(reconstructed)
print(f"Max error: {np.max(np.abs(X - reconstructed)):.2e}")
```

```
X shape: (5, 3)
X+ shape: (3, 5)

X @ X+ @ X (should equal X):
[[ 1.   0.5  0. ]
 [ 1.   1.2  1. ]
 [ 1.   2.3  0. ]
 [ 1.   3.1  1. ]
 [ 1.   4.  -0. ]]
Max error: 2.66e-15
```
The formal statement¶
Theorem LA.5: For any matrix $\mathbf{X}$, the pseudoinverse $\mathbf{X}^{+}$ satisfies $\mathbf{X}\mathbf{X}^{+}\mathbf{X} = \mathbf{X}$.
Why does this work?
$\mathbf{X}^{+}$ inverts $\mathbf{X}$ on its column space and projects to zero on the null space. Applying $\mathbf{X}\mathbf{X}^{+}\mathbf{X}$ recovers the original because $\mathbf{X}\mathbf{X}^{+}$ is itself a projection onto the column space of $\mathbf{X}$. Anything in the column space passes through unchanged, and anything outside it was never representable by $\mathbf{X}$ in the first place.
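To make this concrete, here is a sketch applying `np.linalg.pinv` to the collinear matrix from the condition-number example, where $(\mathbf{X}^\top\mathbf{X})^{-1}$ would be numerically meaningless. The pseudoinverse truncates the near-zero singular value and returns the minimum-norm least-squares solution; the residuals stay orthogonal to the column space, as least-squares residuals should.

```python
import numpy as np

X_collinear = np.column_stack([
    np.ones(5),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([1.001, 2.002, 3.003, 4.004, 5.005]),  # collinear with column 1
])
y = np.array([1.1, 2.5, 1.8, 4.2, 3.0])

# pinv drops near-zero singular values, yielding the minimum-norm
# least-squares solution instead of an exploding "exact" inverse
beta = np.linalg.pinv(X_collinear) @ y
fitted = X_collinear @ beta
print("beta:", beta)
print("residual norm:", np.linalg.norm(y - fitted))
```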
Log-Determinant via Cholesky¶
The problem¶
Mixed models need the log-determinant of large covariance matrices for likelihood evaluation. Computing the determinant directly is numerically dangerous (it can overflow or underflow for large matrices) and expensive. We need a stable, efficient alternative.
Building intuition¶
For a positive-definite matrix $\mathbf{A}$, the Cholesky factorization $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$ gives us the determinant cheaply. Since $\det(\mathbf{A}) = \det(\mathbf{L})^2$ and $\det(\mathbf{L})$ is just the product of its diagonal entries (because $\mathbf{L}$ is triangular), the log-determinant becomes a simple sum of logarithms.
```python
# X'X is positive definite (when X has full column rank)
A = X.T @ X
L = np.linalg.cholesky(A)
logdet_direct = np.log(np.linalg.det(A))
logdet_cholesky = 2 * np.sum(np.log(np.diag(L)))
print(f"Direct: {logdet_direct:.6f}")
print(f"Cholesky: {logdet_cholesky:.6f}")
print(f"Match: {np.isclose(logdet_direct, logdet_cholesky)}")
```

```
Direct: 3.862623
Cholesky: 3.862623
Match: True
```
The formal statement¶
Theorem LA.7: For positive definite $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$, $\log\det(\mathbf{A}) = 2\sum_{i}\log L_{ii}$.
Why does this work?
$\det(\mathbf{A}) = \det(\mathbf{L})\det(\mathbf{L}^\top) = \prod_i L_{ii}^2$. Taking logs converts the product to a sum: $\log\det(\mathbf{A}) = 2\sum_i \log L_{ii}$. Once the Cholesky factorization is done, this reduces the determinant computation to an $O(n)$ sum. Working in log space also avoids the overflow and underflow problems that plague direct determinant computation for large matrices.
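The overflow problem is easy to demonstrate. A sketch with a large, deliberately simple positive-definite matrix (illustrative, not drawn from bossanova): its determinant exceeds float64’s range, but the log-determinant is perfectly representable.

```python
import numpy as np

# 1200 x 1200 positive definite matrix with det = 2**1200, which
# overflows float64 (max ~1.8e308)
A = 2.0 * np.eye(1200)

print("direct det:", np.linalg.det(A))    # overflows to inf

L = np.linalg.cholesky(A)                 # L = sqrt(2) * I
logdet = 2 * np.sum(np.log(np.diag(L)))   # = 1200 * log(2)
print("log-determinant via Cholesky:", logdet)
```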
Summary¶
| Theorem | Statement | What it means |
|---|---|---|
| LA.1--3 | QR decomposition $\mathbf{X} = \mathbf{Q}\mathbf{R}$ exists with $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ | Efficient, stable linear system solving |
| LA.4 | SVD reconstruction $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{X}$ | Universal decomposition, reveals rank |
| LA.5 | Pseudoinverse $\mathbf{X}\mathbf{X}^{+}\mathbf{X} = \mathbf{X}$ | Generalized inverse for rank-deficient systems |
| LA.7 | $\log\det(\mathbf{A}) = 2\sum_i \log L_{ii}$ | Stable log-determinants for mixed models |