OLS is the workhorse of statistical modeling. When you run a regression --- in any language, any package --- OLS is almost certainly doing the work under the hood. Whether you’re fitting a simple line through a scatterplot or estimating coefficients in a model with dozens of predictors, the same mathematical machinery applies.
This page shows what OLS is actually computing and why the solution has the properties it does. We start with the fitting problem, derive the normal equations, and then explore the hat matrix --- the object that connects OLS to leverage, diagnostics, and degrees of freedom. Every theorem is backed by a property-based test in bossanova.
The fitting problem¶
What does “fitting a line” mean mathematically? We have data $(X, y)$ and want to find coefficients $\beta$ that minimize the sum of squared residuals:

$$\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^2$$
The design matrix $X \in \mathbb{R}^{n \times p}$ has one row per observation and one column per predictor (including the intercept column of ones). The response $y \in \mathbb{R}^n$ is the vector we’re trying to predict.
# Generate a small example: y = 3 + 2*x + noise
n = 20
x = np.random.randn(n)
X = np.column_stack([np.ones(n), x]) # intercept + one predictor
y = 3 + 2 * x + np.random.randn(n) * 0.5
print(f"Design matrix X: {X.shape} (n={n} observations, p={X.shape[1]} parameters)")
print(f"Response y: {y.shape}")

Design matrix X: (20, 2) (n=20 observations, p=2 parameters)
Response y: (20,)
Normal equations¶
The problem¶
How do we find the $\beta$ that minimizes $\lVert y - X\beta \rVert^2$?
Building intuition¶
Taking the derivative of the squared loss with respect to $\beta$ and setting it to zero gives the normal equations:

$$X^\top X \beta = X^\top y \quad\Longleftrightarrow\quad X^\top (y - X\beta) = 0$$
This says: the residuals must be orthogonal to every predictor column. If any column of $X$ were correlated with the residuals, we could improve the fit by adjusting the corresponding coefficient. At the optimum, there’s no more linear signal left to extract.
# Solve via normal equations directly: beta = inv(X'X) @ X'y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print("Coefficients (true: [3, 2]):", beta)
# Verify: X'(y - X*beta) should be near zero
residuals = y - X @ beta
print("X'e (should be ~0):", X.T @ residuals)

Coefficients (true: [3, 2]): [2.8534 1.9207]
X'e (should be ~0): [ 0. -0.]
The geometric picture¶
OLS projects $y$ onto the column space of $X$, written $\operatorname{col}(X)$. The fitted values $\hat{y} = X\hat{\beta}$ are the closest point in $\operatorname{col}(X)$ to $y$. The residuals $e = y - \hat{y}$ are perpendicular to $\operatorname{col}(X)$.
# Solve via QR decomposition (numerically stable)
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)
residuals = y - X @ beta_qr
print("Coefficients via QR:", beta_qr)
print("X'e (should be ~0):", X.T @ residuals)

Coefficients via QR: [2.8534 1.9207]
X'e (should be ~0): [ 0. -0.]
The formal statement¶
Theorem OLS.1: At the OLS solution, $X^\top (y - X\hat{\beta}) = 0$.
Why does this work?
Geometrically, OLS projects $y$ onto $\operatorname{col}(X)$. The residual vector $e = y - X\hat{\beta}$ lies in the orthogonal complement of $\operatorname{col}(X)$, so $X^\top e = 0$. This is the first-order optimality condition --- at the minimum, the gradient of the squared loss is zero.
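The first-order condition can be checked numerically: the gradient of the squared loss, $-2X^\top(y - X\beta)$, vanishes at the solution, and perturbing any coefficient increases the loss. A minimal sketch, using freshly generated data (the seed, sample size, and true coefficients are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
y = 3 + 2 * x + 0.5 * rng.standard_normal(n)

# Solve the least-squares problem.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def sse(beta):
    r = y - X @ beta
    return r @ r

# Gradient of the squared loss is -2 X'(y - X beta); it vanishes at beta_hat.
grad = -2 * X.T @ (y - X @ beta_hat)
print("Gradient at optimum:", grad)

# Nudging either coefficient away from the optimum can only increase the loss.
for direction in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    increase = sse(beta_hat + 1e-3 * direction) - sse(beta_hat)
    print(f"SSE increase from perturbation: {increase:.2e}")
```

Because the squared loss is a strictly convex quadratic (when $X$ has full column rank), the zero-gradient point is the unique global minimum.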
Column-by-column orthogonality¶
Theorem OLS.4: Each column of $X$ is orthogonal to the residuals: $x_j^\top e = 0$ for every $j$.
This is OLS.1 stated column-by-column. Instead of checking the whole matrix product at once, we can verify that each individual predictor has zero correlation with the residuals.
# Show that X[:,j].T @ residuals ≈ 0 for each column
for j in range(X.shape[1]):
dot = X[:, j].T @ residuals
label = "intercept" if j == 0 else f"x_{j}"
print(f" {label}' @ e = {dot:.2e}")

 intercept' @ e = 1.22e-14
 x_1' @ e = -2.46e-15
Why does this matter?
If any column were correlated with the residuals, we could improve the fit by adjusting that coefficient. Orthogonality means we’ve extracted all the linear signal from every predictor. This is what makes OLS optimal among linear estimators.
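The “we could improve the fit” claim is easy to demonstrate: start from a deliberately wrong slope, observe that the residuals now correlate with the predictor, and show that the exact one-coordinate correction both zeroes that correlation and lowers the sum of squared errors. A sketch with illustrative generated data (all numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
y = 3 + 2 * x + 0.5 * rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = lambda b: np.sum((y - X @ b) ** 2)

# Deliberately wrong slope: the residuals now correlate with x.
beta_bad = beta_hat + np.array([0.0, 0.5])
r_bad = y - X @ beta_bad
print("x'e with wrong slope:", x @ r_bad)  # far from zero

# The slope update that zeroes the correlation with x.
delta = (x @ r_bad) / (x @ x)
beta_adj = beta_bad.copy()
beta_adj[1] += delta

print("x'e after adjustment:", x @ (y - X @ beta_adj))  # ~0 again
print("SSE before:", sse(beta_bad), " after:", sse(beta_adj))
```

The SSE drop is exactly $(x^\top e)^2 / (x^\top x)$, so a nonzero correlation always leaves fit on the table.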
The hat matrix¶
The problem¶
The hat matrix $H = X(X^\top X)^{-1}X^\top$ “puts a hat on $y$” --- it maps $y$ to $\hat{y}$. Understanding $H$ is the key to diagnostics, leverage, and degrees of freedom.
Building intuition¶
$H$ is a projection matrix. It projects any vector onto the column space of $X$. The fitted values are exactly $\hat{y} = Hy$, and the residuals are $e = (I - H)y$.
# Compute the hat matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(f"Hat matrix shape: {H.shape} (n x n)")
# Verify: H @ y gives the same fitted values as X @ beta
y_hat_from_H = H @ y
y_hat_from_beta = X @ beta_qr
print("Hy == Xbeta?", np.allclose(y_hat_from_H, y_hat_from_beta))

Hat matrix shape: (20, 20) (n x n)
Hy == Xbeta? True
Idempotence: H squared equals H¶
Theorem OLS.5: $H^2 = H$ (projection property).
Projecting twice onto the same subspace gives the same result as projecting once. Once a vector is already in $\operatorname{col}(X)$, projecting it again doesn’t move it.
# H @ H should equal H
H_squared = H @ H
print("H^2 == H?", np.allclose(H_squared, H))
print("Max difference:", np.max(np.abs(H_squared - H)))

H^2 == H? True
Max difference: 4.163336342344337e-17
Why does this work?
If $v = Hy$, then $Hv = H^2 y = Hy = v$. Projecting the fitted values doesn’t change them --- they’re already in the subspace.
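That identity can be checked directly: applying $H$ to the fitted values returns them unchanged. A quick sketch on illustrative generated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.standard_normal(n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

# y_hat already lies in col(X), so projecting again is a no-op.
print("H(Hy) == Hy?", np.allclose(H @ y_hat, y_hat))
```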
Trace equals number of parameters¶
Theorem OLS.6: $\operatorname{tr}(H) = p$.
The trace of the hat matrix equals the number of parameters. This is why the trace shows up in degrees-of-freedom calculations everywhere in statistics.
p = X.shape[1]
print(f"tr(H) = {np.trace(H):.4f}")
print(f"p = {p}")
print(f"Match? {np.isclose(np.trace(H), p)}")tr(H) = 2.0000
p = 2
Match? True
Why does this work?
The trace of a projection matrix equals the dimension of the subspace it projects onto. $H$ projects onto $\operatorname{col}(X)$, which has dimension $p$ (when $X$ has full column rank). So $\operatorname{tr}(H) = \operatorname{rank}(H) = p$.
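The trace–rank connection can be verified numerically for any full-rank design: both equal $p$. A sketch with an arbitrary generated design matrix (sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
X = rng.standard_normal((n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T

# The trace is the sum of the eigenvalues (p ones, n-p zeros),
# so it counts the dimension of the projected-onto subspace.
print("tr(H):", np.trace(H))
print("rank(H):", np.linalg.matrix_rank(H))
```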
Binary eigenvalues¶
Theorem OLS.7: The eigenvalues of $H$ are all 0 or 1.
A projection matrix can only scale vectors by 0 (annihilate them) or 1 (leave them unchanged). There’s no in-between.
eigenvalues = np.linalg.eigvalsh(H)
print("Eigenvalues:", np.sort(eigenvalues)[::-1])
print(f"Number near 1: {np.sum(np.isclose(eigenvalues, 1))} (should be p={p})")
print(f"Number near 0: {np.sum(np.isclose(eigenvalues, 0))} (should be n-p={n - p})")

Eigenvalues: [ 1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -0. -0. -0. -0. -0. -0.
 -0. -0.]
Number near 1: 2 (should be p=2)
Number near 0: 18 (should be n-p=18)
Why does this work?
$H$ projects vectors onto a $p$-dimensional subspace. Vectors in the subspace are unchanged (eigenvalue 1). Vectors orthogonal to it are annihilated (eigenvalue 0). There are $p$ eigenvalues of 1 and $n - p$ eigenvalues of 0 --- no other values are possible for a projection matrix.
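This eigenvalue picture can be seen by splitting $y$ into its two components, $Hy$ and $(I - H)y$: the first is fixed by $H$ (eigenvalue-1 behavior) and the second is annihilated (eigenvalue-0 behavior). A sketch on illustrative generated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
H = X @ np.linalg.inv(X.T @ X) @ X.T
y = rng.standard_normal(n)

inside = H @ y                   # component in col(X)
outside = (np.eye(n) - H) @ y    # component orthogonal to col(X)

print("H keeps the inside part: ", np.allclose(H @ inside, inside))
print("H kills the outside part:", np.allclose(H @ outside, 0))
```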
Seeing it in bossanova¶
from bossanova.internal.maths.linalg import qr_solve
result = qr_solve(X, y)
y_hat = X @ result.coef
residuals = y - y_hat
# Verify normal equations
print("X'e (should be ~0):", X.T @ residuals)
print("Coefficients:", result.coef)

X'e (should be ~0): [0. 0.]
Coefficients: [2.8534 1.9207]
Summary¶
| Theorem | Statement | What it means |
|---|---|---|
| OLS.1 | $X^\top e = 0$ | Residuals orthogonal to predictors |
| OLS.4 | $x_j^\top e = 0$ for each column $j$ | Each predictor uncorrelated with residuals |
| OLS.5 | $H^2 = H$ | Hat matrix is a projection |
| OLS.6 | $\operatorname{tr}(H) = p$ | Trace equals number of parameters |
| OLS.7 | Eigenvalues of $H$ are 0 or 1 | Binary spectrum of projection |