
Ordinary Least Squares


OLS is the workhorse of statistical modeling. When you run a regression --- in any language, any package --- OLS is almost certainly doing the work under the hood. Whether you’re fitting a simple line through a scatterplot or estimating coefficients in a model with dozens of predictors, the same mathematical machinery applies.

This page shows what OLS is actually computing and why the solution has the properties it does. We start with the fitting problem, derive the normal equations, and then explore the hat matrix --- the object that connects OLS to leverage, diagnostics, and degrees of freedom. Every theorem is backed by a property-based test in bossanova.

The fitting problem

What does “fitting a line” mean mathematically? We have data $(\mathbf{X}, \mathbf{y})$ and want to find coefficients $\boldsymbol{\beta}$ that minimize the sum of squared residuals:

$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$$

The design matrix $\mathbf{X}$ has one row per observation and one column per predictor (including the intercept column of ones). The response $\mathbf{y}$ is the vector we’re trying to predict.

import numpy as np

# Generate a small example: y = 3 + 2*x + noise
n = 20
x = np.random.randn(n)
X = np.column_stack([np.ones(n), x])  # intercept + one predictor
y = 3 + 2 * x + np.random.randn(n) * 0.5

print(f"Design matrix X: {X.shape}  (n={n} observations, p={X.shape[1]} parameters)")
print(f"Response y:      {y.shape}")
Design matrix X: (20, 2)  (n=20 observations, p=2 parameters)
Response y:      (20,)

Normal equations

The problem

How do we find the $\boldsymbol{\beta}$ that minimizes $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$?

Building intuition

Taking the derivative of the squared loss with respect to $\boldsymbol{\beta}$ and setting it to zero gives the normal equations:

$$\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$$

This says: the residuals must be orthogonal to every predictor column. If any column of $\mathbf{X}$ were correlated with the residuals, we could improve the fit by adjusting the corresponding coefficient. At the optimum, there’s no more linear signal left to extract.
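For completeness, the calculus behind this: expand the squared loss and differentiate with respect to $\boldsymbol{\beta}$.

$$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}, \qquad \nabla_{\boldsymbol{\beta}} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$$

Setting the gradient to zero and rearranging gives $\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}$, the form in which the normal equations are usually solved.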

# Solve via normal equations directly: beta = inv(X'X) @ X'y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print("Coefficients (true: [3, 2]):", beta)

# Verify: X'(y - X*beta) should be near zero
residuals = y - X @ beta
print("X'e (should be ~0):", X.T @ residuals)
Coefficients (true: [3, 2]): [2.8534 1.9207]
X'e (should be ~0): [ 0. -0.]
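As a quick cross-check (an addition here, not in the original listing), NumPy’s built-in least-squares solver should recover the same coefficients:

# Cross-check against NumPy's SVD-based least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq agrees with normal equations?", np.allclose(beta, beta_lstsq))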

The geometric picture

OLS projects $\mathbf{y}$ onto the column space of $\mathbf{X}$. The fitted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ are the closest point in $\text{col}(\mathbf{X})$ to $\mathbf{y}$. The residuals $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ are perpendicular to $\text{col}(\mathbf{X})$.

# Solve via QR decomposition (numerically stable)
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

residuals = y - X @ beta_qr
print("Coefficients via QR:", beta_qr)
print("X'e (should be ~0):", X.T @ residuals)
Coefficients via QR: [2.8534 1.9207]
X'e (should be ~0): [ 0. -0.]
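A numerical consequence of this perpendicularity is the Pythagorean decomposition $\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2$. A quick sketch, reusing the arrays above:

# Pythagoras: squared lengths add because y_hat and e are orthogonal
y_hat = X @ beta_qr
print("||y||^2              :", y @ y)
print("||y_hat||^2 + ||e||^2:", y_hat @ y_hat + residuals @ residuals)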

The formal statement

Theorem OLS.1: At the OLS solution, $\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$.

Why does this work?

Geometrically, OLS projects $\mathbf{y}$ onto $\text{col}(\mathbf{X})$. The residual vector $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ lies in the orthogonal complement of $\text{col}(\mathbf{X})$, so $\mathbf{X}^\top\mathbf{e} = \mathbf{0}$. This is the first-order optimality condition: at the minimum, the gradient is zero.

Column-by-column orthogonality

Theorem OLS.4: Each column of $\mathbf{X}$ is orthogonal to the residuals.

This is OLS.1 stated column-by-column. Instead of checking the whole matrix product at once, we can verify that each individual predictor has zero correlation with the residuals.

# Show that X[:,j].T @ residuals ≈ 0 for each column
for j in range(X.shape[1]):
    dot = X[:, j].T @ residuals
    label = "intercept" if j == 0 else f"x_{j}"
    print(f"  {label}' @ e = {dot:.2e}")
  intercept' @ e = 1.22e-14
  x_1' @ e = -2.46e-15
Why does this matter?

If any column were correlated with the residuals, we could improve the fit by adjusting that coefficient. Orthogonality means we’ve extracted all the linear signal from every predictor: no adjustment of the coefficients can reduce the residual sum of squares any further. Under the classical Gauss-Markov assumptions, this is also what makes OLS the best linear unbiased estimator.
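To make that concrete: nudging any coefficient away from $\hat{\boldsymbol{\beta}}$ can only increase the residual sum of squares. A minimal sketch, reusing the fit from above:

# Any perturbation of the coefficients increases the SSE
sse_opt = residuals @ residuals
beta_nudged = beta_qr.copy()
beta_nudged[1] += 0.1                    # move the slope off its optimum
e_nudged = y - X @ beta_nudged
print("SSE at optimum :", sse_opt)
print("SSE after nudge:", e_nudged @ e_nudged)  # strictly larger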


The hat matrix

The problem

The hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ “puts a hat on $\mathbf{y}$”: it maps $\mathbf{y}$ to $\hat{\mathbf{y}}$. Understanding $\mathbf{H}$ is the key to diagnostics, leverage, and degrees of freedom.

Building intuition

$\mathbf{H}$ is a projection matrix. It projects any vector onto the column space of $\mathbf{X}$. The fitted values are exactly $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$, and the residuals are $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$.

# Compute the hat matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(f"Hat matrix shape: {H.shape}  (n x n)")

# Verify: H @ y gives the same fitted values as X @ beta
y_hat_from_H = H @ y
y_hat_from_beta = X @ beta_qr
print("Hy == Xbeta?", np.allclose(y_hat_from_H, y_hat_from_beta))
Hat matrix shape: (20, 20)  (n x n)
Hy == Xbeta? True
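The residual identity $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ can be verified the same way, along with symmetry, the other defining property of an orthogonal projection:

# Residuals via the complementary projector, plus symmetry of H
print("(I - H)y == e?", np.allclose((np.eye(n) - H) @ y, residuals))
print("H symmetric?  ", np.allclose(H, H.T))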

Idempotence: H squared equals H

Theorem OLS.5: $\mathbf{H}^2 = \mathbf{H}$ (projection property).

Projecting twice onto the same subspace gives the same result as projecting once. Once a vector is already in $\text{col}(\mathbf{X})$, projecting it again doesn’t move it.

# H @ H should equal H
H_squared = H @ H
print("H^2 == H?", np.allclose(H_squared, H))
print("Max difference:", np.max(np.abs(H_squared - H)))
H^2 == H? True
Max difference: 4.163336342344337e-17
Why does this work?

If $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$, then $\mathbf{H}\hat{\mathbf{y}} = \mathbf{H}(\mathbf{H}\mathbf{y}) = \mathbf{H}^2\mathbf{y} = \mathbf{H}\mathbf{y} = \hat{\mathbf{y}}$. Projecting the fitted values doesn’t change them: they’re already in the subspace.
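The complementary matrix $\mathbf{M} = \mathbf{I} - \mathbf{H}$, which produces the residuals, is idempotent for the same reason, and the two projectors annihilate each other. A short check:

# M projects onto the orthogonal complement of col(X)
M = np.eye(n) - H
print("M^2 == M?  ", np.allclose(M @ M, M))
print("H @ M == 0?", np.allclose(H @ M, 0))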

Trace equals number of parameters

Theorem OLS.6: $\text{tr}(\mathbf{H}) = p$.

The trace of the hat matrix equals the number of parameters. This is why the trace shows up in degrees-of-freedom calculations everywhere in statistics.

p = X.shape[1]
print(f"tr(H) = {np.trace(H):.4f}")
print(f"p     = {p}")
print(f"Match? {np.isclose(np.trace(H), p)}")
tr(H) = 2.0000
p     = 2
Match? True
Why does this work?

The trace of a projection matrix equals the dimension of the subspace it projects onto. $\mathbf{H}$ projects onto $\text{col}(\mathbf{X})$, which has dimension $p$ (when $\mathbf{X}$ has full column rank). So $\text{tr}(\mathbf{H}) = p$.
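The diagonal entries $h_{ii}$ of $\mathbf{H}$ are the leverages mentioned at the start of this section: they sum to $\text{tr}(\mathbf{H}) = p$, and with an intercept each lies in $[1/n, 1]$. The residual degrees of freedom $n - \text{tr}(\mathbf{H}) = n - p$ is likewise what the usual variance estimator divides by. A quick look at this example:

# Leverages are the diagonal of H; they sum to tr(H) = p
leverages = np.diag(H)
print("sum of leverages :", leverages.sum())
print("min, max leverage:", leverages.min(), leverages.max())

# Residual degrees of freedom n - tr(H) = n - p
sigma2_hat = (residuals @ residuals) / (n - p)
print("unbiased variance estimate:", sigma2_hat)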

Binary eigenvalues

Theorem OLS.7: The eigenvalues of $\mathbf{H}$ are all 0 or 1.

A projection matrix can only scale vectors by 0 (annihilate them) or 1 (leave them unchanged). There’s no in-between.

eigenvalues = np.linalg.eigvalsh(H)
print("Eigenvalues:", np.sort(eigenvalues)[::-1])
print(f"Number near 1: {np.sum(np.isclose(eigenvalues, 1))}  (should be p={p})")
print(f"Number near 0: {np.sum(np.isclose(eigenvalues, 0))}  (should be n-p={n - p})")
Eigenvalues: [ 1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -0. -0. -0. -0. -0. -0.
 -0. -0.]
Number near 1: 2  (should be p=2)
Number near 0: 18  (should be n-p=18)
Why does this work?

$\mathbf{H}$ projects vectors onto a $p$-dimensional subspace. Vectors in the subspace are unchanged (eigenvalue 1). Vectors orthogonal to it are annihilated (eigenvalue 0). There are $p$ eigenvalues of 1 and $(n - p)$ eigenvalues of 0; no other values are possible for a projection matrix.
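The one-line argument uses only idempotence: if $\mathbf{v}$ is an eigenvector of $\mathbf{H}$ with eigenvalue $\lambda$, then

$$\lambda\mathbf{v} = \mathbf{H}\mathbf{v} = \mathbf{H}^2\mathbf{v} = \lambda^2\mathbf{v} \implies \lambda(\lambda - 1) = 0.$$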

Seeing it in bossanova

from bossanova.internal.maths.linalg import qr_solve

result = qr_solve(X, y)
y_hat = X @ result.coef
residuals = y - y_hat

# Verify normal equations
print("X'e (should be ~0):", X.T @ residuals)
print("Coefficients:", result.coef)
X'e (should be ~0): [0. 0.]
Coefficients: [2.8534 1.9207]
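A final sanity check (an addition here, assuming result.coef as printed above): the library’s solution should match the direct NumPy computations from earlier on this page:

# bossanova's QR solution should agree with the manual solves above
print("Matches manual QR solve?", np.allclose(result.coef, beta_qr))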

Summary

| Theorem | Statement | What it means |
|---------|-----------|---------------|
| OLS.1 | $\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$ | Residuals orthogonal to predictors |
| OLS.4 | Column-wise orthogonality | Each predictor uncorrelated with residuals |
| OLS.5 | $\mathbf{H}^2 = \mathbf{H}$ | Hat matrix is a projection |
| OLS.6 | $\text{tr}(\mathbf{H}) = p$ | Trace equals number of parameters |
| OLS.7 | Eigenvalues $\in \{0, 1\}$ | Binary spectrum of projection |