
Ordinary Least Squares


OLS is the workhorse of statistical modeling. When you run a regression --- in any language, any package --- OLS is almost certainly doing the work under the hood. Whether you’re fitting a simple line through a scatterplot or estimating coefficients in a model with dozens of predictors, the same mathematical machinery applies.

This page shows what OLS is actually computing and why the solution has the properties it does. We start with the fitting problem, derive the normal equations, and then explore the hat matrix --- the object that connects OLS to leverage, diagnostics, and degrees of freedom. Every theorem is backed by a property-based test in bossanova.

The fitting problem

What does “fitting a line” mean mathematically? We have data $(\mathbf{X}, \mathbf{y})$ and want to find coefficients $\boldsymbol{\beta}$ that minimize the sum of squared residuals:

$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$$

The design matrix $\mathbf{X}$ has one row per observation and one column per predictor (including the intercept column of ones). The response $\mathbf{y}$ is the vector we’re trying to predict.

import numpy as np

# Generate a small example: y = 3 + 2*x + noise
n = 20
x = np.random.randn(n)
X = np.column_stack([np.ones(n), x])  # intercept + one predictor
y = 3 + 2 * x + np.random.randn(n) * 0.5

print(f"Design matrix X: {X.shape}  (n={n} observations, p={X.shape[1]} parameters)")
print(f"Response y:      {y.shape}")
Design matrix X: (20, 2)  (n=20 observations, p=2 parameters)
Response y:      (20,)

Normal equations

The problem

How do we find the $\boldsymbol{\beta}$ that minimizes $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$?

Building intuition

Taking the derivative of the squared loss with respect to $\boldsymbol{\beta}$ and setting it to zero gives the normal equations:

$$\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$$

This says: the residuals must be orthogonal to every predictor column. If any column of $\mathbf{X}$ were correlated with the residuals, we could improve the fit by adjusting the corresponding coefficient. At the optimum, there’s no more linear signal left to extract.
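For completeness, the calculus behind this: expand the squared loss and differentiate with respect to $\boldsymbol{\beta}$.

$$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}, \qquad \nabla_{\boldsymbol{\beta}} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$$

Setting the gradient to zero and rearranging gives $\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}$, the form in which the normal equations are usually solved.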

# Solve via normal equations directly: beta = inv(X'X) @ X'y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print("Coefficients (true: [3, 2]):", beta)

# Verify: X'(y - X*beta) should be near zero
residuals = y - X @ beta
print("X'e (should be ~0):", X.T @ residuals)
Coefficients (true: [3, 2]): [2.8534 1.9207]
X'e (should be ~0): [ 0. -0.]
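As a quick cross-check (an addition here, not in the original listing), NumPy’s built-in least-squares solver should recover the same coefficients:

# Cross-check against NumPy's SVD-based least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq agrees with normal equations?", np.allclose(beta, beta_lstsq))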

The geometric picture

OLS projects $\mathbf{y}$ onto the column space of $\mathbf{X}$. The fitted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ are the closest point in $\text{col}(\mathbf{X})$ to $\mathbf{y}$. The residuals $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ are perpendicular to $\text{col}(\mathbf{X})$.

# Solve via QR decomposition (numerically stable)
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

residuals = y - X @ beta_qr
print("Coefficients via QR:", beta_qr)
print("X'e (should be ~0):", X.T @ residuals)
Coefficients via QR: [2.8534 1.9207]
X'e (should be ~0): [ 0. -0.]
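A numerical consequence of this perpendicularity is the Pythagorean decomposition $\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2$. A quick sketch, reusing the arrays above:

# Pythagoras: squared lengths add because y_hat and e are orthogonal
y_hat = X @ beta_qr
print("||y||^2              :", y @ y)
print("||y_hat||^2 + ||e||^2:", y_hat @ y_hat + residuals @ residuals)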

The formal statement

Theorem OLS.1: At the OLS solution, $\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$.

Why does this work?

Geometrically, OLS projects $\mathbf{y}$ onto $\text{col}(\mathbf{X})$. The residual vector $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ lies in the orthogonal complement of $\text{col}(\mathbf{X})$, so $\mathbf{X}^\top\mathbf{e} = \mathbf{0}$. This is the first-order optimality condition: at the minimum, the gradient is zero.

Column-by-column orthogonality

Theorem OLS.4: Each column of $\mathbf{X}$ is orthogonal to the residuals.

This is OLS.1 stated column-by-column. Instead of checking the whole matrix product at once, we can verify that each individual predictor has zero correlation with the residuals.

# Show that X[:,j].T @ residuals ≈ 0 for each column
for j in range(X.shape[1]):
    dot = X[:, j].T @ residuals
    label = "intercept" if j == 0 else f"x_{j}"
    print(f"  {label}' @ e = {dot:.2e}")
  intercept' @ e = 1.22e-14
  x_1' @ e = -2.46e-15
Why does this matter?

If any column were correlated with the residuals, we could improve the fit by adjusting that coefficient. Orthogonality means we’ve extracted all the linear signal from every predictor: no adjustment of the coefficients can reduce the residual sum of squares any further. Under the classical Gauss-Markov assumptions, this is also what makes OLS the best linear unbiased estimator.
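To make that concrete: nudging any coefficient away from $\hat{\boldsymbol{\beta}}$ can only increase the residual sum of squares. A minimal sketch, reusing the fit from above:

# Any perturbation of the coefficients increases the SSE
sse_opt = residuals @ residuals
beta_nudged = beta_qr.copy()
beta_nudged[1] += 0.1                    # move the slope off its optimum
e_nudged = y - X @ beta_nudged
print("SSE at optimum :", sse_opt)
print("SSE after nudge:", e_nudged @ e_nudged)  # strictly larger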


The hat matrix

The problem

The hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ “puts a hat on $\mathbf{y}$”: it maps $\mathbf{y}$ to $\hat{\mathbf{y}}$. Understanding $\mathbf{H}$ is the key to diagnostics, leverage, and degrees of freedom.

Building intuition

$\mathbf{H}$ is a projection matrix. It projects any vector onto the column space of $\mathbf{X}$. The fitted values are exactly $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$, and the residuals are $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$.

# Compute the hat matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(f"Hat matrix shape: {H.shape}  (n x n)")

# Verify: H @ y gives the same fitted values as X @ beta
y_hat_from_H = H @ y
y_hat_from_beta = X @ beta_qr
print("Hy == Xbeta?", np.allclose(y_hat_from_H, y_hat_from_beta))
Hat matrix shape: (20, 20)  (n x n)
Hy == Xbeta? True
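The residual identity $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ can be verified the same way, along with symmetry, the other defining property of an orthogonal projection:

# Residuals via the complementary projector, plus symmetry of H
print("(I - H)y == e?", np.allclose((np.eye(n) - H) @ y, residuals))
print("H symmetric?  ", np.allclose(H, H.T))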

Idempotence: H squared equals H

Theorem OLS.5: $\mathbf{H}^2 = \mathbf{H}$ (projection property).

Projecting twice onto the same subspace gives the same result as projecting once. Once a vector is already in $\text{col}(\mathbf{X})$, projecting it again doesn’t move it.

# H @ H should equal H
H_squared = H @ H
print("H^2 == H?", np.allclose(H_squared, H))
print("Max difference:", np.max(np.abs(H_squared - H)))
H^2 == H? True
Max difference: 4.163336342344337e-17
Why does this work?

If $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$, then $\mathbf{H}\hat{\mathbf{y}} = \mathbf{H}(\mathbf{H}\mathbf{y}) = \mathbf{H}^2\mathbf{y} = \mathbf{H}\mathbf{y} = \hat{\mathbf{y}}$. Projecting the fitted values doesn’t change them: they’re already in the subspace.
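The complementary matrix $\mathbf{M} = \mathbf{I} - \mathbf{H}$, which produces the residuals, is idempotent for the same reason, and the two projectors annihilate each other. A short check:

# M projects onto the orthogonal complement of col(X)
M = np.eye(n) - H
print("M^2 == M?  ", np.allclose(M @ M, M))
print("H @ M == 0?", np.allclose(H @ M, 0))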

Trace equals number of parameters

Theorem OLS.6: $\text{tr}(\mathbf{H}) = p$.

The trace of the hat matrix equals the number of parameters. This is why the trace shows up in degrees-of-freedom calculations everywhere in statistics.

p = X.shape[1]
print(f"tr(H) = {np.trace(H):.4f}")
print(f"p     = {p}")
print(f"Match? {np.isclose(np.trace(H), p)}")
tr(H) = 2.0000
p     = 2
Match? True
Why does this work?

The trace of a projection matrix equals the dimension of the subspace it projects onto. $\mathbf{H}$ projects onto $\text{col}(\mathbf{X})$, which has dimension $p$ (when $\mathbf{X}$ has full column rank). So $\text{tr}(\mathbf{H}) = p$.
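The diagonal entries $h_{ii}$ of $\mathbf{H}$ are the leverages mentioned at the start of this section: they sum to $\text{tr}(\mathbf{H}) = p$, and with an intercept each lies in $[1/n, 1]$. The residual degrees of freedom $n - \text{tr}(\mathbf{H}) = n - p$ is likewise what the usual variance estimator divides by. A quick look at this example:

# Leverages are the diagonal of H; they sum to tr(H) = p
leverages = np.diag(H)
print("sum of leverages :", leverages.sum())
print("min, max leverage:", leverages.min(), leverages.max())

# Residual degrees of freedom n - tr(H) = n - p
sigma2_hat = (residuals @ residuals) / (n - p)
print("unbiased variance estimate:", sigma2_hat)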

Binary eigenvalues

Theorem OLS.7: The eigenvalues of $\mathbf{H}$ are all 0 or 1.

A projection matrix can only scale vectors by 0 (annihilate them) or 1 (leave them unchanged). There’s no in-between.

eigenvalues = np.linalg.eigvalsh(H)
print("Eigenvalues:", np.sort(eigenvalues)[::-1])
print(f"Number near 1: {np.sum(np.isclose(eigenvalues, 1))}  (should be p={p})")
print(f"Number near 0: {np.sum(np.isclose(eigenvalues, 0))}  (should be n-p={n - p})")
Eigenvalues: [ 1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -0. -0. -0. -0. -0. -0.
 -0. -0.]
Number near 1: 2  (should be p=2)
Number near 0: 18  (should be n-p=18)
Why does this work?

$\mathbf{H}$ projects vectors onto a $p$-dimensional subspace. Vectors in the subspace are unchanged (eigenvalue 1). Vectors orthogonal to it are annihilated (eigenvalue 0). There are $p$ eigenvalues of 1 and $(n - p)$ eigenvalues of 0; no other values are possible for a projection matrix.
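The one-line argument uses only idempotence: if $\mathbf{v}$ is an eigenvector of $\mathbf{H}$ with eigenvalue $\lambda$, then

$$\lambda\mathbf{v} = \mathbf{H}\mathbf{v} = \mathbf{H}^2\mathbf{v} = \lambda^2\mathbf{v} \implies \lambda(\lambda - 1) = 0.$$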

Seeing it in bossanova

from bossanova.internal.maths.linalg import qr_solve

result = qr_solve(X, y)
y_hat = X @ result.coef
residuals = y - y_hat

# Verify normal equations
print("X'e (should be ~0):", X.T @ residuals)
print("Coefficients:", result.coef)
X'e (should be ~0): [0. 0.]
Coefficients: [2.8534 1.9207]
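A final sanity check (an addition here, assuming result.coef as printed above): the library’s solution should match the direct NumPy computations from earlier on this page:

# bossanova's QR solution should agree with the manual solves above
print("Matches manual QR solve?", np.allclose(result.coef, beta_qr))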

Summary

| Theorem | Statement | What it means |
|---------|-----------|---------------|
| OLS.1 | $\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$ | Residuals orthogonal to predictors |
| OLS.4 | Column-wise orthogonality | Each predictor uncorrelated with residuals |
| OLS.5 | $\mathbf{H}^2 = \mathbf{H}$ | Hat matrix is a projection |
| OLS.6 | $\text{tr}(\mathbf{H}) = p$ | Trace equals number of parameters |
| OLS.7 | Eigenvalues $\in \{0, 1\}$ | Binary spectrum of projection |