Regression Diagnostics

UCSD Psychology

Fitting a model is only half the job. A model might produce coefficient estimates, confidence intervals, and p-values --- but none of that means much if a handful of observations are unduly warping the results. Diagnostics tell you whether the model is trustworthy by identifying observations that exert outsized influence on the fit.

This page covers the three core tools for spotting problematic observations: leverage (how extreme is a point’s position in predictor space?), studentized residuals (how poorly does the model predict a point, adjusted for its leverage?), and Cook’s distance (how much do the fitted coefficients change if we remove a point?). Together, they form a complete picture of observation-level influence. Every theorem is backed by a property-based test in bossanova.

Leverage: who pulls the hardest?

The problem

Not all observations are created equal. A data point far from the center of the predictor space has more influence on the regression line --- it acts as a fulcrum, pulling the fitted line toward itself. Leverage quantifies exactly how much pull each observation has.

Building intuition

Recall the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ that maps $\mathbf{y}$ to $\hat{\mathbf{y}}$. The diagonal element $h_i = \mathbf{H}_{ii}$ measures how much observation $i$’s $y$-value influences its own fitted value $\hat{y}_i$. Points with high leverage dominate their own prediction.

import numpy as np

n = 30
x = np.concatenate([np.random.normal(0, 1, n - 1), [10.0]])  # one extreme point
X = np.column_stack([np.ones(n), x])  # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
leverage = np.diag(H)

print(f"Typical leverage: {np.mean(leverage[:-1]):.4f}")
print(f"Extreme point:    {leverage[-1]:.4f}")
print(f"Average leverage:  {np.mean(leverage):.4f}")
Typical leverage: 0.0408
Extreme point:    0.8166
Average leverage:  0.0667

Leverage sum equals p

Theorem DX.1: $\sum h_i = p$, where $p$ is the number of parameters.

p = X.shape[1]
print(f"sum(leverage) = {np.sum(leverage):.4f}")
print(f"p             = {p}")
print(f"Match?          {np.isclose(np.sum(leverage), p)}")
sum(leverage) = 2.0000
p             = 2
Match?          True
Why does this work?

Since $\sum h_i = \text{tr}(\mathbf{H})$ and $\text{tr}(\mathbf{H}) = p$ (from OLS.6), the total leverage budget is fixed. Each observation “consumes” some of this budget. High-leverage points consume a disproportionate share --- and every unit of leverage they take comes at the expense of the remaining observations.
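
The fixed budget is easy to see directly: pushing the extreme point even further out takes leverage away from everyone else while the total stays at $p$. A quick sketch, reusing `x`, `n`, and `leverage` from above; the more extreme x-value of 100 is just an illustrative choice.

x_far = x.copy()
x_far[-1] = 100.0  # push the extreme point much further out
X_far = np.column_stack([np.ones(n), x_far])
lev_far = np.diag(X_far @ np.linalg.inv(X_far.T @ X_far) @ X_far.T)

print(f"Extreme point's leverage: {leverage[-1]:.4f} -> {lev_far[-1]:.4f}")
print(f"Mean leverage of others:  {np.mean(leverage[:-1]):.4f} -> {np.mean(lev_far[:-1]):.4f}")
print(f"Total stays at p:         {np.sum(lev_far):.4f}")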

Leverage bounds

Theorem DX.2: $1/n \leq h_i \leq 1$ for all observations.

print(f"Min leverage: {np.min(leverage):.4f}  (lower bound 1/n = {1/n:.4f})")
print(f"Max leverage: {np.max(leverage):.4f}  (upper bound = 1)")
print(f"All in [1/n, 1]? {np.all((leverage >= 1/n - 1e-10) & (leverage <= 1 + 1e-10))}")
Min leverage: 0.0333  (lower bound 1/n = 0.0333)
Max leverage: 0.8166  (upper bound = 1)
All in [1/n, 1]? True

The lower bound $1/n$ means each point gets at least an equal share of the leverage budget. The upper bound 1 means a single point can at most completely determine its own fitted value --- the model passes exactly through it.

Why these bounds?

The lower bound $1/n$ comes from the intercept column: every observation contributes equally through the intercept. The upper bound 1 comes from $\mathbf{H}$ being a projection matrix (eigenvalues 0 or 1, so diagonal elements are bounded by 1).
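
Both facts can be checked numerically: $\mathbf{H}$ is symmetric and idempotent, so its eigenvalues are 0 or 1, and an intercept-only design gives every observation exactly $1/n$. A small sketch reusing `H` and `n` from above.

# H is a projection: symmetric and idempotent, so its eigenvalues are 0 or 1
print(f"Symmetric?  {np.allclose(H, H.T)}")
print(f"Idempotent? {np.allclose(H @ H, H)}")
eigvals = np.linalg.eigvalsh(H)
print(f"Eigenvalues all near 0 or 1? {np.all(np.isclose(eigvals, 0) | np.isclose(eigvals, 1))}")

# With only an intercept column, H = (1/n) * ones, so every h_i is exactly 1/n
H_intercept = np.ones((n, n)) / n
print(f"Intercept-only leverage: {H_intercept[0, 0]:.4f}  (1/n = {1/n:.4f})")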


Studentized residuals

The problem

Raw residuals $e_i = y_i - \hat{y}_i$ don’t account for the fact that high-leverage points tend to have smaller residuals. This is counterintuitive but makes sense: high-leverage points pull the regression line toward themselves, which shrinks their own residuals. A small raw residual at a high-leverage point might actually be more surprising than a large residual at a typical point.

Building intuition

Studentized residuals adjust each residual by its expected standard deviation, which depends on leverage:

$$r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}}$$

The denominator shrinks for high-leverage points (because $h_i$ is larger), which inflates the standardized residual. This puts all residuals on a comparable scale, making genuine outliers easier to spot.

# Generate data with one outlier
y = 3 + 2 * x + np.random.normal(0, 1, n)
y[-1] = 50  # outlier at the extreme x-value

beta = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta
sigma = np.sqrt(np.sum(residuals**2) / (n - 2))
studentized = residuals / (sigma * np.sqrt(1 - leverage))

print(f"Raw residual (outlier):         {residuals[-1]:.4f}")
print(f"Studentized residual (outlier): {studentized[-1]:.4f}")
Raw residual (outlier):         5.3180
Studentized residual (outlier): 4.9373

The formal statement

Theorem DX.4: $r_i = e_i / (\hat{\sigma}\sqrt{1 - h_i})$

Why does dividing by $\sqrt{1 - h_i}$ help?

High-leverage points have $\text{Var}(e_i) = \sigma^2(1 - h_i)$, which is smaller than $\sigma^2$ because they influence the fit more. Dividing by $\sqrt{1 - h_i}$ standardizes all residuals to a comparable scale, making outliers easier to detect. Without this adjustment, a high-leverage outlier can hide behind its artificially small raw residual.
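
The variance claim itself can be checked by simulation: under i.i.d. errors, residuals at high-leverage points really are less variable. A sketch assuming $\sigma = 1$ and reusing `n`, `H`, and `leverage` from above; the seed and number of simulations are arbitrary.

# Residuals satisfy e = (I - H) * error, so Var(e_i) = sigma^2 * (1 - h_i)
rng = np.random.default_rng(0)
M = np.eye(n) - H                          # residual-maker matrix
errors = rng.normal(0, 1, size=(5000, n))  # 5000 simulated error vectors, sigma = 1
resid_sims = errors @ M                    # M is symmetric, so this applies M to each row
empirical_var = resid_sims.var(axis=0)

print(f"Theory at extreme point:     {1 - leverage[-1]:.4f}")
print(f"Simulation at extreme point: {empirical_var[-1]:.4f}")
print(f"Theory at a typical point:   {1 - leverage[0]:.4f}")
print(f"Simulation at typical point: {empirical_var[0]:.4f}")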


Cook’s distance: combining leverage and residuals

The problem

A point can be influential in two ways: it can have high leverage (extreme predictor values) or a large residual (doesn’t fit the pattern), or both. Neither measure alone tells the full story. Cook’s distance combines them into a single influence measure.

Building intuition

$$D_i = \frac{e_i^2}{p\hat{\sigma}^2} \cdot \frac{h_i}{(1 - h_i)^2}$$

The first factor captures residual magnitude. The second captures leverage. Large Cook’s distance means the point substantially changes the fitted coefficients when removed. A point needs both a notable residual and some leverage to be truly influential --- a large residual at a low-leverage point barely moves the line, and a high-leverage point that fits the pattern well doesn’t distort the fit.

cooks_d = (residuals**2 / (p * sigma**2)) * (leverage / (1 - leverage)**2)  # p = 2 here
print(f"Cook's D (outlier):  {cooks_d[-1]:.4f}")
print(f"Cook's D (typical):  {np.median(cooks_d[:-1]):.4f}")
Cook's D (outlier):  54.2664
Cook's D (typical):  0.0060

The formal statement

Theorem DX.3: $D_i = \frac{e_i^2}{p\hat{\sigma}^2} \cdot \frac{h_i}{(1 - h_i)^2}$

What does Cook’s distance measure?

Cook’s distance measures the squared distance between the coefficient vectors fitted with and without observation $i$, normalized by the estimated covariance of the coefficients. Large residuals OR high leverage can produce a large $D_i$. The most influential points combine both --- they sit far from the center of the data and don’t follow the pattern established by the other observations.
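
That leave-one-out interpretation can be verified directly: refit without the extreme observation and compute the coefficient-shift form of $D_i$, which matches the closed-form expression. A sketch reusing `X`, `y`, `beta`, `sigma`, and `cooks_d` from the cells above.

# Delete the extreme observation, refit, and measure how far the coefficients move
X_del = np.delete(X, -1, axis=0)
y_del = np.delete(y, -1)
beta_del = np.linalg.lstsq(X_del, y_del, rcond=None)[0]

diff = beta - beta_del
D_loo = diff @ (X.T @ X) @ diff / (X.shape[1] * sigma**2)

print(f"Cook's D via coefficient shift: {D_loo:.4f}")
print(f"Cook's D via closed form:       {cooks_d[-1]:.4f}")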

Seeing it in bossanova

from bossanova.internal.maths.inference import (
    compute_leverage,
    compute_cooks_distance,
    compute_studentized_residuals,
)

XtX_inv = np.linalg.inv(X.T @ X)
h = compute_leverage(X, weights=None, XtWX_inv=XtX_inv)
mse = np.sum(residuals**2) / (n - X.shape[1])
sigma_hat = np.sqrt(mse)
cooks = compute_cooks_distance(residuals, h, sigma_hat, X.shape[1])
stud_r = compute_studentized_residuals(residuals, h, sigma_hat)

print(f"Max leverage: {np.max(h):.4f} (observation {np.argmax(h)})")
print(f"Max Cook's D: {np.max(cooks):.4f} (observation {np.argmax(cooks)})")
print(f"Max |studentized residual|: {np.max(np.abs(stud_r)):.4f}")
Max leverage: 0.8166 (observation 29)
Max Cook's D: 54.2664 (observation 29)
Max |studentized residual|: 4.9373

Putting it together: identifying influential observations

In practice, you don’t look at one diagnostic in isolation. The standard approach is to flag observations that exceed common thresholds, then investigate the flagged points to decide whether they are data errors, genuinely unusual cases, or signs that the model is misspecified.

# Common diagnostic thresholds
threshold_leverage = 2 * X.shape[1] / n  # 2p/n rule of thumb
threshold_cooks = 4 / n
threshold_resid = 2.0  # |r| > 2

flagged_leverage = np.where(leverage > threshold_leverage)[0]
flagged_cooks = np.where(cooks_d > threshold_cooks)[0]
flagged_resid = np.where(np.abs(studentized) > threshold_resid)[0]

print(f"High leverage (>{threshold_leverage:.2f}): observations {flagged_leverage}")
print(f"High Cook's D (>{threshold_cooks:.2f}):    observations {flagged_cooks}")
print(f"Large residual (>2.0):          observations {flagged_resid}")
High leverage (>0.13): observations [29]
High Cook's D (>0.13):    observations [29]
Large residual (>2.0):          observations [ 6 20 29]
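
A compact way to review the flagged points together is to print each one’s three diagnostics side by side. A quick sketch reusing the arrays defined above.

# Union of all flagged observations, with their three diagnostics side by side
flagged = np.unique(np.concatenate([flagged_leverage, flagged_cooks, flagged_resid]))
for i in flagged:
    print(
        f"obs {i:2d}: leverage={leverage[i]:.3f}  "
        f"Cook's D={cooks_d[i]:.3f}  |r|={abs(studentized[i]):.2f}"
    )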

Summary

| Theorem | Statement | What it means |
| --- | --- | --- |
| DX.1 | $\sum h_i = p$ | Total leverage is fixed |
| DX.2 | $1/n \leq h_i \leq 1$ | Leverage is bounded |
| DX.3 | Cook’s D formula | Combines residual and leverage |
| DX.4 | Studentized residual formula | Adjusts residuals for leverage |