Fitting a model is only half the job. A model might produce coefficient estimates, confidence intervals, and p-values --- but none of that means much if a handful of observations are unduly warping the results. Diagnostics tell you whether the model is trustworthy by identifying observations that exert outsized influence on the fit.
This page covers the three core tools for spotting problematic observations: leverage (how extreme is a point’s position in predictor space?), studentized residuals (how poorly does the model predict a point, adjusted for its leverage?), and Cook’s distance (how much do the fitted coefficients change if we remove a point?). Together, they form a complete picture of observation-level influence. Every theorem is backed by a property-based test in bossanova.
## Leverage: who pulls the hardest?

### The problem
Not all observations are created equal. A data point far from the center of the predictor space has more influence on the regression line --- it acts as a fulcrum, pulling the fitted line toward itself. Leverage quantifies exactly how much pull each observation has.
### Building intuition
Recall the hat matrix $H = X(X^\top X)^{-1}X^\top$ that maps $y$ to $\hat{y}$. The diagonal element $h_{ii}$ measures how much observation $i$'s $y$-value influences its own fitted value $\hat{y}_i$. Points with high leverage dominate their own prediction.
```python
import numpy as np

n = 30
x = np.concatenate([np.random.normal(0, 1, n - 1), [10.0]])  # one extreme point
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(f"Typical leverage: {np.mean(leverage[:-1]):.4f}")
print(f"Extreme point: {leverage[-1]:.4f}")
print(f"Average leverage: {np.mean(leverage):.4f}")
```

```
Typical leverage: 0.0408
Extreme point: 0.8166
Average leverage: 0.0667
```
### Leverage sum equals p

**Theorem DX.1:** $\sum_{i=1}^{n} h_{ii} = p$, where $p$ is the number of parameters.
```python
p = X.shape[1]
print(f"sum(leverage) = {np.sum(leverage):.4f}")
print(f"p = {p}")
print(f"Match? {np.isclose(np.sum(leverage), p)}")
```

```
sum(leverage) = 2.0000
p = 2
Match? True
```
Why does this work?
Since $\sum_i h_{ii} = \operatorname{tr}(H)$ and $\operatorname{tr}(H) = p$ (from OLS.6), the total leverage budget is fixed. Each observation “consumes” some of this budget. High-leverage points consume a disproportionate share --- and every unit of leverage they take comes at the expense of the remaining observations.
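The trace identity can be checked numerically. This is a standalone sketch with a hypothetical random design (not the running example above); it relies on the cyclic property of the trace, $\operatorname{tr}(X(X^\top X)^{-1}X^\top) = \operatorname{tr}((X^\top X)^{-1}X^\top X) = \operatorname{tr}(I_p) = p$:

```python
import numpy as np

# Hypothetical design: intercept plus three random predictors.
rng = np.random.default_rng(0)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# tr(H) = tr(X (X'X)^{-1} X') = tr((X'X)^{-1} X'X) = tr(I_p) = p
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(f"{np.trace(H):.4f}")  # 4.0000
```

Note the check holds for any full-rank design, not just this one: the trace collapses to $p$ regardless of the predictor values.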
### Leverage bounds

**Theorem DX.2:** $\frac{1}{n} \le h_{ii} \le 1$ for all observations (for a model with an intercept).
```python
print(f"Min leverage: {np.min(leverage):.4f} (lower bound 1/n = {1/n:.4f})")
print(f"Max leverage: {np.max(leverage):.4f} (upper bound = 1)")
print(f"All in [1/n, 1]? {np.all((leverage >= 1/n - 1e-10) & (leverage <= 1 + 1e-10))}")
```

```
Min leverage: 0.0333 (lower bound 1/n = 0.0333)
Max leverage: 0.8166 (upper bound = 1)
All in [1/n, 1]? True
```
The lower bound means each point gets at least an equal share of the leverage budget. The upper bound 1 means a single point can at most completely determine its own fitted value --- the model passes exactly through it.
Why these bounds?
The lower bound comes from the intercept column: every observation contributes equally through the intercept. The upper bound 1 comes from $H$ being a projection matrix (its eigenvalues are 0 or 1, so its diagonal elements are bounded by 1).
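The projection-matrix argument can be verified directly. This sketch (again on a hypothetical random design) computes the eigenvalues of $H$ and confirms they are all 0 or 1, with exactly $p$ ones:

```python
import numpy as np

# Hypothetical design: intercept plus two random predictors.
rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

# H is symmetric, so use the symmetric eigensolver.
eigvals = np.linalg.eigvalsh(H)

# Every eigenvalue of a projection matrix is 0 or 1 ...
assert np.all(np.isclose(eigvals, 0) | np.isclose(eigvals, 1))
# ... and exactly p of them are 1, matching the rank of X.
assert np.isclose(eigvals.sum(), p)
```

Since every diagonal element of a symmetric matrix lies between its smallest and largest eigenvalues, the bound $h_{ii} \le 1$ follows immediately.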
## Studentized residuals

### The problem
Raw residuals don’t account for the fact that high-leverage points tend to have smaller residuals. This is counterintuitive but makes sense: high-leverage points pull the regression line toward themselves, which shrinks their own residuals. A small raw residual at a high-leverage point might actually be more surprising than a large residual at a typical point.
### Building intuition
Studentized residuals adjust each residual by its estimated standard deviation, which depends on leverage:

$$r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$$

The denominator shrinks for high-leverage points (because $h_{ii}$ is larger), which inflates the standardized residual. This puts all residuals on a comparable scale, making genuine outliers easier to spot.
```python
# Generate data with one outlier
y = 3 + 2 * x + np.random.normal(0, 1, n)
y[-1] = 50  # outlier at the extreme x-value

beta = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta
sigma = np.sqrt(np.sum(residuals**2) / (n - 2))
studentized = residuals / (sigma * np.sqrt(1 - leverage))

print(f"Raw residual (outlier): {residuals[-1]:.4f}")
print(f"Studentized residual (outlier): {studentized[-1]:.4f}")
```

```
Raw residual (outlier): 5.3180
Studentized residual (outlier): 4.9373
```
### The formal statement

**Theorem DX.4:** $r_i = \dfrac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$, where $e_i$ is the raw residual and $h_{ii}$ the leverage of observation $i$.
Why does dividing by $\sqrt{1 - h_{ii}}$ help?
High-leverage points have $\operatorname{Var}(e_i) = \sigma^2(1 - h_{ii})$, which is smaller than $\sigma^2$ because they influence the fit more. Dividing by $\hat{\sigma}\sqrt{1 - h_{ii}}$ standardizes all residuals to a comparable scale, making outliers easier to detect. Without this adjustment, a high-leverage outlier can hide behind its artificially small raw residual.
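The variance claim $\operatorname{Var}(e_i) = \sigma^2(1 - h_{ii})$ can be checked by simulation. This sketch fixes a design with one extreme point (mirroring the setup above, but as a self-contained hypothetical example), redraws the errors many times, and compares empirical residual variances to theory:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, n_sims = 30, 1.0, 20000
x = np.concatenate([rng.normal(0, 1, n - 1), [10.0]])  # fixed design, one extreme point
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Residuals are e = (I - H) eps (the signal X*beta is annihilated by I - H),
# so Var(e_i) = sigma^2 * (1 - h_ii).
eps = rng.normal(0, sigma, size=(n_sims, n))
resid = eps @ (np.eye(n) - H).T
empirical_var = resid.var(axis=0)

print(f"High-leverage point: empirical {empirical_var[-1]:.3f} "
      f"vs theory {sigma**2 * (1 - h[-1]):.3f}")
```

The high-leverage point's residual variance comes out far below $\sigma^2 = 1$, which is exactly why its raw residual understates how surprising it is.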
## Cook’s distance: combining leverage and residuals

### The problem
A point can be influential in two ways: it can have high leverage (extreme predictor values) or a large residual (doesn’t fit the pattern), or both. Neither measure alone tells the full story. Cook’s distance combines them into a single influence measure.
### Building intuition
Cook’s distance factors into two pieces:

$$D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

The first factor captures residual magnitude. The second captures leverage. Large Cook’s distance means the point substantially changes the fitted coefficients when removed. A point needs both a notable residual and some leverage to be truly influential --- a large residual at a low-leverage point barely moves the line, and a high-leverage point that fits the pattern well doesn’t distort the fit.
```python
cooks_d = (residuals**2 / (2 * sigma**2)) * (leverage / (1 - leverage)**2)

print(f"Cook's D (outlier): {cooks_d[-1]:.4f}")
print(f"Cook's D (typical): {np.median(cooks_d[:-1]):.4f}")
```

```
Cook's D (outlier): 54.2664
Cook's D (typical): 0.0060
```
### The formal statement

**Theorem DX.3:** $D_i = \dfrac{(\hat{\beta} - \hat{\beta}_{(i)})^\top X^\top X\,(\hat{\beta} - \hat{\beta}_{(i)})}{p\,\hat{\sigma}^2} = \dfrac{e_i^2}{p\,\hat{\sigma}^2} \cdot \dfrac{h_{ii}}{(1 - h_{ii})^2}$, where $\hat{\beta}_{(i)}$ is the fit with observation $i$ removed.
What does Cook’s distance measure?
Cook’s distance measures the squared distance between the coefficient vector with and without observation $i$, normalized by the covariance of $\hat{\beta}$. Large residuals OR high leverage can produce large $D_i$. The most influential points combine both --- they sit far from the center of the data and don’t follow the pattern established by the other observations.
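That leave-one-out interpretation can be verified brute-force. This sketch (on hypothetical simulated data, not the page's running example) refits the model without each observation and confirms the resulting coefficient distance matches the closed-form $D_i$ exactly:

```python
import numpy as np

# Hypothetical data: one extreme-x outlier, as in the examples above.
rng = np.random.default_rng(3)
n, p = 30, 2
x = np.concatenate([rng.normal(0, 1, n - 1), [10.0]])
X = np.column_stack([np.ones(n), x])
y = 3 + 2 * x + rng.normal(0, 1, n)
y[-1] += 20  # outlier at the extreme x-value

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
sigma2 = np.sum(e**2) / (n - p)

# Closed-form Cook's distance
D = (e**2 / (p * sigma2)) * (h / (1 - h) ** 2)

# Direct leave-one-out version: refit without observation i, then measure
# D_i = (beta - beta_i)' X'X (beta - beta_i) / (p * sigma2)
D_direct = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    diff = beta - beta_i
    D_direct[i] = diff @ (X.T @ X) @ diff / (p * sigma2)

print(f"Formulas agree? {np.allclose(D, D_direct)}")  # True
```

The agreement is exact (up to floating point), because the closed form follows from the rank-one downdate identity $\hat{\beta} - \hat{\beta}_{(i)} = (X^\top X)^{-1} x_i\, e_i / (1 - h_{ii})$.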
## Seeing it in bossanova
```python
from bossanova.internal.maths.inference import (
    compute_leverage,
    compute_cooks_distance,
    compute_studentized_residuals,
)

XtX_inv = np.linalg.inv(X.T @ X)
h = compute_leverage(X, weights=None, XtWX_inv=XtX_inv)

mse = np.sum(residuals**2) / (n - X.shape[1])
sigma_hat = np.sqrt(mse)
cooks = compute_cooks_distance(residuals, h, sigma_hat, X.shape[1])
stud_r = compute_studentized_residuals(residuals, h, sigma_hat)

print(f"Max leverage: {np.max(h):.4f} (observation {np.argmax(h)})")
print(f"Max Cook's D: {np.max(cooks):.4f} (observation {np.argmax(cooks)})")
print(f"Max |studentized residual|: {np.max(np.abs(stud_r)):.4f}")
```

```
Max leverage: 0.8166 (observation 29)
Max Cook's D: 54.2664 (observation 29)
Max |studentized residual|: 4.9373
```
## Putting it together: identifying influential observations
In practice, you don’t look at one diagnostic in isolation. The standard approach is to flag observations that exceed common thresholds, then investigate the flagged points to decide whether they are data errors, genuinely unusual cases, or signs that the model is misspecified.
```python
# Common diagnostic thresholds
threshold_leverage = 2 * X.shape[1] / n  # 2p/n rule of thumb
threshold_cooks = 4 / n                  # 4/n rule of thumb
threshold_resid = 2.0                    # |r| > 2

flagged_leverage = np.where(leverage > threshold_leverage)[0]
flagged_cooks = np.where(cooks_d > threshold_cooks)[0]
flagged_resid = np.where(np.abs(studentized) > threshold_resid)[0]

print(f"High leverage (>{threshold_leverage:.2f}): observations {flagged_leverage}")
print(f"High Cook's D (>{threshold_cooks:.2f}): observations {flagged_cooks}")
print(f"Large residual (>2.0): observations {flagged_resid}")
```

```
High leverage (>0.13): observations [29]
High Cook's D (>0.13): observations [29]
Large residual (>2.0): observations [ 6 20 29]
```
## Summary
| Theorem | Statement | What it means |
|---|---|---|
| DX.1 | $\sum_{i=1}^{n} h_{ii} = p$ | Total leverage is fixed |
| DX.2 | $1/n \le h_{ii} \le 1$ | Leverage is bounded |
| DX.3 | $D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$ | Combines residual and leverage |
| DX.4 | $r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$ | Adjusts residuals for leverage |