Fitting a model gives you coefficient estimates, but estimates alone don’t tell you much.
If you fit y ~ x and get $\hat{\beta} = 0.42$, your first question should be: how
uncertain is that number? Could it easily be 0.10 or 0.80, or is 0.42 tightly pinned
down by the data?
Statistical inference provides the machinery for answering these questions. Standard errors quantify how much each coefficient would bounce around if you repeated the experiment. Confidence intervals give plausible ranges for the true value. Hypothesis tests formalize “is this effect real, or could chance explain it?” This page shows where those numbers come from, building from variance-covariance matrices up through the Wald and F-tests.
From variance-covariance to standard errors¶
The problem¶
After fitting $y = X\beta + \varepsilon$, we get estimates $\hat{\beta}$. But how precise is each coefficient? A coefficient of 0.42 means something very different if its standard error is 0.02 versus 2.0.
Building intuition¶
The variance-covariance matrix captures how much each coefficient varies across hypothetical repeated samples, and how coefficients co-vary with each other. The diagonal entries are the variances of each coefficient. The standard errors are just the square roots of those diagonal entries.
```python
import numpy as np
from scipy import stats

np.set_printoptions(precision=4, suppress=True)  # compact array printing

# Small regression: n=50, 3 predictors + intercept
n, p = 50, 4
X = np.column_stack([np.ones(n), np.random.randn(n, 3)])
true_beta = np.array([2.0, 0.5, -1.0, 0.3])
y = X @ true_beta + np.random.randn(n) * 1.5

# Fit via QR decomposition
Q, R = np.linalg.qr(X)
beta_hat = np.linalg.solve(R, Q.T @ y)

# Residual variance: sigma^2 = RSS / (n - p)
residuals = y - X @ beta_hat
sigma2 = np.sum(residuals**2) / (n - p)
print(f"sigma^2 = {sigma2:.4f}")

# Variance-covariance matrix: V = sigma^2 * (X'X)^-1
vcov = sigma2 * np.linalg.inv(X.T @ X)
print(f"\nVariance-covariance matrix V:\n{vcov}")

# Standard errors: sqrt of diagonal
se = np.sqrt(np.diag(vcov))
print(f"\nStandard errors: {se}")
```

```
sigma^2 = 1.8308

Variance-covariance matrix V:
[[0.0378 0.0054 0.0049 0.0041]
 [0.0054 0.072  0.0089 0.0121]
 [0.0049 0.0089 0.0377 0.003 ]
 [0.0041 0.0121 0.003  0.0357]]

Standard errors: [0.1944 0.2683 0.1942 0.189 ]
```
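The "repeated samples" interpretation of $V$ can be checked directly by simulation. The sketch below is an illustrative addition (it generates its own data with its own seed, separate from the run above): it refits the same fixed design on many simulated datasets and compares the empirical covariance of the estimates against $\sigma^2 (X^\top X)^{-1}$ computed with the true error variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 4, 1.5
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
true_beta = np.array([2.0, 0.5, -1.0, 0.3])

reps = 5000
betas = np.empty((reps, p))
for i in range(reps):
    # New noise each replicate, same design X
    y = X @ true_beta + rng.standard_normal(n) * sigma
    betas[i] = np.linalg.lstsq(X, y, rcond=None)[0]

V_theory = sigma**2 * np.linalg.inv(X.T @ X)   # formula, with the true sigma^2
V_empirical = np.cov(betas, rowvar=False)      # spread actually observed
print("max |empirical - theoretical| entry:",
      np.max(np.abs(V_empirical - V_theory)))
```

With 5,000 replicates the two matrices agree to a few decimal places, entry by entry.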
The formal statement¶
Theorem INF.1: $\mathrm{SE}(\hat{\beta}_j) = \sqrt{V_{jj}}$, where $V = \hat{\sigma}^2 (X^\top X)^{-1}$ is the variance-covariance matrix of the coefficient estimates.
Why does this work?
The diagonal element $V_{jj}$ is the variance of the $j$-th coefficient estimate across hypothetical repeated samples. Taking the square root gives the standard deviation of the sampling distribution of $\hat{\beta}_j$, which is the standard error. The off-diagonal elements capture correlations between coefficients but do not directly affect individual standard errors.
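The off-diagonal structure is easier to read after rescaling $V$ into a correlation matrix. A minimal sketch (illustrative data, not the example above) with two deliberately collinear predictors shows why those entries matter:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + 0.3 * rng.standard_normal(n)   # strongly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

sigma2 = 1.0                                    # assume unit error variance
vcov = sigma2 * np.linalg.inv(X.T @ X)

se = np.sqrt(np.diag(vcov))
corr = vcov / np.outer(se, se)                  # rescale covariances to correlations
print(np.round(corr, 3))
```

The two slope estimates come out strongly negatively correlated: when predictors overlap, the data cannot cleanly separate their effects, and the joint tests later on this page account for exactly that.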
Seeing it in bossanova¶
```python
from bossanova.internal.maths.inference import compute_se_from_vcov

vcov = sigma2 * np.linalg.inv(X.T @ X)
se = compute_se_from_vcov(vcov)
print("Standard errors:", se)
```

```
Standard errors: [0.1944 0.2683 0.1942 0.189 ]
```
Confidence intervals¶
The problem¶
A point estimate is a single number. We want a range that has a known probability of containing the true value. But where does that range come from?
Building intuition¶
The confidence interval is $\hat{\beta}_j \pm t_{\text{crit}} \cdot \mathrm{SE}(\hat{\beta}_j)$. The critical value $t_{\text{crit}}$ comes from the $t$-distribution with $n - p$ degrees of freedom. For 95% confidence with large samples, it’s approximately 1.96 (the familiar $z$-value), but for smaller samples the $t$-distribution has heavier tails, making the interval wider.
```python
# 95% confidence intervals for each coefficient
alpha = 0.05
df = n - p
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(f"t-critical (df={df}): {t_crit:.4f}")

ci_lo = beta_hat - t_crit * se
ci_hi = beta_hat + t_crit * se
print("\nCoefficient Estimate CI_lo CI_hi")
for j in range(p):
    print(f" beta_{j}: {beta_hat[j]:>7.4f} {ci_lo[j]:>7.4f} {ci_hi[j]:>7.4f}")

# Verify midpoint equals estimate
midpoint = (ci_lo + ci_hi) / 2
print(f"\nMidpoint == beta_hat? {np.allclose(midpoint, beta_hat)}")
```

```
t-critical (df=46): 2.0129

Coefficient Estimate CI_lo CI_hi
 beta_0:  2.1118  1.7206  2.5030
 beta_1:  0.7009  0.1609  1.2409
 beta_2: -1.2111 -1.6020 -0.8202
 beta_3:  0.3008 -0.0797  0.6812

Midpoint == beta_hat? True
```
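The "95%" is a property of the procedure, not of any single interval: across repeated experiments, about 95% of intervals built this way contain the true coefficient. A self-contained simulation sketch (illustrative, with its own seed and data) makes that concrete:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
true_beta = np.array([2.0, 0.5, -1.0, 0.3])
t_crit = stats.t.ppf(0.975, n - p)
xtx_inv_11 = np.linalg.inv(X.T @ X)[1, 1]   # fixed design => compute once

reps = 2000
covered = 0
for _ in range(reps):
    y = X @ true_beta + rng.standard_normal(n) * 1.5
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (n - p)
    se1 = np.sqrt(sigma2 * xtx_inv_11)
    # Does the 95% CI for beta_1 contain the true value?
    covered += abs(beta_hat[1] - true_beta[1]) <= t_crit * se1

print(f"Empirical coverage for beta_1: {covered / reps:.3f}")
```

Under normal errors the $t$-based interval is exact, so the empirical coverage lands very close to 0.95.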
The formal statement¶
Theorem INF.2: $(\mathrm{CI}_{\text{lo}} + \mathrm{CI}_{\text{hi}}) / 2 = \hat{\beta}_j$ -- confidence intervals are centered at the point estimate.
Why is this true?
The CI is constructed as $\hat{\beta}_j \pm t_{\text{crit}} \cdot \mathrm{SE}(\hat{\beta}_j)$. The midpoint of $[\hat{\beta}_j - c,\ \hat{\beta}_j + c]$ is exactly $\hat{\beta}_j$ for any half-width $c$. This holds for any symmetric reference distribution ($t$ or $z$) and any confidence level.
Width tells you precision¶
```python
# CI width = 2 * t_crit * SE
ci_width = ci_hi - ci_lo
expected_width = 2 * t_crit * se
print("Coefficient SE CI width 2*t*SE")
for j in range(p):
    print(f" beta_{j}: {se[j]:.4f} {ci_width[j]:.4f} {expected_width[j]:.4f}")
print(f"\nWidth == 2*t*SE? {np.allclose(ci_width, expected_width)}")
```

```
Coefficient SE CI width 2*t*SE
 beta_0: 0.1944 0.7824 0.7824
 beta_1: 0.2683 1.0800 1.0800
 beta_2: 0.1942 0.7818 0.7818
 beta_3: 0.1890 0.7609 0.7609

Width == 2*t*SE? True
```
Wider intervals mean less precise estimates. A coefficient with a wide CI relative to its magnitude is poorly determined by the data.
Seeing it in bossanova¶
```python
from bossanova.internal.maths.inference import compute_ci

t_crit = stats.t.ppf(1 - 0.05 / 2, df)
ci_lo, ci_hi = compute_ci(beta_hat, se, critical=t_crit)
for j in range(len(beta_hat)):
    print(f" beta_{j}: {beta_hat[j]:.4f} [{ci_lo[j]:.4f}, {ci_hi[j]:.4f}]")
```

```
 beta_0: 2.1118 [1.7206, 2.5030]
 beta_1: 0.7009 [0.1609, 1.2409]
 beta_2: -1.2111 [-1.6020, -0.8202]
 beta_3: 0.3008 [-0.0797, 0.6812]
```
The Wald statistic¶
The problem¶
We want to test whether a set of coefficients (a “contrast”) is significantly different from zero. For a single coefficient, the $t$-statistic works. But what if we want to test multiple coefficients simultaneously?
Building intuition¶
For a single coefficient, the $t$-statistic is $t_j = \hat{\beta}_j / \mathrm{SE}(\hat{\beta}_j)$. For multiple coefficients simultaneously, we need the Wald statistic, which generalizes this idea to account for correlations between estimates. The Wald statistic uses a contrast matrix $L$ that picks out which linear combinations of coefficients to test.
```python
# Single coefficient: t-statistic
t_stat = beta_hat[1] / se[1]
print(f"t-statistic for beta_1: {t_stat:.4f}")
print(f"t^2: {t_stat**2:.4f}")

# Same thing via Wald formula: W = (L @ beta)' @ inv(L @ V @ L') @ (L @ beta)
L_single = np.eye(p)[1:2]               # Picks out beta_1
Lb = L_single @ beta_hat                # = [beta_1]
LVL = L_single @ vcov @ L_single.T      # = Var(beta_1), a 1x1 matrix
W_single = Lb @ np.linalg.inv(LVL) @ Lb
print(f"\nWald statistic (q=1): {W_single:.4f}")
print(f"Matches t^2? {np.isclose(W_single, t_stat**2)}")

# Multivariate: test beta_1 and beta_2 jointly
L_multi = np.eye(p)[1:3]                # Picks out beta_1 and beta_2
Lb_multi = L_multi @ beta_hat
LVL_multi = L_multi @ vcov @ L_multi.T
W_multi = Lb_multi @ np.linalg.inv(LVL_multi) @ Lb_multi
print(f"\nWald statistic (q=2): {W_multi:.4f}")
```

```
t-statistic for beta_1: 2.6127
t^2: 6.8264

Wald statistic (q=1): 6.8264
Matches t^2? True

Wald statistic (q=2): 52.8611
```
The formal statement¶
Theorem INF.4: $W = (L\hat{\beta})^\top \, (L V L^\top)^{-1} \, (L\hat{\beta})$
What does this measure?
The Wald statistic is a quadratic form that measures how many standard errors the contrast estimates are from zero, accounting for their correlations through the variance-covariance structure. For a single contrast ($q = 1$), this reduces to $t^2$. For multiple contrasts, the matrix inverse “decorrelates” the estimates before measuring their distance from zero.
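To see why the inverse matters, compare $W$ against the naive sum of squared $t$-statistics when the contrast estimates are correlated. The numbers below are hand-picked for illustration, not taken from the example above:

```python
import numpy as np

# Two contrast estimates, each 2 "standard errors" from zero...
Lb = np.array([0.2, 0.2])
# ...whose sampling errors are strongly positively correlated (corr = 0.9)
LVL = np.array([[0.010, 0.009],
                [0.009, 0.010]])

naive = float(np.sum(Lb**2 / np.diag(LVL)))   # sum of t^2, ignores correlation
W = float(Lb @ np.linalg.solve(LVL, Lb))      # proper Wald quadratic form
print(f"naive sum of t^2: {naive:.2f}, Wald W: {W:.2f}")
```

Positively correlated estimates carry partially redundant evidence, so the proper quadratic form comes out smaller than the naive sum; with negatively correlated estimates it would come out larger.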
```python
# Verify: Wald with q=1 is exactly t^2
L = np.eye(p)[1:2]
Lb = L @ beta_hat
LVL = L @ vcov @ L.T
W = Lb @ np.linalg.inv(LVL) @ Lb
t2 = (beta_hat[1] / se[1]) ** 2
print(f"Wald (q=1): {W:.6f}")
print(f"t^2: {t2:.6f}")
print(f"Match: {np.isclose(W, t2)}")
```

```
Wald (q=1): 6.826353
t^2: 6.826353
Match: True
```
Seeing it in bossanova¶
```python
from bossanova.internal.maths.inference import (
    compute_wald_statistic,
    compute_contrast_variance,
)

L = np.eye(len(beta_hat))[1:2]  # Test second coefficient
contrast_values = L @ beta_hat
contrast_vcov = compute_contrast_variance(L, vcov)
W = compute_wald_statistic(contrast_values, contrast_vcov)
print(f"Wald statistic: {W:.4f}")
print(f"t^2 (should match): {(beta_hat[1] / se[1])**2:.4f}")
```

```
Wald statistic: 6.8264
t^2 (should match): 6.8264
```
From Wald to F-test¶
The problem¶
The Wald statistic has a chi-squared distribution asymptotically. But for finite samples (which is most real data), the chi-squared approximation can be too liberal. The $F$-distribution provides a finite-sample correction.
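"Too liberal" is easy to see on numbers: for the same Wald statistic, the $\chi^2_q$ reference gives a smaller p-value than the finite-sample $F$ reference, so it rejects too readily when the residual degrees of freedom are small. An illustrative sketch (the values are made up for the demonstration):

```python
from scipy import stats

W, q, df_resid = 6.5, 2, 10               # small residual df makes the gap visible
p_chi2 = stats.chi2.sf(W, q)              # asymptotic reference
p_F = stats.f.sf(W / q, q, df_resid)      # finite-sample reference
print(f"chi-squared p: {p_chi2:.4f}, F p: {p_F:.4f}")
```

Here the chi-squared approximation declares significance at the 5% level while the $F$-test does not; the gap shrinks as the residual df grows.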
Building intuition¶
$F = W / q$, where $q$ is the number of constraints being tested (the rank of $L$). This normalization accounts for the fact that testing more constraints simultaneously will naturally produce a larger Wald statistic, so we need a higher threshold.
```python
# q=1: F = W/1 = t^2
L1 = np.eye(p)[1:2]
Lb1 = L1 @ beta_hat
LVL1 = L1 @ vcov @ L1.T
W1 = Lb1 @ np.linalg.inv(LVL1) @ Lb1
q1 = L1.shape[0]
F1 = W1 / q1
t_sq = (beta_hat[1] / se[1]) ** 2
print(f"q=1: W={W1:.4f}, F=W/q={F1:.4f}, t^2={t_sq:.4f}")
print(f"F == t^2? {np.isclose(F1, t_sq)}")

# q=2: test beta_1 and beta_2 jointly
L2 = np.eye(p)[1:3]
Lb2 = L2 @ beta_hat
LVL2 = L2 @ vcov @ L2.T
W2 = Lb2 @ np.linalg.inv(LVL2) @ Lb2
q2 = L2.shape[0]
F2 = W2 / q2
p_val = 1 - stats.f.cdf(F2, dfn=q2, dfd=df)
print(f"\nq=2: W={W2:.4f}, F=W/q={F2:.4f}, p={p_val:.4f}")
```

```
q=1: W=6.8264, F=W/q=6.8264, t^2=6.8264
F == t^2? True

q=2: W=52.8611, F=W/q=26.4305, p=0.0000
```
The formal statement¶
Theorem INF.5: $F = W / q \sim F_{q,\,n-p}$ where $q = \operatorname{rank}(L)$.
Distribution theory
The Wald statistic satisfies $W \sim \chi^2_q$ asymptotically. Dividing by $q$ gives $F = W/q \sim F_{q,\,n-p}$ for finite samples. The $F$-distribution accounts for the estimation uncertainty in $\hat{\sigma}^2$. For large residual degrees of freedom, $F_{q,\,n-p} \approx \chi^2_q / q$ because the $F$-distribution converges to the chi-squared distribution (scaled by $1/q$) as the denominator df goes to infinity.
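The convergence can be checked directly on critical values (an illustrative check, not from the original examples): as the denominator df grows, the 95% critical value of $F_{q,\nu}$ approaches $\chi^2_{q,\,0.95} / q$.

```python
from scipy import stats

q = 3
chi2_over_q = stats.chi2.ppf(0.95, q) / q   # asymptotic limit of the F critical value
for nu in [10, 50, 200, 100_000]:
    f_crit = stats.f.ppf(0.95, q, nu)
    print(f"denominator df = {nu:>6}: F crit = {f_crit:.4f}  (limit {chi2_over_q:.4f})")
```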
t vs z: when do they converge?¶
```python
# For large df, the t-distribution is nearly identical to the normal
print(" df t_crit(95%) z_crit(95%) difference")
print(" " + "-" * 50)
z_crit = stats.norm.ppf(0.975)
for test_df in [5, 10, 30, 60, 120, 500, 10000]:
    t_crit_val = stats.t.ppf(0.975, test_df)
    diff = t_crit_val - z_crit
    print(f" {test_df:<6} {t_crit_val:.6f} {z_crit:.6f} {diff:+.6f}")
```

```
 df t_crit(95%) z_crit(95%) difference
 --------------------------------------------------
 5      2.570582 1.959964 +0.610618
 10     2.228139 1.959964 +0.268175
 30     2.042272 1.959964 +0.082308
 60     2.000298 1.959964 +0.040334
 120    1.979930 1.959964 +0.019966
 500    1.964720 1.959964 +0.004756
 10000  1.960201 1.959964 +0.000237
```
For degrees of freedom above about 100, the $t$ and $z$ critical values are nearly identical. This is why large-sample methods (which use the normal distribution) and finite-sample methods (which use the $t$-distribution) give very similar results in practice when $n$ is large.
Summary¶
| Theorem | Statement | What it means |
|---|---|---|
| INF.1 | $\mathrm{SE}(\hat{\beta}_j) = \sqrt{V_{jj}}$ | Standard errors from variance-covariance |
| INF.2 | $\mathrm{Midpoint(CI)} = \hat{\beta}_j$ | Confidence intervals are symmetric |
| INF.4 | $W = (L\hat{\beta})^\top (L V L^\top)^{-1} (L\hat{\beta})$ | Wald statistic for contrasts |
| INF.5 | $F = W/q \sim F_{q,\,n-p}$ | F-test from Wald statistic |