
Statistical Inference

UCSD Psychology

Fitting a model gives you coefficient estimates, but estimates alone don't tell you much. If you fit y ~ x and get $\hat{\beta}_1 = 0.42$, your first question should be: how uncertain is that number? Could it easily be 0.10 or 0.80, or is 0.42 tightly pinned down by the data?

Statistical inference provides the machinery for answering these questions. Standard errors quantify how much each coefficient would bounce around if you repeated the experiment. Confidence intervals give plausible ranges for the true value. Hypothesis tests formalize “is this effect real, or could chance explain it?” This page shows where those numbers come from, building from variance-covariance matrices up through the Wald and F-tests.


From variance-covariance to standard errors

The problem

After fitting $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, we get $\hat{\boldsymbol{\beta}}$. But how precise is each coefficient? A coefficient of 0.42 means something very different if its standard error is 0.02 versus 2.0.

Building intuition

The variance-covariance matrix $\mathbf{V} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$ captures how much each coefficient varies across hypothetical repeated samples, and how coefficients co-vary with each other. The diagonal entries are the variances of each coefficient. The standard errors are just the square roots of those diagonal entries.

import numpy as np
from scipy import stats

# Small regression: n=50, 3 predictors + intercept
n, p = 50, 4
X = np.column_stack([np.ones(n), np.random.randn(n, 3)])
true_beta = np.array([2.0, 0.5, -1.0, 0.3])
y = X @ true_beta + np.random.randn(n) * 1.5

# Fit via QR decomposition
Q, R = np.linalg.qr(X)
beta_hat = np.linalg.solve(R, Q.T @ y)

# Residual variance: sigma^2 = RSS / (n - p)
residuals = y - X @ beta_hat
sigma2 = np.sum(residuals**2) / (n - p)
print(f"sigma^2 = {sigma2:.4f}")

# Variance-covariance matrix: V = sigma^2 * (X'X)^-1
vcov = sigma2 * np.linalg.inv(X.T @ X)
print(f"\nVariance-covariance matrix V:\n{vcov}")

# Standard errors: sqrt of diagonal
se = np.sqrt(np.diag(vcov))
print(f"\nStandard errors: {se}")
sigma^2 = 1.8308

Variance-covariance matrix V:
[[0.0378 0.0054 0.0049 0.0041]
 [0.0054 0.072  0.0089 0.0121]
 [0.0049 0.0089 0.0377 0.003 ]
 [0.0041 0.0121 0.003  0.0357]]

Standard errors: [0.1944 0.2683 0.1942 0.189 ]

The formal statement

Theorem INF.1: $\text{SE}_j = \sqrt{\text{diag}(\mathbf{V})_j}$, where $\mathbf{V} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$ is the variance-covariance matrix of the coefficient estimates.

Why does this work?

The diagonal element $\mathbf{V}_{jj} = \text{Var}(\hat{\beta}_j)$ is the variance of the $j$-th coefficient estimate across hypothetical repeated samples. Taking the square root gives the standard deviation of the sampling distribution of $\hat{\beta}_j$, which is the standard error. The off-diagonal elements capture covariances between coefficient estimates; they do not enter individual standard errors, but they matter for joint tests.
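The off-diagonal covariances are easier to read as correlations between coefficient estimates. A minimal standalone sketch (re-simulating a similar regression with a fixed seed, so the numbers differ from the output shown above):

```python
import numpy as np

# Re-simulate a similar regression (fixed seed; illustrative, not the fit above)
rng = np.random.default_rng(0)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X @ np.array([2.0, 0.5, -1.0, 0.3]) + rng.standard_normal(n) * 1.5

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - p)
vcov = sigma2 * np.linalg.inv(X.T @ X)

# Turn covariances into correlations: corr_ij = V_ij / sqrt(V_ii * V_jj)
d = np.sqrt(np.diag(vcov))
corr = vcov / np.outer(d, d)
print(np.round(corr, 3))  # diagonal is exactly 1; off-diagonals are typically small here
```

With roughly independent standard-normal predictors, the off-diagonal correlations stay near zero; collinear predictors would push them toward $\pm 1$.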

Seeing it in bossanova

from bossanova.internal.maths.inference import compute_se_from_vcov

vcov = sigma2 * np.linalg.inv(X.T @ X)
se = compute_se_from_vcov(vcov)
print("Standard errors:", se)
Standard errors: [0.1944 0.2683 0.1942 0.189 ]

Confidence intervals

The problem

A point estimate $\hat{\beta}_j$ is a single number. We want a range that has a known probability of containing the true value. But where does that range come from?

Building intuition

The confidence interval is $\hat{\beta}_j \pm t_{\text{crit}} \times \text{SE}_j$. The critical value $t_{\text{crit}}$ comes from the $t$-distribution with $n - p$ degrees of freedom. For 95% confidence with large samples, it's approximately 1.96 (the familiar $z$-value), but for smaller samples the $t$-distribution has heavier tails, making the interval wider.

# 95% confidence intervals for each coefficient
alpha = 0.05
df = n - p
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(f"t-critical (df={df}): {t_crit:.4f}")

ci_lo = beta_hat - t_crit * se
ci_hi = beta_hat + t_crit * se

print("\nCoefficient   Estimate    CI_lo      CI_hi")
for j in range(p):
    print(f"  beta_{j}:     {beta_hat[j]:>7.4f}    {ci_lo[j]:>7.4f}    {ci_hi[j]:>7.4f}")

# Verify midpoint equals estimate
midpoint = (ci_lo + ci_hi) / 2
print(f"\nMidpoint == beta_hat? {np.allclose(midpoint, beta_hat)}")
t-critical (df=46): 2.0129

Coefficient   Estimate    CI_lo      CI_hi
  beta_0:      2.1118     1.7206     2.5030
  beta_1:      0.7009     0.1609     1.2409
  beta_2:     -1.2111    -1.6020    -0.8202
  beta_3:      0.3008    -0.0797     0.6812

Midpoint == beta_hat? True

The formal statement

Theorem INF.2: $(\text{CI}_{\text{lo}} + \text{CI}_{\text{hi}})/2 = \hat{\beta}$ -- confidence intervals are centered at the point estimate.

Why is this true?

The CI is constructed as $\hat{\beta} \pm c \cdot \text{SE}$. The midpoint of $[\hat{\beta} - c \cdot \text{SE},\ \hat{\beta} + c \cdot \text{SE}]$ is exactly $\hat{\beta}$. This holds for any symmetric reference distribution ($t$ or $z$) and any confidence level.

Width tells you precision

# CI width = 2 * t_crit * SE
ci_width = ci_hi - ci_lo
expected_width = 2 * t_crit * se

print("Coefficient   SE        CI width   2*t*SE")
for j in range(p):
    print(f"  beta_{j}:     {se[j]:.4f}    {ci_width[j]:.4f}     {expected_width[j]:.4f}")

print(f"\nWidth == 2*t*SE? {np.allclose(ci_width, expected_width)}")
Coefficient   SE        CI width   2*t*SE
  beta_0:     0.1944    0.7824     0.7824
  beta_1:     0.2683    1.0800     1.0800
  beta_2:     0.1942    0.7818     0.7818
  beta_3:     0.1890    0.7609     0.7609

Width == 2*t*SE? True

Wider intervals mean less precise estimates. A coefficient with a wide CI relative to its magnitude is poorly determined by the data.
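To make "wide relative to its magnitude" concrete, here is a hypothetical comparison of two coefficients with the same estimate but different standard errors (the numbers are illustrative, not from the fit above):

```python
import numpy as np
from scipy import stats

# Two hypothetical coefficients: same estimate, different precision
beta = np.array([0.42, 0.42])
se = np.array([0.05, 0.30])
t_crit = stats.t.ppf(0.975, df=46)

# CI width relative to the estimate's magnitude
rel_width = 2 * t_crit * se / np.abs(beta)
print(np.round(rel_width, 2))
```

The first coefficient's interval is a fraction of its magnitude, so it is well determined; the second's interval is several times its magnitude, so the data barely pin it down.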

Seeing it in bossanova

from bossanova.internal.maths.inference import compute_ci

t_crit = stats.t.ppf(1 - 0.05 / 2, df)
ci_lo, ci_hi = compute_ci(beta_hat, se, critical=t_crit)
for j in range(len(beta_hat)):
    print(f"  beta_{j}: {beta_hat[j]:.4f}  [{ci_lo[j]:.4f}, {ci_hi[j]:.4f}]")
  beta_0: 2.1118  [1.7206, 2.5030]
  beta_1: 0.7009  [0.1609, 1.2409]
  beta_2: -1.2111  [-1.6020, -0.8202]
  beta_3: 0.3008  [-0.0797, 0.6812]

The Wald statistic

The problem

We want to test whether a set of coefficients (a "contrast") is significantly different from zero. For a single coefficient, the $t$-statistic works. But what if we want to test multiple coefficients simultaneously?

Building intuition

For a single coefficient, the $t$-statistic is $t = \hat{\beta}_j / \text{SE}_j$. For multiple coefficients simultaneously, we need the Wald statistic, which generalizes this idea to account for correlations between estimates. The Wald statistic uses a contrast matrix $\mathbf{L}$ that picks out which linear combinations of coefficients to test.

# Single coefficient: t-statistic
t_stat = beta_hat[1] / se[1]
print(f"t-statistic for beta_1: {t_stat:.4f}")
print(f"t^2:                    {t_stat**2:.4f}")

# Same thing via Wald formula: W = (L @ beta)' @ inv(L @ V @ L') @ (L @ beta)
L_single = np.eye(p)[1:2]            # Picks out beta_1
Lb = L_single @ beta_hat             # = [beta_1]
LVL = L_single @ vcov @ L_single.T   # = Var(beta_1), a 1x1 matrix
W_single = Lb @ np.linalg.inv(LVL) @ Lb
print(f"\nWald statistic (q=1):   {W_single:.4f}")
print(f"Matches t^2?            {np.isclose(W_single, t_stat**2)}")

# Multivariate: test beta_1 and beta_2 jointly
L_multi = np.eye(p)[1:3]             # Picks out beta_1 and beta_2
Lb_multi = L_multi @ beta_hat
LVL_multi = L_multi @ vcov @ L_multi.T
W_multi = Lb_multi @ np.linalg.inv(LVL_multi) @ Lb_multi
print(f"\nWald statistic (q=2):   {W_multi:.4f}")
t-statistic for beta_1: 2.6127
t^2:                    6.8264

Wald statistic (q=1):   6.8264
Matches t^2?            True

Wald statistic (q=2):   52.8611

The formal statement

Theorem INF.4: $W = (\mathbf{L}\hat{\boldsymbol{\beta}})^\top (\mathbf{L}\mathbf{V}\mathbf{L}^\top)^{-1} (\mathbf{L}\hat{\boldsymbol{\beta}})$

What does this measure?

The Wald statistic is a quadratic form that measures how many standard errors the contrast estimates are from zero, accounting for their correlations through the variance-covariance structure. For a single contrast ($q=1$), this reduces to $(\mathbf{L}\hat{\boldsymbol{\beta}})^2 / \text{Var}(\mathbf{L}\hat{\boldsymbol{\beta}}) = t^2$. For multiple contrasts, the matrix inverse "decorrelates" the estimates before measuring their distance from zero.

# Verify: Wald with q=1 is exactly t^2
L = np.eye(p)[1:2]
Lb = L @ beta_hat
LVL = L @ vcov @ L.T
W = Lb @ np.linalg.inv(LVL) @ Lb
t2 = (beta_hat[1] / se[1]) ** 2
print(f"Wald (q=1): {W:.6f}")
print(f"t^2:        {t2:.6f}")
print(f"Match:      {np.isclose(W, t2)}")
Wald (q=1): 6.826353
t^2:        6.826353
Match:      True

Seeing it in bossanova

from bossanova.internal.maths.inference import (
    compute_wald_statistic,
    compute_contrast_variance,
)

L = np.eye(len(beta_hat))[1:2]  # Test second coefficient
contrast_values = L @ beta_hat
contrast_vcov = compute_contrast_variance(L, vcov)
W = compute_wald_statistic(contrast_values, contrast_vcov)
print(f"Wald statistic: {W:.4f}")
print(f"t^2 (should match): {(beta_hat[1] / se[1])**2:.4f}")
Wald statistic: 6.8264
t^2 (should match): 6.8264

From Wald to F-test

The problem

The Wald statistic has a chi-squared distribution asymptotically. But for finite samples (which is most real data), the chi-squared approximation can be too liberal. The $F$-distribution provides a finite-sample correction.
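"Too liberal" can be seen directly in the critical values: at the same nominal 5% level, the finite-sample $F$ threshold demands a larger Wald statistic than the asymptotic chi-squared threshold. A quick check (using the $q=2$, $\text{df}=46$ setting from this page):

```python
from scipy import stats

# 5% critical values for a joint test of q=2 contrasts with 46 residual df
q, df_resid = 2, 46
chi2_crit = stats.chi2.ppf(0.95, q)        # asymptotic threshold on W
f_crit = stats.f.ppf(0.95, q, df_resid)    # finite-sample threshold on W/q

print(f"chi2 test rejects when W > {chi2_crit:.3f}")
print(f"F test rejects when W > {q * f_crit:.3f}")  # a stricter (larger) bar
```

The $F$ bar on $W$ is higher, so some samples the chi-squared test would call significant do not pass the finite-sample test.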

Building intuition

$F = W / q$, where $q$ is the number of constraints being tested (the rank of $\mathbf{L}$). This normalization accounts for the fact that testing more constraints simultaneously will naturally produce a larger Wald statistic, so we need a higher threshold.

# q=1: F = W/1 = t^2
L1 = np.eye(p)[1:2]
Lb1 = L1 @ beta_hat
LVL1 = L1 @ vcov @ L1.T
W1 = Lb1 @ np.linalg.inv(LVL1) @ Lb1
q1 = L1.shape[0]
F1 = W1 / q1
t_sq = (beta_hat[1] / se[1]) ** 2

print(f"q=1:  W={W1:.4f},  F=W/q={F1:.4f},  t^2={t_sq:.4f}")
print(f"F == t^2? {np.isclose(F1, t_sq)}")

# q=2: test beta_1 and beta_2 jointly
L2 = np.eye(p)[1:3]
Lb2 = L2 @ beta_hat
LVL2 = L2 @ vcov @ L2.T
W2 = Lb2 @ np.linalg.inv(LVL2) @ Lb2
q2 = L2.shape[0]
F2 = W2 / q2
p_val = 1 - stats.f.cdf(F2, dfn=q2, dfd=df)

print(f"\nq=2:  W={W2:.4f},  F=W/q={F2:.4f},  p={p_val:.4f}")
q=1:  W=6.8264,  F=W/q=6.8264,  t^2=6.8264
F == t^2? True

q=2:  W=52.8611,  F=W/q=26.4305,  p=0.0000

The formal statement

Theorem INF.5: $F = W / q$ where $q = \text{rank}(\mathbf{L})$.

Distribution theory

The Wald statistic satisfies $W \sim \chi^2(q)$ asymptotically. Dividing by $q$ gives $F \sim F(q, \text{df}_{\text{resid}})$ for finite samples. The $F$-distribution accounts for the estimation uncertainty in $\sigma^2$. For large residual degrees of freedom, $F \approx \chi^2(q) / q$ because the $F$-distribution converges to the chi-squared distribution (scaled by $1/q$) as the denominator df goes to infinity.
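The convergence is easy to check numerically: as the denominator degrees of freedom grow, the $F$ critical value approaches $\chi^2_{\text{crit}}(q) / q$ (shown here for $q = 3$):

```python
from scipy import stats

# As denominator df grows, the 95% F critical value approaches chi2_crit / q
q = 3
chi2_over_q = stats.chi2.ppf(0.95, q) / q
for dfd in [10, 50, 200, 5000]:
    f_crit = stats.f.ppf(0.95, q, dfd)
    print(f"dfd={dfd:<5}  F_crit={f_crit:.4f}  chi2_crit/q={chi2_over_q:.4f}")
```

The gap shrinks monotonically, mirroring the $t$-versus-$z$ convergence shown in the next subsection.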

t vs z: when do they converge?

# For large df, the t-distribution is nearly identical to the normal
print("  df     t_crit(95%)   z_crit(95%)   difference")
print("  " + "-" * 50)
z_crit = stats.norm.ppf(0.975)
for test_df in [5, 10, 30, 60, 120, 500, 10000]:
    t_crit_val = stats.t.ppf(0.975, test_df)
    diff = t_crit_val - z_crit
    print(f"  {test_df:<6} {t_crit_val:.6f}      {z_crit:.6f}      {diff:+.6f}")
  df     t_crit(95%)   z_crit(95%)   difference
  --------------------------------------------------
  5      2.570582      1.959964      +0.610618
  10     2.228139      1.959964      +0.268175
  30     2.042272      1.959964      +0.082308
  60     2.000298      1.959964      +0.040334
  120    1.979930      1.959964      +0.019966
  500    1.964720      1.959964      +0.004756
  10000  1.960201      1.959964      +0.000237

For degrees of freedom above about 100, the $t$ and $z$ critical values are nearly identical. This is why large-sample methods (which use the normal distribution) and finite-sample methods (which use the $t$-distribution) give very similar results in practice when $n$ is large.


Summary

| Theorem | Statement | What it means |
| --- | --- | --- |
| INF.1 | $\text{SE} = \sqrt{\text{diag}(\mathbf{V})}$ | Standard errors from variance-covariance |
| INF.2 | $\text{Midpoint}(\text{CI}) = \hat{\beta}$ | Confidence intervals are symmetric |
| INF.4 | $W = (\mathbf{L}\hat{\boldsymbol{\beta}})^\top (\mathbf{L}\mathbf{V}\mathbf{L}^\top)^{-1} (\mathbf{L}\hat{\boldsymbol{\beta}})$ | Wald statistic for contrasts |
| INF.5 | $F = W/q$ | F-test from Wald statistic |
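The whole pipeline can be reproduced end to end with plain NumPy/SciPy. This is a standalone sketch of the steps above, not bossanova's API; the data are re-simulated with a fixed seed, so the numbers differ from the outputs shown earlier:

```python
import numpy as np
from scipy import stats

# Standalone sketch of INF.1-INF.5 (plain NumPy/SciPy, not bossanova's API)
rng = np.random.default_rng(42)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X @ np.array([2.0, 0.5, -1.0, 0.3]) + rng.standard_normal(n) * 1.5

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
df = n - p
sigma2 = resid @ resid / df
vcov = sigma2 * np.linalg.inv(X.T @ X)

se = np.sqrt(np.diag(vcov))                                     # INF.1
t_crit = stats.t.ppf(0.975, df)
ci_lo, ci_hi = beta_hat - t_crit * se, beta_hat + t_crit * se   # INF.2

L = np.eye(p)[1:3]                                  # jointly test beta_1, beta_2
Lb = L @ beta_hat
W = Lb @ np.linalg.solve(L @ vcov @ L.T, Lb)        # INF.4
q = L.shape[0]
F = W / q                                           # INF.5
p_val = stats.f.sf(F, q, df)
print(f"F({q}, {df}) = {F:.3f}, p = {p_val:.4g}")
```

Using `np.linalg.solve` instead of forming the inverse of $\mathbf{L}\mathbf{V}\mathbf{L}^\top$ explicitly is the numerically preferred way to evaluate the quadratic form.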