
Statistical Inference

UCSD Psychology

Fitting a model gives you coefficient estimates, but estimates alone don't tell you much. If you fit y ~ x and get $\hat{\beta}_1 = 0.42$, your first question should be: how uncertain is that number? Could it easily be 0.10 or 0.80, or is 0.42 tightly pinned down by the data?

Statistical inference provides the machinery for answering these questions. Standard errors quantify how much each coefficient would bounce around if you repeated the experiment. Confidence intervals give plausible ranges for the true value. Hypothesis tests formalize “is this effect real, or could chance explain it?” This page shows where those numbers come from, building from variance-covariance matrices up through the Wald and F-tests.


From variance-covariance to standard errors

The problem

After fitting $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, we get $\hat{\boldsymbol{\beta}}$. But how precise is each coefficient? A coefficient of 0.42 means something very different if its standard error is 0.02 versus 2.0.

Building intuition

The variance-covariance matrix $\mathbf{V} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$ captures how much each coefficient varies across hypothetical repeated samples, and how coefficients co-vary with each other. The diagonal entries are the variances of each coefficient. The standard errors are just the square roots of those diagonal entries.

import numpy as np
from scipy import stats

# Small regression: n=50, 3 predictors + intercept
n, p = 50, 4
X = np.column_stack([np.ones(n), np.random.randn(n, 3)])
true_beta = np.array([2.0, 0.5, -1.0, 0.3])
y = X @ true_beta + np.random.randn(n) * 1.5

# Fit via QR decomposition
Q, R = np.linalg.qr(X)
beta_hat = np.linalg.solve(R, Q.T @ y)

# Residual variance: sigma^2 = RSS / (n - p)
residuals = y - X @ beta_hat
sigma2 = np.sum(residuals**2) / (n - p)
print(f"sigma^2 = {sigma2:.4f}")

# Variance-covariance matrix: V = sigma^2 * (X'X)^-1
vcov = sigma2 * np.linalg.inv(X.T @ X)
print(f"\nVariance-covariance matrix V:\n{vcov}")

# Standard errors: sqrt of diagonal
se = np.sqrt(np.diag(vcov))
print(f"\nStandard errors: {se}")
sigma^2 = 1.8308

Variance-covariance matrix V:
[[0.0378 0.0054 0.0049 0.0041]
 [0.0054 0.072  0.0089 0.0121]
 [0.0049 0.0089 0.0377 0.003 ]
 [0.0041 0.0121 0.003  0.0357]]

Standard errors: [0.1944 0.2683 0.1942 0.189 ]

The formal statement

Theorem INF.1: $\text{SE}_j = \sqrt{\text{diag}(\mathbf{V})_j}$, where $\mathbf{V} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$ is the variance-covariance matrix of the coefficient estimates.

Why does this work?

The diagonal element $\mathbf{V}_{jj} = \text{Var}(\hat{\beta}_j)$ is the variance of the $j$-th coefficient estimate across hypothetical repeated samples. Taking the square root gives the standard deviation of the sampling distribution of $\hat{\beta}_j$, which is the standard error. The off-diagonal elements capture covariances between coefficient estimates; they do not enter individual standard errors, but they matter for joint tests.
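The off-diagonal covariances are easier to read as correlations between coefficient estimates. A minimal standalone sketch (re-simulating a similar regression with a fixed seed, so the numbers differ from the output shown above):

```python
import numpy as np

# Re-simulate a similar regression (fixed seed; illustrative, not the fit above)
rng = np.random.default_rng(0)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X @ np.array([2.0, 0.5, -1.0, 0.3]) + rng.standard_normal(n) * 1.5

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - p)
vcov = sigma2 * np.linalg.inv(X.T @ X)

# Turn covariances into correlations: corr_ij = V_ij / sqrt(V_ii * V_jj)
d = np.sqrt(np.diag(vcov))
corr = vcov / np.outer(d, d)
print(np.round(corr, 3))  # diagonal is exactly 1; off-diagonals are typically small here
```

With roughly independent standard-normal predictors, the off-diagonal correlations stay near zero; collinear predictors would push them toward $\pm 1$.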

Seeing it in bossanova

from bossanova.internal.maths.inference import compute_se_from_vcov

vcov = sigma2 * np.linalg.inv(X.T @ X)
se = compute_se_from_vcov(vcov)
print("Standard errors:", se)
Standard errors: [0.1944 0.2683 0.1942 0.189 ]

Confidence intervals

The problem

A point estimate $\hat{\beta}_j$ is a single number. We want a range that has a known probability of containing the true value. But where does that range come from?

Building intuition

The confidence interval is $\hat{\beta}_j \pm t_{\text{crit}} \times \text{SE}_j$. The critical value $t_{\text{crit}}$ comes from the $t$-distribution with $n - p$ degrees of freedom. For 95% confidence with large samples, it's approximately 1.96 (the familiar $z$-value), but for smaller samples the $t$-distribution has heavier tails, making the interval wider.

# 95% confidence intervals for each coefficient
alpha = 0.05
df = n - p
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(f"t-critical (df={df}): {t_crit:.4f}")

ci_lo = beta_hat - t_crit * se
ci_hi = beta_hat + t_crit * se

print("\nCoefficient   Estimate    CI_lo      CI_hi")
for j in range(p):
    print(f"  beta_{j}:     {beta_hat[j]:>7.4f}    {ci_lo[j]:>7.4f}    {ci_hi[j]:>7.4f}")

# Verify midpoint equals estimate
midpoint = (ci_lo + ci_hi) / 2
print(f"\nMidpoint == beta_hat? {np.allclose(midpoint, beta_hat)}")
t-critical (df=46): 2.0129

Coefficient   Estimate    CI_lo      CI_hi
  beta_0:      2.1118     1.7206     2.5030
  beta_1:      0.7009     0.1609     1.2409
  beta_2:     -1.2111    -1.6020    -0.8202
  beta_3:      0.3008    -0.0797     0.6812

Midpoint == beta_hat? True

The formal statement

Theorem INF.2: $(\text{CI}_{\text{lo}} + \text{CI}_{\text{hi}})/2 = \hat{\beta}$ -- confidence intervals are centered at the point estimate.

Why is this true?

The CI is constructed as $\hat{\beta} \pm c \cdot \text{SE}$. The midpoint of $[\hat{\beta} - c \cdot \text{SE},\ \hat{\beta} + c \cdot \text{SE}]$ is exactly $\hat{\beta}$. This holds for any symmetric reference distribution ($t$ or $z$) and any confidence level.

Width tells you precision

# CI width = 2 * t_crit * SE
ci_width = ci_hi - ci_lo
expected_width = 2 * t_crit * se

print("Coefficient   SE        CI width   2*t*SE")
for j in range(p):
    print(f"  beta_{j}:     {se[j]:.4f}    {ci_width[j]:.4f}     {expected_width[j]:.4f}")

print(f"\nWidth == 2*t*SE? {np.allclose(ci_width, expected_width)}")
Coefficient   SE        CI width   2*t*SE
  beta_0:     0.1944    0.7824     0.7824
  beta_1:     0.2683    1.0800     1.0800
  beta_2:     0.1942    0.7818     0.7818
  beta_3:     0.1890    0.7609     0.7609

Width == 2*t*SE? True

Wider intervals mean less precise estimates. A coefficient with a wide CI relative to its magnitude is poorly determined by the data.
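To make "wide relative to its magnitude" concrete, here is a hypothetical comparison of two coefficients with the same estimate but different standard errors (the numbers are illustrative, not from the fit above):

```python
import numpy as np
from scipy import stats

# Two hypothetical coefficients: same estimate, different precision
beta = np.array([0.42, 0.42])
se = np.array([0.05, 0.30])
t_crit = stats.t.ppf(0.975, df=46)

# CI width relative to the estimate's magnitude
rel_width = 2 * t_crit * se / np.abs(beta)
print(np.round(rel_width, 2))
```

The first coefficient's interval is a fraction of its magnitude, so it is well determined; the second's interval is several times its magnitude, so the data barely pin it down.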

Seeing it in bossanova

from bossanova.internal.maths.inference import compute_ci

t_crit = stats.t.ppf(1 - 0.05 / 2, df)
ci_lo, ci_hi = compute_ci(beta_hat, se, critical=t_crit)
for j in range(len(beta_hat)):
    print(f"  beta_{j}: {beta_hat[j]:.4f}  [{ci_lo[j]:.4f}, {ci_hi[j]:.4f}]")
  beta_0: 2.1118  [1.7206, 2.5030]
  beta_1: 0.7009  [0.1609, 1.2409]
  beta_2: -1.2111  [-1.6020, -0.8202]
  beta_3: 0.3008  [-0.0797, 0.6812]

The Wald statistic

The problem

We want to test whether a set of coefficients (a "contrast") is significantly different from zero. For a single coefficient, the $t$-statistic works. But what if we want to test multiple coefficients simultaneously?

Building intuition

For a single coefficient, the $t$-statistic is $t = \hat{\beta}_j / \text{SE}_j$. For multiple coefficients simultaneously, we need the Wald statistic, which generalizes this idea to account for correlations between estimates. The Wald statistic uses a contrast matrix $\mathbf{L}$ that picks out which linear combinations of coefficients to test.

# Single coefficient: t-statistic
t_stat = beta_hat[1] / se[1]
print(f"t-statistic for beta_1: {t_stat:.4f}")
print(f"t^2:                    {t_stat**2:.4f}")

# Same thing via Wald formula: W = (L @ beta)' @ inv(L @ V @ L') @ (L @ beta)
L_single = np.eye(p)[1:2]            # Picks out beta_1
Lb = L_single @ beta_hat             # = [beta_1]
LVL = L_single @ vcov @ L_single.T   # = Var(beta_1), a 1x1 matrix
W_single = Lb @ np.linalg.inv(LVL) @ Lb
print(f"\nWald statistic (q=1):   {W_single:.4f}")
print(f"Matches t^2?            {np.isclose(W_single, t_stat**2)}")

# Multivariate: test beta_1 and beta_2 jointly
L_multi = np.eye(p)[1:3]             # Picks out beta_1 and beta_2
Lb_multi = L_multi @ beta_hat
LVL_multi = L_multi @ vcov @ L_multi.T
W_multi = Lb_multi @ np.linalg.inv(LVL_multi) @ Lb_multi
print(f"\nWald statistic (q=2):   {W_multi:.4f}")
t-statistic for beta_1: 2.6127
t^2:                    6.8264

Wald statistic (q=1):   6.8264
Matches t^2?            True

Wald statistic (q=2):   52.8611

The formal statement

Theorem INF.4: $W = (\mathbf{L}\hat{\boldsymbol{\beta}})^\top (\mathbf{L}\mathbf{V}\mathbf{L}^\top)^{-1} (\mathbf{L}\hat{\boldsymbol{\beta}})$

What does this measure?

The Wald statistic is a quadratic form that measures how many standard errors the contrast estimates are from zero, accounting for their correlations through the variance-covariance structure. For a single contrast ($q=1$), this reduces to $(\mathbf{L}\hat{\boldsymbol{\beta}})^2 / \text{Var}(\mathbf{L}\hat{\boldsymbol{\beta}}) = t^2$. For multiple contrasts, the matrix inverse "decorrelates" the estimates before measuring their distance from zero.

# Verify: Wald with q=1 is exactly t^2
L = np.eye(p)[1:2]
Lb = L @ beta_hat
LVL = L @ vcov @ L.T
W = Lb @ np.linalg.inv(LVL) @ Lb
t2 = (beta_hat[1] / se[1]) ** 2
print(f"Wald (q=1): {W:.6f}")
print(f"t^2:        {t2:.6f}")
print(f"Match:      {np.isclose(W, t2)}")
Wald (q=1): 6.826353
t^2:        6.826353
Match:      True

Seeing it in bossanova

from bossanova.internal.maths.inference import (
    compute_wald_statistic,
    compute_contrast_variance,
)

L = np.eye(len(beta_hat))[1:2]  # Test second coefficient
contrast_values = L @ beta_hat
contrast_vcov = compute_contrast_variance(L, vcov)
W = compute_wald_statistic(contrast_values, contrast_vcov)
print(f"Wald statistic: {W:.4f}")
print(f"t^2 (should match): {(beta_hat[1] / se[1])**2:.4f}")
Wald statistic: 6.8264
t^2 (should match): 6.8264

From Wald to F-test

The problem

The Wald statistic has a chi-squared distribution asymptotically. But for finite samples (which is most real data), the chi-squared approximation can be too liberal. The $F$-distribution provides a finite-sample correction.
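"Too liberal" can be seen directly in the critical values: at the same nominal 5% level, the finite-sample $F$ threshold demands a larger Wald statistic than the asymptotic chi-squared threshold. A quick check (using the $q=2$, $\text{df}=46$ setting from this page):

```python
from scipy import stats

# 5% critical values for a joint test of q=2 contrasts with 46 residual df
q, df_resid = 2, 46
chi2_crit = stats.chi2.ppf(0.95, q)        # asymptotic threshold on W
f_crit = stats.f.ppf(0.95, q, df_resid)    # finite-sample threshold on W/q

print(f"chi2 test rejects when W > {chi2_crit:.3f}")
print(f"F test rejects when W > {q * f_crit:.3f}")  # a stricter (larger) bar
```

The $F$ bar on $W$ is higher, so some samples the chi-squared test would call significant do not pass the finite-sample test.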

Building intuition

$F = W / q$, where $q$ is the number of constraints being tested (the rank of $\mathbf{L}$). This normalization accounts for the fact that testing more constraints simultaneously will naturally produce a larger Wald statistic, so we need a higher threshold.

# q=1: F = W/1 = t^2
L1 = np.eye(p)[1:2]
Lb1 = L1 @ beta_hat
LVL1 = L1 @ vcov @ L1.T
W1 = Lb1 @ np.linalg.inv(LVL1) @ Lb1
q1 = L1.shape[0]
F1 = W1 / q1
t_sq = (beta_hat[1] / se[1]) ** 2

print(f"q=1:  W={W1:.4f},  F=W/q={F1:.4f},  t^2={t_sq:.4f}")
print(f"F == t^2? {np.isclose(F1, t_sq)}")

# q=2: test beta_1 and beta_2 jointly
L2 = np.eye(p)[1:3]
Lb2 = L2 @ beta_hat
LVL2 = L2 @ vcov @ L2.T
W2 = Lb2 @ np.linalg.inv(LVL2) @ Lb2
q2 = L2.shape[0]
F2 = W2 / q2
p_val = 1 - stats.f.cdf(F2, dfn=q2, dfd=df)

print(f"\nq=2:  W={W2:.4f},  F=W/q={F2:.4f},  p={p_val:.4f}")
q=1:  W=6.8264,  F=W/q=6.8264,  t^2=6.8264
F == t^2? True

q=2:  W=52.8611,  F=W/q=26.4305,  p=0.0000

The formal statement

Theorem INF.5: $F = W / q$ where $q = \text{rank}(\mathbf{L})$.

Distribution theory

The Wald statistic satisfies $W \sim \chi^2(q)$ asymptotically. Dividing by $q$ gives $F \sim F(q, \text{df}_{\text{resid}})$ for finite samples. The $F$-distribution accounts for the estimation uncertainty in $\sigma^2$. For large residual degrees of freedom, $F \approx \chi^2(q) / q$ because the $F$-distribution converges to the chi-squared distribution (scaled by $1/q$) as the denominator df goes to infinity.
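The convergence is easy to check numerically: as the denominator degrees of freedom grow, the $F$ critical value approaches $\chi^2_{\text{crit}}(q) / q$ (shown here for $q = 3$):

```python
from scipy import stats

# As denominator df grows, the 95% F critical value approaches chi2_crit / q
q = 3
chi2_over_q = stats.chi2.ppf(0.95, q) / q
for dfd in [10, 50, 200, 5000]:
    f_crit = stats.f.ppf(0.95, q, dfd)
    print(f"dfd={dfd:<5}  F_crit={f_crit:.4f}  chi2_crit/q={chi2_over_q:.4f}")
```

The gap shrinks monotonically, mirroring the $t$-versus-$z$ convergence shown in the next subsection.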

t vs z: when do they converge?

# For large df, the t-distribution is nearly identical to the normal
print("  df     t_crit(95%)   z_crit(95%)   difference")
print("  " + "-" * 50)
z_crit = stats.norm.ppf(0.975)
for test_df in [5, 10, 30, 60, 120, 500, 10000]:
    t_crit_val = stats.t.ppf(0.975, test_df)
    diff = t_crit_val - z_crit
    print(f"  {test_df:<6} {t_crit_val:.6f}      {z_crit:.6f}      {diff:+.6f}")
  df     t_crit(95%)   z_crit(95%)   difference
  --------------------------------------------------
  5      2.570582      1.959964      +0.610618
  10     2.228139      1.959964      +0.268175
  30     2.042272      1.959964      +0.082308
  60     2.000298      1.959964      +0.040334
  120    1.979930      1.959964      +0.019966
  500    1.964720      1.959964      +0.004756
  10000  1.960201      1.959964      +0.000237

For degrees of freedom above about 100, the $t$ and $z$ critical values are nearly identical. This is why large-sample methods (which use the normal distribution) and finite-sample methods (which use the $t$-distribution) give very similar results in practice when $n$ is large.


Summary

| Theorem | Statement | What it means |
| --- | --- | --- |
| INF.1 | $\text{SE} = \sqrt{\text{diag}(\mathbf{V})}$ | Standard errors from variance-covariance |
| INF.2 | $\text{Midpoint}(\text{CI}) = \hat{\beta}$ | Confidence intervals are symmetric |
| INF.4 | $W = (\mathbf{L}\hat{\boldsymbol{\beta}})^\top (\mathbf{L}\mathbf{V}\mathbf{L}^\top)^{-1} (\mathbf{L}\hat{\boldsymbol{\beta}})$ | Wald statistic for contrasts |
| INF.5 | $F = W/q$ | F-test from Wald statistic |
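The whole pipeline can be reproduced end to end with plain NumPy/SciPy. This is a standalone sketch of the steps above, not bossanova's API; the data are re-simulated with a fixed seed, so the numbers differ from the outputs shown earlier:

```python
import numpy as np
from scipy import stats

# Standalone sketch of INF.1-INF.5 (plain NumPy/SciPy, not bossanova's API)
rng = np.random.default_rng(42)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X @ np.array([2.0, 0.5, -1.0, 0.3]) + rng.standard_normal(n) * 1.5

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
df = n - p
sigma2 = resid @ resid / df
vcov = sigma2 * np.linalg.inv(X.T @ X)

se = np.sqrt(np.diag(vcov))                                     # INF.1
t_crit = stats.t.ppf(0.975, df)
ci_lo, ci_hi = beta_hat - t_crit * se, beta_hat + t_crit * se   # INF.2

L = np.eye(p)[1:3]                                  # jointly test beta_1, beta_2
Lb = L @ beta_hat
W = Lb @ np.linalg.solve(L @ vcov @ L.T, Lb)        # INF.4
q = L.shape[0]
F = W / q                                           # INF.5
p_val = stats.f.sf(F, q, df)
print(f"F({q}, {df}) = {F:.3f}, p = {p_val:.4g}")
```

Using `np.linalg.solve` instead of forming the inverse of $\mathbf{L}\mathbf{V}\mathbf{L}^\top$ explicitly is the numerically preferred way to evaluate the quadratic form.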