Traditional inference assumes normality and relies on asymptotic theory --- as the sample size grows, test statistics converge to known distributions (Normal, t, F) and we read p-values off the theoretical curve. This works well when the assumptions hold, but what if your residuals are skewed, your sample is small, or you’re testing something for which no clean asymptotic result exists? Resampling methods sidestep these problems entirely.
Resampling methods --- permutation tests, bootstrapping, and cross-validation --- let the data speak for themselves. They’re conceptually simple (shuffle or resample your data, refit the model, see what happens) but have subtle mathematical properties that determine when they’re valid, how many resamples you need, and how to extract correct p-values and confidence intervals. This page covers the key theorems that ensure these methods are valid.
Permutation tests¶
The problem¶
You want to test whether an observed effect is “real” or could have arisen by chance. Classical tests answer this by assuming a parametric null distribution. Permutation tests take a different route: they repeatedly shuffle the relationship between predictors and outcome, creating a null distribution directly from the data.
Building intuition¶
If there’s no real effect, then the labels (which observations belong to which group) are arbitrary. Shuffling them shouldn’t change the test statistic much. If there IS an effect, shuffling will usually produce smaller statistics. The permutation distribution shows you what “no effect” actually looks like for your specific data.
import numpy as np

# Two groups: treatment vs control
n_a, n_b = 25, 25
group_a = np.random.normal(5.0, 2.0, n_a)  # treatment
group_b = np.random.normal(4.0, 2.0, n_b)  # control
all_data = np.concatenate([group_a, group_b])
# Observed statistic
obs_diff = group_a.mean() - group_b.mean()
print(f"Observed difference: {obs_diff:.3f}")
# Permutation null distribution
n_perms = 999
perm_diffs = np.empty(n_perms)
for i in range(n_perms):
    shuffled = np.random.permutation(all_data)
    perm_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

Observed difference: 1.248
Permutation p-values¶
Theorem BS.1: The permutation p-value is

p = (1 + #{b : |T_b| ≥ |T_obs|}) / (B + 1)

where T_obs is the observed statistic, T_b is the statistic from the b-th of B permutations, and #{·} counts the permuted statistics at least as extreme as the observed one.
# Correct p-value formula (BS.1)
count_extreme = np.sum(np.abs(perm_diffs) >= np.abs(obs_diff))
p_value = (1 + count_extreme) / (n_perms + 1)
print(f"Permutations as extreme: {count_extreme}/{n_perms}")
print(f"Permutation p-value: {p_value:.4f}")

Permutations as extreme: 24/999
Permutation p-value: 0.0250
Why the +1 in the numerator and denominator?
The +1 in the numerator includes the observed statistic itself --- it’s one of the possible permutations. The original data arrangement is just as valid a “permutation” as any shuffled version. The +1 in the denominator normalizes by the total number of values (B permutations + 1 observed). This ensures the p-value is never exactly 0, which would be a false claim of certainty. With B=999 permutations, the smallest possible p-value is 1/1000 = 0.001.
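The recipe above generalizes to any statistic. Here is a small reusable helper (a sketch; the function name and signature are ours, not from any library):

```python
import numpy as np

def permutation_pvalue(group_a, group_b, n_perms=999, rng=None):
    """Two-sided permutation p-value for a difference in means (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    obs = group_a.mean() - group_b.mean()
    count = 0
    for _ in range(n_perms):
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
        if abs(diff) >= abs(obs):
            count += 1
    return (1 + count) / (n_perms + 1)  # Theorem BS.1: never exactly 0

rng = np.random.default_rng(0)
a = rng.normal(5.0, 2.0, 25)
b = rng.normal(4.0, 2.0, 25)
p = permutation_pvalue(a, b, rng=rng)
print(f"p = {p:.4f}")
```

With B = 999 the returned value is bounded below by 1/1000, exactly as the theorem guarantees.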
Bootstrap¶
The problem¶
You want confidence intervals or standard errors but don’t want to assume normality. The bootstrap approximates the sampling distribution by resampling from your data. It’s especially useful when the quantity of interest is complex (a ratio, a median, a difference of quantiles) and no analytical formula for the standard error exists.
Building intuition¶
If your sample is a good representation of the population, then resampling from your sample mimics drawing new samples from the population. Each bootstrap sample is drawn with replacement (some observations appear multiple times, others are left out). The variability across bootstrap replicates approximates the variability you’d see across real repeated experiments.
Bootstrap indices¶
Theorem BS.2: Bootstrap samples are drawn with replacement: each index is independently sampled from {0, 1, ..., n-1} with equal probability.
n = 50
original_data = np.random.normal(10, 2, n)
# One bootstrap sample
boot_indices = np.random.choice(n, size=n, replace=True)
boot_sample = original_data[boot_indices]
# Properties of bootstrap sampling
unique_count = len(np.unique(boot_indices))
print(f"Sample size: {n}")
print(f"Unique observations in bootstrap: {unique_count}")
print(f"Fraction included: {unique_count/n:.1%}")
print(f"Expected fraction: ~{1 - np.exp(-1):.1%} (≈ 63.2%)")

Sample size: 50
Unique observations in bootstrap: 33
Fraction included: 66.0%
Expected fraction: ~63.2% (≈ 63.2%)
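The 63.2% figure follows from a short calculation: each of the n draws misses a given index with probability 1 − 1/n, so the chance an index never appears is (1 − 1/n)ⁿ, which converges to e⁻¹ ≈ 0.368 as n grows. A quick numerical check:

```python
import numpy as np

n = 50
p_excluded = (1 - 1 / n) ** n   # probability a given index is never drawn
print(f"Exact exclusion probability (n=50): {p_excluded:.4f}")
print(f"Limit e^-1:                         {np.exp(-1):.4f}")
print(f"Expected fraction included:         {1 - p_excluded:.4f}")
```

Even at n = 50 the exact value is within a percentage point of the limit, which is why "≈ 63.2% unique" is a reliable rule of thumb.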
Bootstrap distribution¶
n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    idx = np.random.choice(n, size=n, replace=True)
    boot_means[i] = original_data[idx].mean()
# Bootstrap standard error
boot_se = boot_means.std()
# Analytical SE for comparison
analytical_se = original_data.std() / np.sqrt(n)
print(f"Sample mean: {original_data.mean():.4f}")
print(f"Bootstrap SE: {boot_se:.4f}")
print(f"Analytical SE: {analytical_se:.4f}")
# Bootstrap 95% CI (percentile method)
ci_lo = np.percentile(boot_means, 2.5)
ci_hi = np.percentile(boot_means, 97.5)
print(f"Bootstrap 95% CI: [{ci_lo:.4f}, {ci_hi:.4f}]")

Sample mean: 9.9686
Bootstrap SE: 0.3082
Analytical SE: 0.3072
Bootstrap 95% CI: [9.3893, 10.5993]
BCa confidence intervals¶
The problem¶
The percentile bootstrap CI is simple but only first-order accurate --- like the CLT, its coverage error shrinks as O(1/√n). For skewed estimators or small samples, percentile intervals can be noticeably off-center. You could use more bootstrap samples, but that doesn’t fix the underlying bias.
Building intuition¶
BCa (bias-corrected and accelerated) intervals fix two problems with the percentile method:
Bias correction (z₀): If the bootstrap distribution isn’t centered on the observed statistic, the percentile quantiles are off. z₀ measures this shift and adjusts the quantiles accordingly. If z₀ = 0, there’s no bias and no adjustment.
Acceleration (a): If the standard error of the estimator changes with the parameter value (i.e., the bootstrap distribution is skewed), symmetric percentiles give asymmetric coverage. The acceleration constant a captures this skewness from jackknife estimates. If a = 0, the estimator’s variability is constant and no skewness adjustment is needed.
Together, these corrections make BCa second-order accurate --- coverage error shrinks as O(1/n), a substantial improvement.
The BCa formula¶
The bias correction is computed from the proportion of bootstrap replicates below the observed value:

z₀ = Φ⁻¹( #{θ̂*_b < θ̂} / B )

The acceleration is estimated from the jackknife:

a = Σᵢ dᵢ³ / ( 6 (Σᵢ dᵢ²)^{3/2} ),  where dᵢ = θ̄₍·₎ − θ̂₍ᵢ₎

with θ̂₍ᵢ₎ the estimate computed with observation i left out and θ̄₍·₎ the mean of the jackknife estimates. These adjust the percentile endpoints from the naive (α/2, 1 − α/2) to:

α₁ = Φ( z₀ + (z₀ + z_{α/2}) / (1 − a(z₀ + z_{α/2})) ),  α₂ = Φ( z₀ + (z₀ + z_{1−α/2}) / (1 − a(z₀ + z_{1−α/2})) )
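Putting the pieces together, a full BCa interval for the mean can be sketched as follows (an illustrative implementation, not necessarily what any particular library does internally):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(10, 2, 50)
theta_hat = data.mean()
n, B, alpha = len(data), 2000, 0.05

# Bootstrap replicates of the statistic
boot = np.array([rng.choice(data, size=n, replace=True).mean() for _ in range(B)])

# Bias correction z0: where theta_hat falls within the bootstrap distribution
z0 = norm.ppf(np.mean(boot < theta_hat))

# Acceleration a from the leave-one-out jackknife
jack = np.array([np.delete(data, i).mean() for i in range(n)])
d = jack.mean() - jack
a = np.sum(d**3) / (6 * np.sum(d**2) ** 1.5)

# Adjusted percentile endpoints
def adjust(z_alpha):
    return norm.cdf(z0 + (z0 + z_alpha) / (1 - a * (z0 + z_alpha)))

alpha1 = adjust(norm.ppf(alpha / 2))
alpha2 = adjust(norm.ppf(1 - alpha / 2))
ci_lo, ci_hi = np.percentile(boot, [100 * alpha1, 100 * alpha2])
print(f"BCa 95% CI: [{ci_lo:.4f}, {ci_hi:.4f}]")
```

For the mean of near-normal data, z₀ and a both land close to zero, so the interval ends up close to the plain percentile one --- which is exactly the degeneration property discussed next.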
Degeneration property¶
Theorem BS.5: When z₀ = 0 and a = 0, BCa intervals are identical to percentile intervals.
from scipy import stats
# BCa adjusted percentile with z0=0, a=0
z0, a = 0.0, 0.0
alpha = 0.05
z_lo = stats.norm.ppf(alpha / 2) # -1.96
z_hi = stats.norm.ppf(1 - alpha / 2) # +1.96
# BCa formula: z0 + (z0 + z_alpha) / (1 - a*(z0 + z_alpha))
# With z0=0, a=0: 0 + z_alpha / 1 = z_alpha
alpha1 = stats.norm.cdf(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
alpha2 = stats.norm.cdf(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))
print(f"BCa percentiles: ({alpha1:.4f}, {alpha2:.4f})")
print(f"Standard percentiles: ({alpha/2:.4f}, {1-alpha/2:.4f})")
print(f"Identical: {abs(alpha1 - alpha/2) < 1e-10 and abs(alpha2 - (1-alpha/2)) < 1e-10}")

BCa percentiles: (0.0250, 0.9750)
Standard percentiles: (0.0250, 0.9750)
Identical: True
This is a useful sanity check: BCa should never make things worse when there’s nothing to correct.
Acceleration and skewness¶
Theorem BS.6: The acceleration constant a is zero when jackknife estimates are symmetric around their mean, and nonzero when they are skewed. The sign follows the influence function: since a = Σᵢ dᵢ³ / (6 (Σᵢ dᵢ²)^{3/2}) with dᵢ = θ̄₍·₎ − θ̂₍ᵢ₎, a high outlier in the jackknife values produces a large negative dᵢ whose cube dominates, giving a < 0; a low outlier gives a > 0.
# Symmetric jackknife → a = 0
jack_symmetric = np.array([1, 3, 5, 7, 9], dtype=float)
jack_mean = jack_symmetric.mean() # 5.0
diff = jack_mean - jack_symmetric # [4, 2, 0, -2, -4]
a_symmetric = np.sum(diff**3) / (6 * np.sum(diff**2)**1.5)
print(f"Symmetric jackknife: a = {a_symmetric:.6f}")
# High outlier in jackknife → large negative diff dominates → a < 0
jack_high = np.array([1, 1, 1, 1, 9], dtype=float)
jack_mean_h = jack_high.mean()
diff_h = jack_mean_h - jack_high
a_high = np.sum(diff_h**3) / (6 * np.sum(diff_h**2)**1.5)
print(f"High-outlier jackknife: a = {a_high:.6f}")
# Low outlier in jackknife → large positive diff dominates → a > 0
jack_low = np.array([1, 9, 9, 9, 9], dtype=float)
jack_mean_l = jack_low.mean()
diff_l = jack_mean_l - jack_low
a_low = np.sum(diff_l**3) / (6 * np.sum(diff_l**2)**1.5)
print(f"Low-outlier jackknife: a = {a_low:.6f}")

Symmetric jackknife: a = 0.000000
High-outlier jackknife: a = -0.111803
Low-outlier jackknife: a = 0.111803
Mixed models: cluster jackknife¶
For mixed models, the jackknife unit must match the data structure. Dropping individual observations from clustered data violates the independence assumed by the acceleration formula. Instead, bossanova uses a cluster jackknife --- leave one group out at a time --- which respects the hierarchical structure and produces valid acceleration estimates.
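A leave-one-group-out jackknife can be sketched in a few lines (illustrative only --- the group structure and statistic here are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
# Clustered data: 6 groups of 10 observations, with group-level random effects
group_effects = rng.normal(0, 1, 6)
groups = np.repeat(np.arange(6), 10)
y = 5 + group_effects[groups] + rng.normal(0, 0.5, 60)

# Cluster jackknife: drop one whole group at a time, not one observation
jack = np.array([y[groups != g].mean() for g in range(6)])

# Same acceleration formula as before, applied to group-level jackknife values
d = jack.mean() - jack
a = np.sum(d**3) / (6 * np.sum(d**2) ** 1.5)
print(f"Cluster-jackknife acceleration: a = {a:.4f}")
```

The key point is the unit of deletion: with 6 groups there are 6 jackknife values, not 60, so the acceleration reflects between-group influence rather than within-group noise.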
K-fold cross-validation¶
The problem¶
How well does your model predict new data? Training error is optimistic because the model has seen the data. Cross-validation holds out parts of the data to get honest estimates of prediction performance.
Building intuition¶
K-fold CV splits data into K equal-sized folds. For each fold, train on K-1 folds and test on the held-out fold. Every observation gets tested exactly once. The average test error across folds gives you an honest estimate of out-of-sample performance.
K-fold coverage¶
Theorem BS.3: K-fold cross-validation produces K disjoint folds whose union covers all observations exactly once, with each fold having ⌊n/K⌋ or ⌈n/K⌉ observations.
n = 23 # not evenly divisible by K
K = 5
# Generate fold assignments
indices = np.random.permutation(n)
folds = np.array_split(indices, K)
print(f"n={n}, K={K}")
print(f"Fold sizes: {[len(f) for f in folds]}")
# Verify coverage: every index appears exactly once
all_indices = np.sort(np.concatenate(folds))
print(f"All indices covered: {np.array_equal(all_indices, np.arange(n))}")
# Verify disjointness
for i in range(K):
    for j in range(i + 1, K):
        overlap = np.intersect1d(folds[i], folds[j])
        assert len(overlap) == 0, f"Folds {i} and {j} overlap!"
print("All folds disjoint: True")

n=23, K=5
Fold sizes: [5, 5, 5, 4, 4]
All indices covered: True
All folds disjoint: True
Why does “exactly once” matter?
The “exactly once” property is what makes K-fold CV unbiased. Each observation contributes to exactly one test set, so the average test error is an unbiased estimate of out-of-sample performance. The ⌊n/K⌋ vs ⌈n/K⌉ distinction handles the case when n isn’t divisible by K --- some folds get one extra observation, but this asymmetry is negligible for reasonable K.
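The full loop for a simple linear model looks like this (a sketch in plain NumPy, reusing the fold-assignment idea from above):

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 100, 5
x = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
y = 3 + 2 * x + rng.normal(0, 1, n)

# Assign folds by shuffling indices, then train on K-1 folds and test on the rest
folds = np.array_split(rng.permutation(n), K)
fold_mse = []
for test_idx in folds:
    train = np.setdiff1d(np.arange(n), test_idx)
    beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
    resid = y[test_idx] - X[test_idx] @ beta
    fold_mse.append(np.mean(resid**2))

cv_mse = np.mean(fold_mse)
print(f"5-fold CV MSE: {cv_mse:.4f}")  # should land near the noise variance (1.0)
```

Because every observation is tested exactly once, the average over folds is an honest estimate of the out-of-sample error.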
The hat matrix trick¶
The problem¶
Leave-one-out cross-validation (LOOCV) requires fitting n separate models --- one for each left-out observation. For OLS, this seems expensive: n matrix inversions for a dataset of size n.
The shortcut¶
For linear regression, we can compute the LOO prediction error without refitting! The residual for observation i when it’s left out is:

e₍ᵢ₎ = eᵢ / (1 − hᵢᵢ)

where eᵢ is the ordinary residual and hᵢᵢ is the leverage (the i-th diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ).

Theorem BS.4: For OLS, the leave-one-out residual equals the ordinary residual divided by 1 − hᵢᵢ.
# Fit full model
n = 30
x = np.random.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
y = 3 + 2 * x + np.random.normal(0, 1, n)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
# Hat matrix leverages
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
# Method 1: Hat matrix trick (instant)
loo_resid_trick = e / (1 - h)
# Method 2: Actually refit n times (slow)
loo_resid_actual = np.empty(n)
for i in range(n):
    mask = np.ones(n, dtype=bool)
    mask[i] = False
    beta_loo = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    loo_resid_actual[i] = y[i] - X[i] @ beta_loo
# They match!
print(f"Max difference: {np.max(np.abs(loo_resid_trick - loo_resid_actual)):.2e}")
# PRESS statistic (prediction error sum of squares)
PRESS = np.sum(loo_resid_trick**2)
RSS = np.sum(e**2)
print(f"\nRSS (training error): {RSS:.4f}")
print(f"PRESS (LOO CV error): {PRESS:.4f}")
print(f"PRESS > RSS: {PRESS > RSS} (always true — honest error is larger)")

Max difference: 6.61e-15
RSS (training error): 28.6086
PRESS (LOO CV error): 32.8383
PRESS > RSS: True (always true — honest error is larger)
Why does the hat matrix trick work?
The key insight is that the hat matrix tells you how much each observation influences its own prediction. An observation with leverage hᵢᵢ close to 1 strongly determines its own fit --- removing it would change the prediction a lot. The formula eᵢ / (1 − hᵢᵢ) inflates the residual by exactly the right amount to account for this self-influence, giving the same result as actually refitting without observation i. This turns n separate model fits into a single fit --- a factor of n speedup for free.
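One practical payoff: because PRESS is cheap, it makes a handy model-selection criterion. A sketch comparing a linear and a quadratic design on the same data (the helper name press is ours):

```python
import numpy as np

def press(X, y):
    """PRESS via the hat-matrix trick: sum of squared LOO residuals."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y                      # ordinary residuals
    return np.sum((e / (1 - np.diag(H)))**2)

rng = np.random.default_rng(3)
n = 60
x = rng.normal(0, 1, n)
y = 3 + 2 * x + rng.normal(0, 1, n)    # the true relationship is linear

X_lin = np.column_stack([np.ones(n), x])
X_quad = np.column_stack([np.ones(n), x, x**2])

print(f"PRESS linear:    {press(X_lin, y):.2f}")
print(f"PRESS quadratic: {press(X_quad, y):.2f}")
```

Unlike RSS, which can only decrease as you add terms, PRESS penalizes the extra flexibility of the quadratic term through its higher leverages, so it tends to favor the correct (linear) model here.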
Summary¶
| Theorem | Statement | What it means |
|---|---|---|
| BS.1 | Permutation p-value p = (1 + count) / (B + 1) | The +1 means p is never exactly 0 |
| BS.2 | With-replacement sampling | Bootstrap samples ~63.2% unique |
| BS.3 | K-fold: complete, disjoint, balanced | Every observation tested exactly once |
| BS.4 | LOO residual e₍ᵢ₎ = eᵢ/(1 − hᵢᵢ) | Fast cross-validation via hat matrix |
| BS.5 | BCa → percentile when z₀ = 0, a = 0 | BCa never distorts unbiased estimators |
| BS.6 | a = 0 for symmetric jackknife | Acceleration captures influence skewness |