
Resampling Methods

UCSD Psychology

Traditional inference assumes normality and relies on asymptotic theory --- as the sample size grows, test statistics converge to known distributions (Normal, t, F, $\chi^2$) and we read p-values off the theoretical curve. This works well when the assumptions hold, but what if your residuals are skewed, your sample is small, or you’re testing something for which no clean asymptotic result exists? Resampling methods sidestep these problems entirely.

Resampling methods --- permutation tests, bootstrapping, and cross-validation --- let the data speak for themselves. They’re conceptually simple (shuffle or resample your data, refit the model, see what happens) but have subtle mathematical properties that determine when they’re valid, how many resamples you need, and how to extract correct p-values and confidence intervals. This page covers the key theorems that ensure these methods are valid.


Permutation tests

The problem

You want to test whether an observed effect is “real” or could have arisen by chance. Classical tests answer this by assuming a parametric null distribution. Permutation tests take a different route: they repeatedly shuffle the relationship between predictors and outcome, creating a null distribution directly from the data.

Building intuition

If there’s no real effect, then the labels (which observations belong to which group) are arbitrary. Shuffling them shouldn’t change the test statistic much. If there IS an effect, shuffling will usually produce smaller statistics. The permutation distribution shows you what “no effect” actually looks like for your specific data.

import numpy as np

# Two groups: treatment vs control
n_a, n_b = 25, 25
group_a = np.random.normal(5.0, 2.0, n_a)  # treatment
group_b = np.random.normal(4.0, 2.0, n_b)  # control
all_data = np.concatenate([group_a, group_b])

# Observed statistic
obs_diff = group_a.mean() - group_b.mean()
print(f"Observed difference: {obs_diff:.3f}")

# Permutation null distribution
n_perms = 999
perm_diffs = np.empty(n_perms)
for i in range(n_perms):
    shuffled = np.random.permutation(all_data)
    perm_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()
Observed difference: 1.248

Permutation p-values

Theorem BS.1: The permutation p-value is $p = (1 + \text{count}(|T_{\text{perm}}| \geq |T_{\text{obs}}|)) / (B + 1)$

# Correct p-value formula (BS.1)
count_extreme = np.sum(np.abs(perm_diffs) >= np.abs(obs_diff))
p_value = (1 + count_extreme) / (n_perms + 1)

print(f"Permutations as extreme: {count_extreme}/{n_perms}")
print(f"Permutation p-value:     {p_value:.4f}")
Permutations as extreme: 24/999
Permutation p-value:     0.0250
Why the +1 in the numerator and denominator?

The +1 in the numerator includes the observed statistic itself --- it’s one of the possible permutations. The original data arrangement is just as valid a “permutation” as any shuffled version. The +1 in the denominator normalizes by the total number of values (B permutations + 1 observed). This ensures the p-value is never exactly 0, which would be a false claim of certainty. With B = 999 permutations, the smallest possible p-value is 1/1000 = 0.001.
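One consequence of Theorem BS.1 worth making explicit: the number of permutations caps the resolution of your p-value, since nothing smaller than $1/(B+1)$ is attainable. A minimal check:

```python
# Theorem BS.1 implies a floor on the attainable p-value: even when zero
# permutations are as extreme as the observed statistic, p = 1 / (B + 1).
for B in (99, 999, 9999):
    p_min = (1 + 0) / (B + 1)
    print(f"B = {B:>4}: smallest possible p = {p_min}")
```

So if you need to report p-values below 0.001, you need more than 999 permutations.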


Bootstrap

The problem

You want confidence intervals or standard errors but don’t want to assume normality. The bootstrap approximates the sampling distribution by resampling from your data. It’s especially useful when the quantity of interest is complex (a ratio, a median, a difference of quantiles) and no analytical formula for the standard error exists.

Building intuition

If your sample is a good representation of the population, then resampling from your sample mimics drawing new samples from the population. Each bootstrap sample is drawn with replacement (some observations appear multiple times, others are left out). The variability across bootstrap replicates approximates the variability you’d see across real repeated experiments.

Bootstrap indices

Theorem BS.2: Bootstrap samples are drawn with replacement: each index is independently sampled from {0, 1, ..., n-1} with equal probability.

n = 50
original_data = np.random.normal(10, 2, n)

# One bootstrap sample
boot_indices = np.random.choice(n, size=n, replace=True)
boot_sample = original_data[boot_indices]

# Properties of bootstrap sampling
unique_count = len(np.unique(boot_indices))
print(f"Sample size: {n}")
print(f"Unique observations in bootstrap: {unique_count}")
print(f"Fraction included: {unique_count/n:.1%}")
print(f"Expected fraction: ~{1 - np.exp(-1):.1%}")
Sample size: 50
Unique observations in bootstrap: 33
Fraction included: 66.0%
Expected fraction: ~63.2%

Bootstrap distribution

n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    idx = np.random.choice(n, size=n, replace=True)
    boot_means[i] = original_data[idx].mean()

# Bootstrap standard error
boot_se = boot_means.std()
# Analytical SE for comparison
analytical_se = original_data.std() / np.sqrt(n)

print(f"Sample mean:      {original_data.mean():.4f}")
print(f"Bootstrap SE:     {boot_se:.4f}")
print(f"Analytical SE:    {analytical_se:.4f}")

# Bootstrap 95% CI (percentile method)
ci_lo = np.percentile(boot_means, 2.5)
ci_hi = np.percentile(boot_means, 97.5)
print(f"Bootstrap 95% CI: [{ci_lo:.4f}, {ci_hi:.4f}]")
Sample mean:      9.9686
Bootstrap SE:     0.3082
Analytical SE:    0.3072
Bootstrap 95% CI: [9.3893, 10.5993]

BCa confidence intervals

The problem

The percentile bootstrap CI is simple but only first-order accurate --- like the CLT, its coverage error shrinks as $O(1/\sqrt{n})$. For skewed estimators or small samples, percentile intervals can be noticeably off-center. You could use more bootstrap samples, but that doesn’t fix the underlying bias.

Building intuition

BCa (bias-corrected and accelerated) intervals fix two problems with the percentile method:

  1. Bias correction ($z_0$): If the bootstrap distribution isn’t centered on the observed statistic, the percentile quantiles are off. $z_0$ measures this shift and adjusts the quantiles accordingly. If $z_0 = 0$, there’s no bias and no adjustment.

  2. Acceleration ($a$): If the standard error of the estimator changes with the parameter value (i.e., the bootstrap distribution is skewed), symmetric percentiles give asymmetric coverage. The acceleration constant $a$ captures this skewness from jackknife estimates. If $a = 0$, the estimator’s variability is constant and no skewness adjustment is needed.

Together, these corrections make BCa second-order accurate --- coverage error shrinks as $O(1/n)$, a substantial improvement.

The BCa formula

The bias correction is computed from the proportion of bootstrap replicates below the observed value:

$$z_0 = \Phi^{-1}\left(\frac{\#\{\hat{\theta}^* < \hat{\theta}\}}{B}\right)$$

The acceleration is estimated from the jackknife:

$$a = \frac{\sum_{i=1}^{n}\left(\bar{\theta}_{(\cdot)} - \hat{\theta}_{(-i)}\right)^3}{6\left[\sum_{i=1}^{n}\left(\bar{\theta}_{(\cdot)} - \hat{\theta}_{(-i)}\right)^2\right]^{3/2}}$$

These adjust the percentile endpoints from the naive $(\alpha/2,\ 1-\alpha/2)$ to:

$$\alpha_1 = \Phi\!\left(z_0 + \frac{z_0 + z_{\alpha/2}}{1 - a(z_0 + z_{\alpha/2})}\right), \quad \alpha_2 = \Phi\!\left(z_0 + \frac{z_0 + z_{1-\alpha/2}}{1 - a(z_0 + z_{1-\alpha/2})}\right)$$
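Putting the pieces together, here is a minimal end-to-end sketch of a BCa interval for a sample mean on skewed data. The simulated data and variable names are illustrative; in practice a library routine such as `scipy.stats.bootstrap` (which uses BCa by default) handles the bookkeeping.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=40)   # deliberately skewed
theta_hat = data.mean()
n, B, alpha = len(data), 2000, 0.05

# Bootstrap replicates of the statistic
boot = np.array([rng.choice(data, size=n, replace=True).mean() for _ in range(B)])

# Bias correction z0: inverse-normal of the fraction of replicates below theta_hat
z0 = stats.norm.ppf(np.mean(boot < theta_hat))

# Acceleration a from the leave-one-out jackknife
jack = np.array([np.delete(data, i).mean() for i in range(n)])
d = jack.mean() - jack
a = np.sum(d**3) / (6 * np.sum(d**2) ** 1.5)

# Adjusted percentile endpoints (the alpha_1, alpha_2 formulas above)
z_lo, z_hi = stats.norm.ppf(alpha / 2), stats.norm.ppf(1 - alpha / 2)
a1 = stats.norm.cdf(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
a2 = stats.norm.cdf(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))

ci = np.percentile(boot, [100 * a1, 100 * a2])
print(f"BCa 95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]  (z0 = {z0:.3f}, a = {a:.3f})")
```

Because exponential data are right-skewed, both $z_0$ and $a$ come out nonzero and the interval shifts relative to the naive percentile CI.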

Degeneration property

Theorem BS.5: When $z_0 = 0$ and $a = 0$, BCa intervals are identical to percentile intervals.

from scipy import stats

# BCa adjusted percentile with z0=0, a=0
z0, a = 0.0, 0.0
alpha = 0.05

z_lo = stats.norm.ppf(alpha / 2)   # -1.96
z_hi = stats.norm.ppf(1 - alpha / 2)  # +1.96

# BCa formula: z0 + (z0 + z_alpha) / (1 - a*(z0 + z_alpha))
# With z0=0, a=0: 0 + z_alpha / 1 = z_alpha
alpha1 = stats.norm.cdf(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
alpha2 = stats.norm.cdf(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))

print(f"BCa percentiles: ({alpha1:.4f}, {alpha2:.4f})")
print(f"Standard percentiles: ({alpha/2:.4f}, {1-alpha/2:.4f})")
print(f"Identical: {abs(alpha1 - alpha/2) < 1e-10 and abs(alpha2 - (1-alpha/2)) < 1e-10}")
BCa percentiles: (0.0250, 0.9750)
Standard percentiles: (0.0250, 0.9750)
Identical: True

This is a useful sanity check: BCa should never make things worse when there’s nothing to correct.

Acceleration and skewness

Theorem BS.6: The acceleration constant $a$ is zero when jackknife estimates are symmetric around their mean, and nonzero when they are skewed. The sign follows the influence function: since $d_i = \bar{\theta} - \hat{\theta}_{(-i)}$, a high outlier in the jackknife values produces a large negative $d_i$ whose cube dominates, giving $a < 0$; a low outlier gives $a > 0$.

# Symmetric jackknife → a = 0
jack_symmetric = np.array([1, 3, 5, 7, 9], dtype=float)
jack_mean = jack_symmetric.mean()  # 5.0
diff = jack_mean - jack_symmetric   # [4, 2, 0, -2, -4]

a_symmetric = np.sum(diff**3) / (6 * np.sum(diff**2)**1.5)
print(f"Symmetric jackknife: a = {a_symmetric:.6f}")

# High outlier in jackknife → large negative diff dominates → a < 0
jack_high = np.array([1, 1, 1, 1, 9], dtype=float)
jack_mean_h = jack_high.mean()
diff_h = jack_mean_h - jack_high
a_high = np.sum(diff_h**3) / (6 * np.sum(diff_h**2)**1.5)
print(f"High-outlier jackknife: a = {a_high:.6f}")

# Low outlier in jackknife → large positive diff dominates → a > 0
jack_low = np.array([1, 9, 9, 9, 9], dtype=float)
jack_mean_l = jack_low.mean()
diff_l = jack_mean_l - jack_low
a_low = np.sum(diff_l**3) / (6 * np.sum(diff_l**2)**1.5)
print(f"Low-outlier jackknife: a = {a_low:.6f}")
Symmetric jackknife: a = 0.000000
High-outlier jackknife: a = -0.111803
Low-outlier jackknife: a = 0.111803

Mixed models: cluster jackknife

For mixed models, the jackknife unit must match the data structure. Dropping individual observations from clustered data violates the independence assumed by the acceleration formula. Instead, bossanova uses a cluster jackknife --- leave one group out at a time --- which respects the hierarchical structure and produces valid acceleration estimates.
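Setting bossanova’s internals aside, the leave-one-cluster-out idea can be sketched generically. The simulated data, group structure, and estimator here are illustrative, not bossanova’s actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative clustered data: 10 groups of 8 observations, with a random
# group-level intercept so observations within a group are correlated.
groups = np.repeat(np.arange(10), 8)
y = rng.normal(5, 1, size=80) + rng.normal(0, 0.5, size=10)[groups]

# Cluster jackknife: drop one entire group at a time and re-estimate,
# so each jackknife replicate removes a full block of dependent data.
group_ids = np.unique(groups)
jack = np.array([y[groups != g].mean() for g in group_ids])

# Acceleration from the cluster-level jackknife estimates (Theorem BS.6)
d = jack.mean() - jack
a = np.sum(d**3) / (6 * np.sum(d**2) ** 1.5)
print(f"Cluster-jackknife estimates: {len(jack)} (one per group)")
print(f"Acceleration: a = {a:.4f}")
```

The key point is that the jackknife loop runs over groups, not rows, so the independence assumption behind the acceleration formula applies at the level where it actually holds.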


K-fold cross-validation

The problem

How well does your model predict new data? Training error is optimistic because the model has seen the data. Cross-validation holds out parts of the data to get honest estimates of prediction performance.

Building intuition

K-fold CV splits data into K equal-sized folds. For each fold, train on K-1 folds and test on the held-out fold. Every observation gets tested exactly once. The average test error across folds gives you an honest estimate of out-of-sample performance.

K-fold coverage

Theorem BS.3: K-fold cross-validation produces K disjoint folds whose union covers all observations exactly once, with each fold having $\lfloor n/K \rfloor$ or $\lceil n/K \rceil$ observations.

n = 23  # not evenly divisible by K
K = 5

# Generate fold assignments
indices = np.random.permutation(n)
folds = np.array_split(indices, K)

print(f"n={n}, K={K}")
print(f"Fold sizes: {[len(f) for f in folds]}")

# Verify coverage: every index appears exactly once
all_indices = np.sort(np.concatenate(folds))
print(f"All indices covered: {np.array_equal(all_indices, np.arange(n))}")

# Verify disjointness
for i in range(K):
    for j in range(i + 1, K):
        overlap = np.intersect1d(folds[i], folds[j])
        assert len(overlap) == 0, f"Folds {i} and {j} overlap!"
print("All folds disjoint: True")
n=23, K=5
Fold sizes: [5, 5, 5, 4, 4]
All indices covered: True
All folds disjoint: True
Why does “exactly once” matter?

The “exactly once” property is what gives K-fold CV its honesty: each observation contributes to exactly one test set, so every prediction comes from a model that never saw that observation. (Strictly, the estimate is unbiased for models trained on a fraction (K-1)/K of the data, which makes it slightly pessimistic for the full-data model.) The $\lfloor n/K \rfloor$ vs $\lceil n/K \rceil$ distinction handles the case when n isn’t divisible by K --- some folds get one extra observation, but this asymmetry is negligible for reasonable K.
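The fold-assignment code above stops short of the actual train/test loop, so here is a minimal sketch of the full procedure for a plain OLS model on simulated data (all names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 100, 5
x = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
y = 3 + 2 * x + rng.normal(0, 1, n)

# Shuffle once, split into K folds (Theorem BS.3 guarantees full coverage)
folds = np.array_split(rng.permutation(n), K)

fold_mse = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    # Fit on K-1 folds, predict on the held-out fold
    beta = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]
    pred = X[test_idx] @ beta
    fold_mse.append(np.mean((y[test_idx] - pred) ** 2))

print(f"Per-fold MSE: {np.round(fold_mse, 3)}")
print(f"CV estimate of prediction error: {np.mean(fold_mse):.3f}")
```

With noise standard deviation 1, the CV estimate should land near the irreducible error of 1.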


The hat matrix trick

The problem

Leave-one-out cross-validation (LOOCV) requires fitting n separate models --- one for each left-out observation. For OLS, this seems expensive: n matrix inversions for a dataset of size n.

The shortcut

For linear regression, we can compute the LOO prediction error without refitting! The residual for observation ii when it’s left out is:

$$e_{i}^{(-i)} = \frac{e_i}{1 - h_i}$$

where $e_i$ is the ordinary residual and $h_i$ is the leverage (the $i$-th diagonal element of the hat matrix).

Theorem BS.4: For OLS, the leave-one-out residual equals the ordinary residual divided by $(1 - h_i)$.

# Fit full model
n = 30
x = np.random.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
y = 3 + 2 * x + np.random.normal(0, 1, n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta

# Hat matrix leverages
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Method 1: Hat matrix trick (instant)
loo_resid_trick = e / (1 - h)

# Method 2: Actually refit n times (slow)
loo_resid_actual = np.empty(n)
for i in range(n):
    mask = np.ones(n, dtype=bool)
    mask[i] = False
    beta_loo = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    loo_resid_actual[i] = y[i] - X[i] @ beta_loo

# They match!
print(f"Max difference: {np.max(np.abs(loo_resid_trick - loo_resid_actual)):.2e}")

# PRESS statistic (prediction error sum of squares)
PRESS = np.sum(loo_resid_trick**2)
RSS = np.sum(e**2)
print(f"\nRSS (training error):  {RSS:.4f}")
print(f"PRESS (LOO CV error):  {PRESS:.4f}")
print(f"PRESS > RSS: {PRESS > RSS}  (always true — honest error is larger)")
Max difference: 6.61e-15

RSS (training error):  28.6086
PRESS (LOO CV error):  32.8383
PRESS > RSS: True  (always true — honest error is larger)
Why does the hat matrix trick work?

The key insight is that the hat matrix $\mathbf{H}$ tells you how much each observation influences its own prediction. An observation with leverage $h_i$ close to 1 strongly determines its own fit --- removing it would change the prediction a lot. The formula $e_i / (1 - h_i)$ inflates the residual by exactly the right amount to account for this self-influence, giving the same result as actually refitting without observation $i$. This turns an $O(n \cdot p^2 \cdot n)$ procedure into an $O(p^2 \cdot n)$ one --- a factor-of-$n$ speedup for free.
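One practical refinement: the demo above forms the full $n \times n$ hat matrix, but only its diagonal is needed, and each $h_i = x_i^\top (X^\top X)^{-1} x_i$ can be computed directly. A sketch (equivalent to taking `np.diag(H)`, but without the $n \times n$ intermediate):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Diagonal-only leverages: h_i = x_i^T (X^T X)^{-1} x_i for each row x_i
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

# Sanity check against the explicit hat matrix
H = X @ XtX_inv @ X.T
print(f"Max difference from diag(H): {np.max(np.abs(h - np.diag(H))):.2e}")
# trace(H) = p, so the leverages sum to the number of parameters (here 2)
print(f"Sum of leverages: {h.sum():.4f}")
```

This keeps memory at $O(np)$ instead of $O(n^2)$, which matters once $n$ is in the tens of thousands.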


Summary

| Theorem | Statement | What it means |
|---|---|---|
| BS.1 | $p = (1 + \text{count}) / (B + 1)$ | Correct permutation p-value with +1 |
| BS.2 | With-replacement sampling | A bootstrap sample contains ~63.2% of the observations |
| BS.3 | K-fold: complete, disjoint, balanced | Every observation tested exactly once |
| BS.4 | LOO residual $= e_i / (1 - h_i)$ | Fast cross-validation via hat matrix |
| BS.5 | BCa → percentile when $z_0 = a = 0$ | BCa never distorts unbiased estimators |
| BS.6 | $a = 0$ for symmetric jackknife | Acceleration captures influence skewness |