
Within-Subject Designs

UCSD Psychology
| Classical Test | bossanova Equivalent | When |
| --- | --- | --- |
| Paired t-test | `model("y ~ condition + (1\|subject)", df)` | 2 conditions |
| RM-ANOVA | `model("y ~ condition + (1\|subject)", df)` | 3+ conditions |
| Wilcoxon signed-rank (paired) | `model("rank(y) ~ condition + (1\|subject)", df)` | 2 conditions, robust |
| Friedman test | `model("rank(y) ~ condition + (1\|subject)", df)` | 3+ conditions, robust |
| RM-ANOVA + covariate | `model("y ~ condition + covariate + (1\|subject)", df)` | Between-subject covariate |

Notice that paired tests and repeated-measures tests use the same formula. Adding (1|subject) accounts for within-subject variation: with 2 conditions this is equivalent to the paired t-test; with 3+ conditions it becomes RM-ANOVA.

Two conditions (paired t-test)

Classical:

$$
d_i = x_{1i} - x_{2i}, \quad t = \frac{\bar{d}}{s_d / \sqrt{n}}, \quad t \sim t(n-1) \text{ under } H_0: \mu_d = 0
$$

where $n$ is the number of paired observations.

As GLM (mixed):

$$
y_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2), \quad \mu_{ij} = \beta_0 + \beta_1 x_j + u_i, \quad u_i \sim \mathcal{N}(0, \sigma_u^2)
$$

The varying intercept $u_i$ absorbs subject-level baseline differences -- equivalent to computing difference scores. The test of $H_0: \beta_1 = 0$ asks whether the conditions differ after accounting for within-subject variation.
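This reduction can be checked directly with scipy on simulated paired data (hypothetical values, not the penguins dataset): the paired t-test is exactly a one-sample t-test on the difference scores.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Simulated paired data (hypothetical): each subject measured in two conditions
rng = np.random.default_rng(0)
n = 30
baseline = rng.normal(100, 10, n)            # subject-level baselines (the u_i)
cond1 = baseline + rng.normal(0, 5, n)
cond2 = baseline + 3 + rng.normal(0, 5, n)   # true condition effect = +3

# The paired t-test equals a one-sample t-test on difference scores --
# the same reduction the varying intercept performs in the mixed model
t_paired = ttest_rel(cond2, cond1)
t_diff = ttest_1samp(cond2 - cond1, 0.0)
print(t_paired.statistic, t_diff.statistic)  # identical
```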

scipy

male = penguins.filter(pl.col("sex") == "male")["body_mass_g"].to_numpy()
female = penguins.filter(pl.col("sex") == "female")["body_mass_g"].to_numpy()

scipy_ttest = ttest_ind(male, female)
scipy_ttest
TtestResult(statistic=np.float64(8.541720337994516), pvalue=np.float64(4.897246751596224e-16), df=np.float64(331.0))

bossanova (simple linear model)

Without varying effects, the linear model recovers the classical independent t-test exactly:

m_lm = model("body_mass_g ~ sex", penguins).fit().infer()

m_lm.params[1].select("term", "statistic", "df", "p_value")
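That equivalence can also be verified by hand with numpy on simulated two-group data (hypothetical values standing in for the sex comparison): the OLS slope t-statistic for a 0/1 dummy equals the pooled-variance independent t-test.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated two-group data (hypothetical body-mass-like values)
rng = np.random.default_rng(1)
g1 = rng.normal(4500, 400, 60)
g2 = rng.normal(4100, 400, 60)

# OLS with a 0/1 dummy predictor: y = b0 + b1 * x
y = np.concatenate([g1, g2])
x = np.concatenate([np.ones(60), np.zeros(60)])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (len(y) - 2)                  # residual variance
se_b1 = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_ols = beta[1] / se_b1

# Identical to the pooled-variance independent t-test
t_classic = ttest_ind(g1, g2).statistic
print(t_ols, t_classic)
```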

bossanova (mixed model)

Adding (1|species) accounts for species-level variation, giving a more powerful test -- the t-statistic is larger because residual variance is reduced:

m = model("body_mass_g ~ sex + (1|species)", penguins).fit().infer()

m.params.select("term", "statistic", "df", "p_value")

The mixed-model statistic differs from the classical t-test because it partitions variance into species-level and residual components. By accounting for group structure, the test becomes more sensitive.
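A minimal sketch of why this happens, using simulated data and within-group mean-centering as a stand-in for the varying intercept (plain scipy, not bossanova): removing group baselines shrinks the residual variance, so the same effect yields a larger t-statistic.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated data: a small condition effect (+2) inside three groups whose
# baselines differ by far more than the effect (the "species" role)
rng = np.random.default_rng(2)
group_means = np.repeat([0.0, 30.0, 60.0], 40)
cond = np.tile([0, 1], 60)
y = group_means + 2.0 * cond + rng.normal(0, 3, 120)

# Pooled test: group-level variance inflates the residual, masking the effect
t_pooled = ttest_ind(y[cond == 1], y[cond == 0]).statistic

# Centering within each group removes baseline differences, mimicking
# what the varying intercept does
y_centered = y.copy()
for g in range(3):
    idx = slice(g * 40, (g + 1) * 40)
    y_centered[idx] -= y_centered[idx].mean()
t_within = ttest_ind(y_centered[cond == 1], y_centered[cond == 0]).statistic
print(t_pooled, t_within)  # t_within is much larger
```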

Three+ conditions (RM-ANOVA)

Classical:

$$
F = \frac{MS_{\text{condition}}}{MS_{\text{error}}}, \quad F \sim F(k-1,\; (k-1)(n-1)) \text{ under } H_0 \text{ (assuming sphericity)}
$$

As GLM (mixed):

$$
y_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2), \quad \mu_{ij} = \beta_0 + \sum_{l=1}^{k-1} \beta_l x_{lj} + u_i, \quad u_i \sim \mathcal{N}(0, \sigma_u^2)
$$

Standard ANOVA ignores within-group correlation, inflating Type I error. The varying intercept captures group-level baseline differences, properly partitioning variance.
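The classical F ratio can be computed by hand. This sketch uses simulated three-group data (hypothetical flipper-length-like values, one-way between-subjects to match the scipy comparison) and reproduces `f_oneway` exactly.

```python
import numpy as np
from scipy.stats import f_oneway

# Simulated three-group data (hypothetical values)
rng = np.random.default_rng(3)
groups = [rng.normal(m, 5, 30) for m in (190.0, 196.0, 217.0)]

# Between- and within-group mean squares, as in the F ratio
y = np.concatenate(groups)
k = len(groups)
grand = y.mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (ss_between / (k - 1)) / (ss_within / (len(y) - k))

print(F, f_oneway(*groups).statistic)  # identical
```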

scipy

adelie = penguins.filter(pl.col("species") == "Adelie")["flipper_length_mm"].to_numpy()
chinstrap = penguins.filter(pl.col("species") == "Chinstrap")["flipper_length_mm"].to_numpy()
gentoo = penguins.filter(pl.col("species") == "Gentoo")["flipper_length_mm"].to_numpy()

scipy_anova = f_oneway(adelie, chinstrap, gentoo)
scipy_anova
F_onewayResult(statistic=np.float64(567.4069920123421), pvalue=np.float64(1.5874180554406345e-107))

bossanova

m_rm = model("flipper_length_mm ~ species + (1|island)", penguins).fit().infer()

m_rm.params.select("term", "estimate", "statistic", "p_value")
# Joint F-test for species effect
m_rm.infer("joint").effects
# Random effects variance and model fit
m_rm.diagnostics.select("aic", "bic", "rsquared_marginal", "rsquared_conditional", "icc")

Rank-based variants (robust)

The rank() transformation makes within-subject inference robust to outliers and non-normality. With 2 conditions this parallels the Wilcoxon signed-rank test; with 3+ it parallels the Friedman test.

Two conditions (Wilcoxon)

Classical:

$$
W^+ = \sum_{d_i > 0} R_i, \quad \text{where } R_i = \text{rank}(|d_i|) \text{ and } d_i = x_{1i} - x_{2i}
$$

As GLM (mixed):

$$
y_{ij}^* \sim \mathcal{N}(\mu_{ij}, \sigma^2), \quad \mu_{ij} = \beta_0 + \beta_1 x_j + u_i, \quad \text{where } y_{ij}^* = \text{rank}(y_{ij})
$$

scipy

scipy_mann = mannwhitneyu(male, female)
scipy_mann
MannwhitneyuResult(statistic=np.float64(20845.5), pvalue=np.float64(1.8133343032461053e-15))

bossanova

m_wilcox = model("rank(body_mass_g) ~ sex + (1|species)", penguins).fit().infer()

m_wilcox.params.filter(pl.col("term").str.contains("sex")).select("term", "statistic", "p_value")

scipy reports the Mann-Whitney U statistic; bossanova reports a t-statistic on ranks with varying intercepts. The test statistics differ, but both test $H_0$: no group difference, and they yield equivalent conclusions.
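That agreement can be sketched with scipy alone (simulated skewed data, no bossanova): a t-test on the pooled ranks tracks the Mann-Whitney test, and both reject when the groups genuinely differ.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata, ttest_ind

# Simulated skewed (lognormal) data where a plain t-test would be fragile
rng = np.random.default_rng(4)
a = rng.lognormal(1.0, 0.5, 50)
b = rng.lognormal(1.6, 0.5, 50)

# Pool, rank, then run the parametric test on the ranks
ranks = rankdata(np.concatenate([a, b]))
t_ranks = ttest_ind(ranks[:50], ranks[50:])
u_test = mannwhitneyu(a, b)
print(t_ranks.pvalue, u_test.pvalue)  # both detect the difference
```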

Three+ conditions (Friedman)

Classical:

$$
\chi^2_F = \frac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1), \quad \chi^2_F \dot{\sim} \chi^2(k-1) \text{ under } H_0
$$

where $R_j$ is the sum of ranks for condition $j$ across $n$ subjects.

As GLM (mixed):

$$
y_{ij}^* \sim \mathcal{N}(\mu_{ij}, \sigma^2), \quad \mu_{ij} = \beta_0 + \sum_{l=1}^{k-1} \beta_l x_{lj} + u_i, \quad \text{where } y_{ij}^* = \text{rank}(y_{ij})
$$

A joint $F$-test on the condition coefficients parallels the Friedman $\chi^2$.
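For the between-subjects analog (Kruskal-Wallis, which the scipy comparison uses), the rank-GLM correspondence is exact: a one-way F on the pooled ranks and the Kruskal-Wallis H are linked by the identity F = ((N-k) H) / ((k-1)(N-1-H)) for tie-free data. A sketch on simulated data:

```python
import numpy as np
from scipy.stats import f_oneway, kruskal, rankdata

# Simulated tie-free data for three groups
rng = np.random.default_rng(5)
groups = [rng.normal(m, 1.0, 25) for m in (0.0, 0.5, 1.2)]

H = kruskal(*groups).statistic

# One-way F on the pooled ranks
ranks = rankdata(np.concatenate(groups))
rank_groups = [ranks[i * 25:(i + 1) * 25] for i in range(3)]
F_ranks = f_oneway(*rank_groups).statistic

# Exact identity (no ties): F = ((N-k) H) / ((k-1)(N-1-H))
N, k = 75, 3
print(F_ranks, ((N - k) * H) / ((k - 1) * (N - 1 - H)))
```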

scipy

scipy_kruskal = kruskal(adelie, chinstrap, gentoo)
scipy_kruskal
KruskalResult(statistic=np.float64(237.34574750210166), pvalue=np.float64(2.890851468876691e-52))
bossanova

m_friedman = model("rank(flipper_length_mm) ~ species + (1|island)", penguins).fit().infer("joint")

m_friedman.effects

Adding covariates

Mixed models naturally extend to include between-subject covariates -- there is no simple classical equivalent.

m_cov = model("flipper_length_mm ~ species + sex + (1|island)", penguins).fit().infer()

m_cov.params.select("term", "estimate", "p_value")
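What covariate adjustment buys can be sketched with plain OLS on simulated data (no varying intercept, hypothetical values): when the covariate is confounded with group membership, the unadjusted group coefficient absorbs the covariate's effect, while the adjusted model isolates the true group effect.

```python
import numpy as np

# Simulated confounding: true group effect = 1, true covariate effect = 3
rng = np.random.default_rng(6)
n = 100
group = rng.integers(0, 2, n).astype(float)
covariate = 2.0 * group + rng.normal(0, 1, n)   # correlated with group
y = 1.0 * group + 3.0 * covariate + rng.normal(0, 1, n)

# Unadjusted: the group coefficient absorbs the covariate's effect (~7)
X1 = np.column_stack([np.ones(n), group])
b_unadj, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Adjusted: including the covariate isolates the group effect (~1)
X2 = np.column_stack([np.ones(n), group, covariate])
b_adj, *_ = np.linalg.lstsq(X2, y, rcond=None)
print(b_unadj[1], b_adj[1])
```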