Two-Sample Tests - bossanova

Classical Test	bossanova Equivalent	Variance Assumption
Independent t-test	`model("y ~ group", df)`	Equal (pooled)
Welch’s t-test	`model("y ~ group", df).infer(errors="unequal_var")`	Unequal (Welch df)
Mann-Whitney U	`model("rank(y) ~ group", df)`	Robust to outliers

All examples use the included penguins dataset.

import numpy as np
import polars as pl
from bossanova.model import model
from bossanova import load_dataset

np.random.seed(42)

# Load penguins dataset (filter out "NA" sex values which are strings, not nulls)
penguins = load_dataset("penguins").drop_nulls().filter(pl.col("sex") != "NA")

Independent t-test (Equal Variances)¶

Classical:

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}, \quad t \sim t(n_1+n_2-2) \text{ under } H_0

(1)

As GLM:

y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \beta_0 + \beta_1 g_i, \quad g_i \in \{0, 1\}

(2)

t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} \sim t(n_1+n_2-2) \text{ under } H_0: \beta_1 = 0

(3)

The slope $\hat{\beta}_1 = \bar{x}_1 - \bar{x}_2$ is the group mean difference; $t_{\text{classical}} = t_{\beta_1}$ exactly.

scipy¶

from scipy.stats import ttest_ind

male_bill = penguins.filter(pl.col("sex") == "male")["bill_depth_mm"].to_numpy()
female_bill = penguins.filter(pl.col("sex") == "female")["bill_depth_mm"].to_numpy()

scipy_ttest_eq = ttest_ind(male_bill, female_bill, equal_var=True)
scipy_ttest_eq

TtestResult(statistic=np.float64(7.306540245129378), pvalue=np.float64(2.066410345755146e-12), df=np.float64(331.0))

bossanova¶

m = model("bill_depth_mm ~ sex", penguins).fit().infer()

m.params[1].select("statistic", "df", "p_value")

bn_t = float(m.params["statistic"][1])
bn_p = float(m.params["p_value"][1])
assert np.isclose(abs(bn_t), abs(scipy_ttest_eq.statistic), rtol=1e-3), f"t mismatch: {bn_t} vs {scipy_ttest_eq.statistic}"
assert np.isclose(bn_p, scipy_ttest_eq.pvalue, rtol=1e-3), f"p mismatch: {bn_p} vs {scipy_ttest_eq.pvalue}"

Welch’s t-test (Unequal Variances)¶

Classical:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \quad t \sim t(df_W) \text{ under } H_0

(4)

df_W = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}

(5)

As GLM:

y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \beta_0 + \beta_1 g_i

(6)

\text{Var}(\hat{\beta}_1) = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}, \quad t = \frac{\hat{\beta}_1}{\sqrt{\text{Var}(\hat{\beta}_1)}} \sim t(df_W) \text{ under } H_0: \beta_1 = 0

(7)

Same structural model as the pooled t-test — only the variance estimator and degrees of freedom change.

scipy¶

# Compare Adelie vs Gentoo bill length (species with different variances)
adelie = penguins.filter(pl.col("species") == "Adelie")["bill_length_mm"].to_numpy()
gentoo = penguins.filter(pl.col("species") == "Gentoo")["bill_length_mm"].to_numpy()

scipy_welch = ttest_ind(gentoo, adelie, equal_var=False)
scipy_welch

TtestResult(statistic=np.float64(24.286066500471392), pvalue=np.float64(7.821528746388473e-66), df=np.float64(233.5085891130877))

bossanova¶

df_species = penguins.filter(pl.col("species").is_in(["Adelie", "Gentoo"])).select("bill_length_mm", "species")
m_welch = model("bill_length_mm ~ species", df_species).fit().infer(errors="unequal_var")

m_welch.params[1].select("statistic", "df", "p_value")

bn_t_w = float(m_welch.params["statistic"][1])
bn_p_w = float(m_welch.params["p_value"][1])
bn_df = float(m_welch.params["df"][1])

# Welch t, p, df should all match scipy exactly
assert np.isclose(abs(bn_t_w), abs(scipy_welch.statistic), rtol=1e-3), f"t mismatch: {bn_t_w} vs {scipy_welch.statistic}"
assert np.isclose(bn_p_w, scipy_welch.pvalue, rtol=1e-3), f"p mismatch: {bn_p_w} vs {scipy_welch.pvalue}"
assert np.isclose(bn_df, scipy_welch.df, rtol=1e-3), f"df mismatch: {bn_df} vs {scipy_welch.df}"

Mann-Whitney U Test¶

Classical:

U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1, \quad z = \frac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1+n_2+1)/12}} \dot{\sim} \mathcal{N}(0,1) \text{ under } H_0

(8)

where $R_1$ is the sum of ranks in group 1.

As GLM:

y_i^* \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \beta_0 + \beta_1 g_i, \quad \text{where } y_i^* = \text{rank}(y_i)

(9)

t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} \sim t(n_1+n_2-2) \text{ under } H_0: \beta_1 = 0

(10)

The rank transformation makes inference robust to outliers and non-normality. The GLM replaces the $U$ statistic with a $t$ -test on ranks — both test the same null of equal group locations.

scipy¶

from scipy.stats import mannwhitneyu

male_mass = penguins.filter(pl.col("sex") == "male")["body_mass_g"].to_numpy()
female_mass = penguins.filter(pl.col("sex") == "female")["body_mass_g"].to_numpy()

scipy_mw = mannwhitneyu(male_mass, female_mass, alternative='two-sided')
scipy_mw

MannwhitneyuResult(statistic=np.float64(20845.5), pvalue=np.float64(1.8133343032461053e-15))

bossanova¶

m_mw = model("rank(body_mass_g) ~ sex", penguins).fit().infer()

m_mw.params[1].select("statistic", "p_value")

scipy reports the Mann-Whitney U statistic; bossanova reports a t-statistic on ranks. The test statistics differ but both test H₀: equal group locations and yield equivalent p-values.

bn_p_mw = float(m_mw.params["p_value"][1])
# Both methods should strongly reject (large sex difference in body mass)
assert bn_p_mw < 1e-10, f"Expected highly significant bossanova p-value, got {bn_p_mw}"
assert scipy_mw.pvalue < 1e-10, f"Expected highly significant scipy p-value, got {scipy_mw.pvalue}"
# Normal approximations should be in the same ballpark
assert np.isclose(bn_p_mw, scipy_mw.pvalue, rtol=0.5), f"p-value gap too large: {bn_p_mw} vs {scipy_mw.pvalue}"