Correlation Tests - bossanova

Classical Test	bossanova Equivalent	Relationship Type
Pearson correlation	`model("zscore(y) ~ zscore(x)", df)`	Linear
Spearman correlation	`model("zscore(rank(y)) ~ zscore(rank(x))", df)`	Monotonic

When both variables are standardized, the regression slope equals the correlation coefficient. bossanova provides zscore() and other expressions for common data transformations.

import numpy as np
import polars as pl
from bossanova.model import model
from bossanova import load_dataset
np.random.seed(42)

# Load penguins dataset
penguins = load_dataset("penguins").drop_nulls()

Pearson Correlation¶

Classical:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}, \quad t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t(n-2) \text{ under } H_0: \rho = 0

(1)

As GLM:

z_y \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = \beta_0 + \beta_1 z_x

(2)

\hat{\beta}_1 = r, \quad t_{\beta_1} = t_r

(3)

where $z_x$ and $z_y$ are standardized variables. When both variables are z-scored, the slope equals $r$ and the $t$ -tests are identical.

scipy¶

from scipy.stats import pearsonr

scipy_pearson = pearsonr(penguins["bill_length_mm"].to_numpy(), penguins["flipper_length_mm"].to_numpy())
scipy_pearson

PearsonRResult(statistic=np.float64(0.6561813407464278), pvalue=np.float64(1.7439736176207624e-43))

bossanova¶

m = model("zscore(flipper_length_mm) ~ zscore(bill_length_mm)", penguins).fit().infer()

m.params.select("estimate", "statistic", "p_value")

bn_slope = float(m.params["estimate"][1])
bn_p = float(m.params["p_value"][1])
assert np.isclose(bn_slope, scipy_pearson.statistic, rtol=1e-3), f"r mismatch: {bn_slope} vs {scipy_pearson.statistic}"
assert np.isclose(bn_p, scipy_pearson.pvalue, rtol=1e-3), f"p mismatch: {bn_p} vs {scipy_pearson.pvalue}"

Spearman Rank Correlation¶

Classical:

\rho = r(\text{rank}(x), \text{rank}(y)), \quad t = \frac{\rho\sqrt{n-2}}{\sqrt{1-\rho^2}} \sim t(n-2) \text{ under } H_0: \rho_s = 0

(4)

Pearson correlation applied to ranks.

As GLM:

z_{\text{rank}(y)} \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = \beta_0 + \beta_1 z_{\text{rank}(x)}

(5)

\hat{\beta}_1 = \rho, \quad t_{\beta_1} = t_\rho

(6)

Spearman correlation captures monotonic relationships and is robust to outliers and non-linearity. The same GLM trick applies — z-score the ranks, and the slope equals $\rho$ .

scipy¶

from scipy.stats import spearmanr

scipy_spearman = spearmanr(penguins["bill_length_mm"].to_numpy(), penguins["flipper_length_mm"].to_numpy())
scipy_spearman

SignificanceResult(statistic=np.float64(0.6727719416255543), pvalue=np.float64(2.0669356276079203e-46))

bossanova¶

m_spearman = model("zscore(rank(flipper_length_mm)) ~ zscore(rank(bill_length_mm))", penguins).fit().infer()

m_spearman.params.select("estimate", "statistic", "p_value")

bn_rho = float(m_spearman.params["estimate"][1])
bn_p_sp = float(m_spearman.params["p_value"][1])
# With zscore(rank()), the slope equals Spearman's rho exactly
assert np.isclose(bn_rho, scipy_spearman.statistic, rtol=1e-3), f"rho mismatch: {bn_rho} vs {scipy_spearman.statistic}"
assert np.isclose(bn_p_sp, scipy_spearman.pvalue, rtol=1e-3), f"p mismatch: {bn_p_sp} vs {scipy_spearman.pvalue}"