simulation - bossanova

Simulation utilities for parameter recovery and Monte Carlo studies.

Call chain:

model.simulate(n=...) -> generate_{lm,glm,lmer,glmer}_data() (pre-fit DGP)
model.simulate(nsim=...) -> simulate_responses_from_fit() (post-fit parametric bootstrap)
run_power_study(formula, ...) -> run_power_analysis() -> expand_sweep_grid()

Classes:

Name	Description
`MonteCarloResult`	Results from a Monte Carlo simulation study.
`SimulateResult`	Immutable result of the simulate lifecycle.

Functions:

Name	Description
`bias`	Compute bias: E[beta_hat] - beta_true.
`compute_mc_iteration`	Execute a single Monte Carlo iteration.
`compute_mu_with_new_re`	Compute conditional mean with newly sampled random effects.
`compute_wilson_ci`	Wilson score confidence interval for a binomial proportion.
`coverage`	Compute coverage probability.
`empirical_se`	Compute empirical standard error (SD of estimates).
`execute_simulate`	Execute simulation: power analysis, post-fit sampling, or pre-fit generation.
`expand_sweep_grid`	Full factorial grid from base DGP + power sweep overrides.
`generate_data_from_spec`	Generate a synthetic dataset from a simulation specification.
`generate_glm_data`	Generate GLM data with known parameters.
`generate_glmer_data`	Generate GLMM data with known parameters.
`generate_lm_data`	Generate linear model data with known parameters.
`generate_lmer_data`	Generate linear mixed model data with known parameters.
`mean_se`	Compute mean of standard errors across simulations.
`rejection_rate`	Compute rejection rate (proportion of p-values < alpha).
`rmse`	Compute root mean squared error.
`run_monte_carlo`	Run a Monte Carlo simulation study.
`run_power_analysis`	Run simulation-based power analysis for a model formula.
`run_power_study`	Run power analysis across a sweep grid.
`simulate_responses_from_fit`	Simulate new responses from a fitted model.

Modules:

Name	Description
`dgp`	Data generating process (DGP) utilities for simulation studies.
`harness`	Monte Carlo simulation harness for parameter recovery studies.
`lifecycle`	Simulate lifecycle orchestration.
`metrics`	Metrics for Monte Carlo simulation studies.
`model_sim`	Model-level simulation operations.
`power`	Power analysis via simulation sweep grids.

Classes¶

MonteCarloResult¶

Results from a Monte Carlo simulation study.

Stores estimates, standard errors, confidence intervals, and p-values from repeated simulations, along with ground truth parameters.

Created by: run_monte_carlo() Consumed by: User analysis code (bias, coverage, power methods) Augmented by: Never (immutable result container)

Attributes:

Name	Type	Description
`estimates`	`ndarray`	Array of shape (n_sims, n_params) with coefficient estimates.
`std_errors`	`ndarray`	Array of shape (n_sims, n_params) with standard errors.
`ci_lower`	`ndarray`	Array of shape (n_sims, n_params) with CI lower bounds.
`ci_upper`	`ndarray`	Array of shape (n_sims, n_params) with CI upper bounds.
`p_values`	`ndarray`	Array of shape (n_sims, n_params) with p-values.
`true_values`	`dict[str, float]`	Named mapping of true parameter values (e.g., {“Intercept”: 0.0, “x”: 0.5}).
`param_names`	`tuple[str, ...]`	Tuple of parameter names matching array columns.
`n_sims`	`int`	Number of successful simulations.
`n_failed`	`int`	Number of simulations that failed to converge.

Examples:

result = run_monte_carlo(
    dgp_fn=generate_lm_data,
    dgp_params={"n": 100, "beta": [1.0, 2.0], "sigma": 1.0},
    fit_fn=lambda d: model("y ~ x1", data=d).fit(),
    n_sims=500,
)
result.bias("x1")  # Should be close to 0
result.coverage("x1")  # Should be close to 0.95

Functions:

Name	Description
`bias`	Compute bias for a parameter: E[beta_hat] - beta_true.
`coverage`	Compute coverage probability for a parameter.
`empirical_se`	Compute empirical SE (SD of estimates) for a parameter.
`get_idx`	Get array index for a parameter.
`get_true_value`	Get true value for a parameter.
`mean_se`	Compute mean estimated standard error for a parameter.
`power`	Compute power for a parameter.
`rmse`	Compute RMSE for a parameter.
`summary`	Return summary DataFrame with all metrics for all parameters.
`type1_rate`	Compute Type I error rate for a parameter.

Attributes¶

ci_lower¶

ci_lower: np.ndarray = field(validator=is_ndarray)

ci_upper¶

ci_upper: np.ndarray = field(validator=is_ndarray)

estimates¶

estimates: np.ndarray = field(validator=is_ndarray)

n_failed¶

n_failed: int = field(default=0)

n_sims¶

n_sims: int = field(default=0)

p_values¶

p_values: np.ndarray = field(validator=is_ndarray)

param_names¶

param_names: tuple[str, ...] = field(converter=tuple, factory=tuple)

std_errors¶

std_errors: np.ndarray = field(validator=is_ndarray)

true_values¶

true_values: dict[str, float] = field(factory=dict)

Functions¶

bias¶

bias(param: str | int) -> float

Compute bias for a parameter: E[beta_hat] - beta_true.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	Bias value.

coverage¶

coverage(param: str | int, level: float = 0.95) -> float

Compute coverage probability for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required
`level`	`float`	Confidence level (default 0.95). Currently only used for documentation; actual CIs come from the model.	`0.95`

Returns:

Type	Description
`float`	Coverage probability (0 to 1).

empirical_se¶

empirical_se(param: str | int) -> float

Compute empirical SE (SD of estimates) for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	Empirical SE.

get_idx¶

get_idx(param: str | int) -> int

Get array index for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`int`	Integer index into the parameter arrays.

get_true_value¶

get_true_value(param: str | int) -> float

Get true value for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	True parameter value.

mean_se¶

mean_se(param: str | int) -> float

Compute mean estimated standard error for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	Mean SE across simulations.

power¶

power(param: str | int, alpha: float = 0.05) -> float

Compute power for a parameter.

This is the rejection rate when the true effect is non-zero. Use this for parameters where true_value != 0.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required
`alpha`	`float`	Significance level (default 0.05).	`0.05`

Returns:

Type	Description
`float`	Power (probability of correctly rejecting H0).

rmse¶

rmse(param: str | int) -> float

Compute RMSE for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	Root mean squared error.

summary¶

summary() -> pl.DataFrame

Return summary DataFrame with all metrics for all parameters.

Returns:

Type	Description
`DataFrame`	DataFrame with columns: param, true_value, mean_est, bias,
`DataFrame`	rmse, mean_se, empirical_se, coverage, rejection_rate.

type1_rate¶

type1_rate(param: str | int, alpha: float = 0.05) -> float

Compute Type I error rate for a parameter.

This is the rejection rate when the true effect is zero. Use this for parameters where true_value = 0.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required
`alpha`	`float`	Significance level (default 0.05).	`0.05`

Returns:

Type	Description
`float`	Type I error rate (should be ~ alpha if correctly calibrated).

SimulateResult¶

Immutable result of the simulate lifecycle.

Attributes:

Name	Type	Description
`mode`	`str`	One of `"power"`, `"simulate"`, or `"generate"`.
`simulations`	`DataFrame \| None`	Simulated response DataFrame (post-fit mode).
`generated_data`	`DataFrame \| None`	Generated DataFrame (pre-fit mode).
`power`	`MonteCarloResult \| None`	Monte Carlo power result (power mode).

Attributes¶

generated_data¶

generated_data: pl.DataFrame | None = None

mode¶

mode: str

power¶

power: MonteCarloResult | None = None

simulations¶

simulations: pl.DataFrame | None = None

Functions¶

bias¶

bias(estimates: ArrayLike, true_value: float) -> float

Compute bias: E[beta_hat] - beta_true.

Parameters:

Name	Type	Description	Default
`estimates`	`ArrayLike`	Array of parameter estimates from simulations.	required
`true_value`	`float`	True parameter value.	required

Returns:

Type	Description
`float`	Bias (positive = overestimate, negative = underestimate).

Examples:

estimates = np.array([1.02, 0.98, 1.01, 0.99, 1.00])
bias(estimates, true_value=1.0)
# 0.0

compute_mc_iteration¶

compute_mc_iteration(seed: int, dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], dgp_params: dict[str, Any], fit_fn: Callable[[pl.DataFrame], Any]) -> dict[str, Any] | None

Execute a single Monte Carlo iteration.

Standalone function (not a closure) for joblib serialization compatibility. Returns None if fitting fails.

Parameters:

Name	Type	Description	Default
`seed`	`int`	Random seed for this iteration.	required
`dgp_fn`	`Callable..., [tuple[DataFrame, dict[str, Any]]]`	Data generating function returning (data, true_params).	required
`dgp_params`	`dict[str, Any]`	Parameters to pass to dgp_fn (excluding seed).	required
`fit_fn`	`Callable[[DataFrame], Any]`	Function that takes a DataFrame and returns a fitted model.	required

Returns:

Type	Description
`dict[str, Any] \| None`	Dictionary with ‘params_df’ and ‘true_params’ on success, or
`dict[str, Any] \| None`	None if the model fit fails.

compute_mu_with_new_re¶

compute_mu_with_new_re(bundle: 'DataBundle', fit: 'FitState', spec: 'ModelSpec', rng: np.random.Generator) -> np.ndarray

Compute conditional mean with newly sampled random effects.

Draws u_new ~ N(0, I), transforms via the relative covariance factor Lambda to get new BLUPs, then computes mu = X @ coef + Z @ (Lambda @ u_new * sigma).

Parameters:

Name	Type	Description	Default
`bundle`	`‘DataBundle’`	Data bundle with X, Z matrices and RE metadata.	required
`fit`	`‘FitState’`	Fit state with theta, sigma, and coefficients.	required
`spec`	`‘ModelSpec’`	Model specification with family and link info.	required
`rng`	`Generator`	NumPy random generator for reproducibility.	required

Returns:

Type	Description
`ndarray`	New mu values on the response scale, shape `(n,)`.

compute_wilson_ci¶

compute_wilson_ci(p_hat: float, n: int, level: float = 0.95) -> tuple[float, float]

Wilson score confidence interval for a binomial proportion.

Better coverage than the Wald interval near p=0 or p=1.

Parameters:

Name	Type	Description	Default
`p_hat`	`float`	Observed proportion (0 to 1).	required
`n`	`int`	Number of trials.	required
`level`	`float`	Confidence level (default 0.95).	`0.95`

Returns:

Type	Description
`tuple[float, float]`	Tuple of (lower, upper) bounds.

Examples:

compute_wilson_ci(0.8, n=100, level=0.95)
# (0.711..., 0.868...)

coverage¶

coverage(ci_lower: ArrayLike, ci_upper: ArrayLike, true_value: float) -> float

Compute coverage probability.

Coverage = P(CI_lower <= beta_true <= CI_upper)

For a correctly calibrated 95% CI, coverage should be ~0.95.

Parameters:

Name	Type	Description	Default
`ci_lower`	`ArrayLike`	Array of CI lower bounds from simulations.	required
`ci_upper`	`ArrayLike`	Array of CI upper bounds from simulations.	required
`true_value`	`float`	True parameter value.	required

Returns:

Type	Description
`float`	Coverage probability (0 to 1).

Examples:

ci_lower = np.array([0.8, 0.9, 0.85, 0.95, 0.88])
ci_upper = np.array([1.2, 1.1, 1.15, 1.05, 1.12])
coverage(ci_lower, ci_upper, true_value=1.0)
# 1.0

empirical_se¶

empirical_se(estimates: ArrayLike) -> float

Compute empirical standard error (SD of estimates).

This is the “true” standard error as estimated from the simulation distribution.

Parameters:

Name	Type	Description	Default
`estimates`	`ArrayLike`	Array of parameter estimates from simulations.	required

Returns:

Type	Description
`float`	Empirical SE (standard deviation of estimates).

execute_simulate¶

execute_simulate(spec: ModelSpec, bundle: DataBundle | None, fit: FitState | None, data: pl.DataFrame | None, formula: str, is_mixed: bool, n: int | None, nsim: int | None, seed: int | None, coef: dict[str, float] | None, sigma: float, varying: str, power: int | dict[str, Any] | None, var_specs: dict) -> SimulateResult

Execute simulation: power analysis, post-fit sampling, or pre-fit generation.

Parameters:

Name	Type	Description	Default
`spec`	`ModelSpec`	Model specification.	required
`bundle`	`DataBundle \| None`	Data bundle (None if pre-fit).	required
`fit`	`FitState \| None`	Fitted state (None if pre-fit).	required
`data`	`DataFrame \| None`	Current data (None if simulation-first).	required
`formula`	`str`	Formula string.	required
`is_mixed`	`bool`	Whether this is a mixed-effects model.	required
`n`	`int \| None`	Sample size for pre-fit generation or power analysis.	required
`nsim`	`int \| None`	Number of simulated response vectors (post-fit).	required
`seed`	`int \| None`	Random seed.	required
`coef`	`dict[str, float] \| None`	True coefficient values (pre-fit mode).	required
`sigma`	`float`	Residual SD (pre-fit gaussian).	required
`varying`	`str`	RE handling in post-fit (`"fitted"` or `"sample"`).	required
`power`	`int \| dict[str, Any] \| None`	Power analysis config (int or dict), or None.	required
`var_specs`	`dict`	Distribution specs for predictors.	required

Returns:

Type	Description
`SimulateResult`	SimulateResult with populated fields for the chosen mode.

expand_sweep_grid¶

expand_sweep_grid(base_n: int, base_coef: dict[str, float], base_sigma: float, base_varying: dict[str, dict[str, Any]], power_config: dict[str, Any]) -> list[dict[str, Any]]

Full factorial grid from base DGP + power sweep overrides.

Uses itertools.product for Cartesian expansion. Handles nested coef dict merging (e.g., sweeping one coefficient while keeping others fixed).

Parameters:

Name	Type	Description	Default
`base_n`	`int`	Base sample size.	required
`base_coef`	`dict[str, float]`	Base coefficient values (e.g., `{"Intercept": 0.0, "x": 0.5}`).	required
`base_sigma`	`float`	Base residual standard deviation.	required
`base_varying`	`dict[str, dict[str, Any]]`	Base random effect specs (empty dict for fixed-effects models).	required
`power_config`	`dict[str, Any]`	Sweep configuration dict. Keys are sweep axes (`"n"`, `"coef"`, `"sigma"`, `"varying"`). Non-sweep control keys (`"n_sims"`, `"alpha"`, `"conf_level"`, `"n_jobs"`, `"verbose"`, `"seed"`) are ignored.	required

Returns:

Type	Description
`list[dict[str, Any]]`	List of grid point dicts, each with keys: `"n"`, `"coef"`,
`list[dict[str, Any]]`	`"sigma"`, `"varying"`.

Examples:

>>> grid = expand_sweep_grid(
...     base_n=100, base_coef={"x": 0.5}, base_sigma=1.0,
...     base_varying={},
...     power_config={"n": [50, 100, 200]},
... )
>>> len(grid)
3
>>> grid[0]["n"]
50

generate_data_from_spec¶

generate_data_from_spec(sim_spec: 'SimulationSpec', family: str, response_var: str) -> pl.DataFrame

Generate a synthetic dataset from a simulation specification.

Implements the data-generating process (DGP): samples predictors from their distributions, constructs the design matrix, computes the linear predictor, and draws response values from the appropriate family.

Parameters:

Name	Type	Description	Default
`sim_spec`	`‘SimulationSpec’`	Simulation specification with sample size, distributions, coefficients, and random effect structure.	required
`family`	`str`	GLM family name (e.g. “gaussian”, “binomial”, “poisson”).	required
`response_var`	`str`	Name for the response column in the output DataFrame.	required

Returns:

Type	Description
`DataFrame`	Polars DataFrame with generated predictor and response columns.

generate_glm_data¶

generate_glm_data(n: int, beta: ArrayLike, family: FamilyType = 'gaussian', link: LinkType = None, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate GLM data with known parameters.

g(E[y]) = X @ beta g(E[y]) = X @ beta

where g is the link function determined by family/link.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of observations to generate.	required
`beta`	`ArrayLike`	True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1).	required
`family`	`FamilyType`	Response distribution: - “gaussian”: Normal distribution (default) - “binomial”: Binary outcomes (0/1) - “poisson”: Count data	`‘gaussian’`
`link`	`LinkType`	Link function. If None, uses canonical link for family: - gaussian: “identity” - binomial: “logit” - poisson: “log”	`None`
`sigma`	`float`	Residual SD (only used for gaussian family, default 1.0).	`1.0`
`x_type`	`Literal[‘gaussian’, ‘uniform’, ‘binary’]`	Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1}	`‘gaussian’`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix).	`None`
`seed`	`int \| None`	Random seed for reproducibility. If None, uses random state.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: polars DataFrame with columns y, x1, x2, ...
`tuple[DataFrame, dict]`	- true_params: dict with keys “beta”, “family”, “link”, and “sigma” (for gaussian)

Examples:

Old API (still works)::

# Logistic regression
data, params = generate_glm_data(
    n=200, beta=[0.0, 1.5], family="binomial"
)
data["y"].mean()  # Should be ~0.5 (balanced)

# Poisson regression
data, params = generate_glm_data(
    n=100, beta=[1.0, 0.5], family="poisson"
)

New API with distribution objects::

from bossanova.distributions import normal, uniform

data, params = generate_glm_data(
    n=200,
    beta=[0.0, 1.5],
    family="binomial",
    distributions={"x1": uniform(-2, 2)},
)

generate_glmer_data¶

generate_glmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, family: FamilyType = 'binomial', re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate GLMM data with known parameters.

g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i b_i ~ N(0, G)

where g is the canonical link function for the family.

Parameters:

Name	Type	Description	Default
`n_obs`	`int`	Total number of observations.	required
`n_groups`	`int`	Number of groups/clusters.	required
`beta`	`ArrayLike`	Fixed effect coefficients [intercept, slope1, ...].	required
`theta`	`ArrayLike`	Random effect parameters (see generate_lmer_data for details).	required
`family`	`FamilyType`	Response distribution: - “binomial”: Binary outcomes (0/1) with logit link - “poisson”: Count data with log link	`‘binomial’`
`re_structure`	`Literal[‘intercept’, ‘slope’, ‘both’]`	Random effects structure: - “intercept”: Random intercepts only - “slope”: Random slopes only - “both”: Random intercepts and slopes	`‘intercept’`
`obs_per_group`	`int \| None`	Observations per group. If None, uses n_obs // n_groups.	`None`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc.	`None`
`seed`	`int \| None`	Random seed for reproducibility.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: DataFrame with columns y, x1, group
`tuple[DataFrame, dict]`	- true_params: dict with beta, theta, family, and ranef

Examples:

Old API (still works)::

# Binomial GLMM with random intercepts
data, params = generate_glmer_data(
    n_obs=400, n_groups=40,
    beta=[0.0, 1.0],
    theta=[0.5],  # SD of random intercepts
    family="binomial",
)

New API with distribution objects::

from bossanova.distributions import uniform

data, params = generate_glmer_data(
    n_obs=400, n_groups=40,
    beta=[0.0, 1.0],
    theta=[0.5],
    family="binomial",
    distributions={"x1": uniform(-1, 1)},
)

generate_lm_data¶

generate_lm_data(n: int, beta: ArrayLike, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate linear model data with known parameters.

y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon epsilon ~ N(0, sigma^2)

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of observations to generate.	required
`beta`	`ArrayLike`	True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1).	required
`sigma`	`float`	Residual standard deviation (default 1.0).	`1.0`
`x_type`	`Literal[‘gaussian’, ‘uniform’, ‘binary’]`	Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1}	`‘gaussian’`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix).	`None`
`seed`	`int \| None`	Random seed for reproducibility. If None, uses random state.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: polars DataFrame with columns y, x1, x2, ...
`tuple[DataFrame, dict]`	- true_params: dict with keys “beta” (array) and “sigma” (float)

Examples:

Old API (still works)::

data, params = generate_lm_data(n=100, beta=[1.0, 2.0], sigma=0.5)
data.shape
# (100, 2)
params["beta"]
# array([1., 2.])

New API with distribution objects::

from bossanova.distributions import normal, categorical

data, params = generate_lm_data(
    n=100,
    beta=[0.0, 0.5],
    distributions={"x1": normal(0, 2)},
)

generate_lmer_data¶

generate_lmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, sigma: float = 1.0, re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate linear mixed model data with known parameters.

y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij b_i ~ N(0, G) epsilon_ij ~ N(0, sigma^2)

Parameters:

Name	Type	Description	Default
`n_obs`	`int`	Total number of observations. If obs_per_group is None, observations are distributed evenly across groups.	required
`n_groups`	`int`	Number of groups/clusters.	required
`beta`	`ArrayLike`	Fixed effect coefficients [intercept, slope1, ...].	required
`theta`	`ArrayLike`	Random effect parameters. Interpretation depends on re_structure: - “intercept”: [sigma_intercept] - SD of random intercepts - “slope”: [sigma_slope] - SD of random slopes (no random intercept) - “both”: [sigma_intercept, sigma_slope, corr] or [sigma_int, sigma_slope] If length 2, assumes uncorrelated (corr=0).	required
`sigma`	`float`	Residual standard deviation (default 1.0).	`1.0`
`re_structure`	`Literal[‘intercept’, ‘slope’, ‘both’]`	Random effects structure: - “intercept”: Random intercepts only (1	group) - “slope”: Random slopes only (0 + x1
`obs_per_group`	`int \| None`	Observations per group. If None, uses n_obs // n_groups.	`None`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc.	`None`
`seed`	`int \| None`	Random seed for reproducibility.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: DataFrame with columns y, x1, group
`tuple[DataFrame, dict]`	- true_params: dict with beta, theta, sigma, and ranef (actual random effects)

Examples:

Old API (still works)::

# Random intercepts model
data, params = generate_lmer_data(
    n_obs=200, n_groups=20,
    beta=[1.0, 0.5],
    theta=[0.5],  # SD of random intercepts
    sigma=1.0,
)

New API with distribution objects::

from bossanova.distributions import uniform

data, params = generate_lmer_data(
    n_obs=200, n_groups=20,
    beta=[1.0, 0.5],
    theta=[0.5],
    distributions={"x1": uniform(-1, 1)},
)

mean_se¶

mean_se(std_errors: ArrayLike) -> float

Compute mean of standard errors across simulations.

Parameters:

Name	Type	Description	Default
`std_errors`	`ArrayLike`	Array of SE estimates from simulations.	required

Returns:

Type	Description
`float`	Mean SE.

rejection_rate¶

rejection_rate(p_values: ArrayLike, alpha: float = 0.05) -> float

Compute rejection rate (proportion of p-values < alpha).

For null effects (beta=0), this estimates Type I error rate. For non-null effects (beta!=0), this estimates power.

Parameters:

Name	Type	Description	Default
`p_values`	`ArrayLike`	Array of p-values from simulations.	required
`alpha`	`float`	Significance level (default 0.05).	`0.05`

Returns:

Type	Description
`float`	Rejection rate (0 to 1).

Examples:

p_values = np.array([0.01, 0.06, 0.03, 0.12, 0.04])
rejection_rate(p_values, alpha=0.05)
# 0.6

rmse¶

rmse(estimates: ArrayLike, true_value: float) -> float

Compute root mean squared error.

RMSE = sqrt(E[(beta_hat - beta)^2]) = sqrt(Var(beta_hat) + Bias^2)

Parameters:

Name	Type	Description	Default
`estimates`	`ArrayLike`	Array of parameter estimates from simulations.	required
`true_value`	`float`	True parameter value.	required

Returns:

Type	Description
`float`	RMSE value.

Examples:

estimates = np.array([1.1, 0.9, 1.0, 1.0, 1.0])
rmse(estimates, true_value=1.0)
# 0.063...

run_monte_carlo¶

run_monte_carlo(dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], dgp_params: dict[str, Any], fit_fn: Callable[[pl.DataFrame], Any], n_sims: int = 1000, seed: int = 42, n_jobs: int = 1, verbose: bool = False) -> MonteCarloResult

Run a Monte Carlo simulation study.

Generates data from a DGP, fits models, and collects results for parameter recovery analysis.

Parameters:

Name	Type	Description	Default
`dgp_fn`	`Callable..., [tuple[DataFrame, dict[str, Any]]]`	Data generating function that returns (data, true_params). Must accept a `seed` parameter.	required
`dgp_params`	`dict[str, Any]`	Parameters to pass to dgp_fn (excluding seed).	required
`fit_fn`	`Callable[[DataFrame], Any]`	Function that takes a DataFrame and returns a fitted model. The model must expose a `.params` property returning a Polars DataFrame with columns matching `Col.TERM`, `Col.ESTIMATE`, etc.	required
`n_sims`	`int`	Number of simulations to run (default 1000).	`1000`
`seed`	`int`	Base random seed (default 42). Each simulation uses seed + i.	`42`
`n_jobs`	`int`	Number of parallel jobs. 1 = sequential, >1 = parallel via joblib (default 1).	`1`
`verbose`	`bool`	If True, show tqdm progress bar (default False).	`False`

Returns:

Type	Description
`MonteCarloResult`	MonteCarloResult with collected estimates and metrics.

Examples:

from bossanova import model
from simulation import generate_lm_data
result = run_monte_carlo(
    dgp_fn=generate_lm_data,
    dgp_params={"n": 100, "beta": [1.0, 2.0], "sigma": 1.0},
    fit_fn=lambda d: model("y ~ x1", d).fit().infer(),
    n_sims=500,
)
abs(result.bias("x1")) < 0.1
# True

run_power_analysis¶

run_power_analysis(*, formula: str, family: str, response_var: str, power: int | dict[str, Any], n: int, seed: int | None, coef: dict[str, float] | None, sigma: float, var_specs: dict[str, object]) -> pl.DataFrame

Run simulation-based power analysis for a model formula.

Pure-function extraction of the power analysis logic from the model class. Builds a sweep grid, constructs DGP and fit closures, and delegates to run_power_study().

Parameters:

Name	Type	Description	Default
`formula`	`str`	R-style model formula string.	required
`family`	`str`	Response distribution family name.	required
`response_var`	`str`	Name of the response variable.	required
`power`	`int \| dict[str, Any]`	Power config — int (n_sims) or dict with sweep axes and control params (`n_sims`, `alpha`, `conf_level`, `n_jobs`, `verbose`, `seed`).	required
`n`	`int`	Base sample size.	required
`seed`	`int \| None`	Random seed (used as fallback if power config has no seed).	required
`coef`	`dict[str, float] \| None`	Base coefficient values (or None for defaults).	required
`sigma`	`float`	Base residual standard deviation.	required
`var_specs`	`dict[str, object]`	Distribution specs for predictors (may include `Varying` instances for random effects).	required

Returns:

Type	Description
`DataFrame`	Polars DataFrame with one row per grid point per parameter.
`DataFrame`	Columns include sweep axes, power metrics, and diagnostics.

run_power_study¶

run_power_study(sweep_grid: list[dict[str, Any]], dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], fit_fn: Callable[[pl.DataFrame], Any], n_sims: int, seed: int, alpha: float = 0.05, ci_level: float = 0.95, n_jobs: int = 1, verbose: bool = False) -> pl.DataFrame

Run power analysis across a sweep grid.

For each grid point, calls run_monte_carlo() to run n_sims simulations, then extracts per-parameter metrics (power, coverage, bias, RMSE, etc.) and adds Wilson confidence intervals on power estimates. Returns a Polars DataFrame with one row per grid point per parameter.

Parameters:

Name	Type	Description	Default
`sweep_grid`	`list[dict[str, Any]]`	List of grid point dicts from `expand_sweep_grid()`. Each dict has keys `"n"`, `"coef"`, `"sigma"`, `"varying"`.	required
`dgp_fn`	`Callable..., [tuple[DataFrame, dict[str, Any]]]`	Data generating function that returns `(data, true_params)`. Must accept a `seed` parameter.	required
`fit_fn`	`Callable[[DataFrame], Any]`	Function that takes a DataFrame and returns a fitted model with a `.params` property.	required
`n_sims`	`int`	Number of simulations per grid point.	required
`seed`	`int`	Base random seed. Each grid point offsets by `i * n_sims`.	required
`alpha`	`float`	Significance level for power (default 0.05).	`0.05`
`ci_level`	`float`	Confidence level for Wilson CIs on power (default 0.95).	`0.95`
`n_jobs`	`int`	Number of parallel jobs for `run_monte_carlo()` (default 1).	`1`
`verbose`	`bool`	Show progress (default False).	`False`

Returns:

Type	Description
`DataFrame`	Polars DataFrame with one row per grid point per parameter.
`DataFrame`	Columns include sweep axes (`n`, `sigma`), parameter info
`DataFrame`	(`term`, `true_value`), power metrics (`power`,
`DataFrame`	`power_ci_lower`, `power_ci_upper`), and diagnostics
`DataFrame`	(`coverage`, `bias`, `rmse`, `mean_se`, `empirical_se`,
`DataFrame`	`n_sims`, `n_failed`).

simulate_responses_from_fit¶

simulate_responses_from_fit(fit: 'FitState', bundle: 'DataBundle', spec: 'ModelSpec', nsim: int, seed: int | None, varying: str = 'fitted') -> pl.DataFrame

Simulate new responses from a fitted model.

Uses the fitted coefficients and estimated residual variance to generate new response values. For gaussian models, adds normally distributed errors. For GLMs, generates from the appropriate distribution.

Parameters:

Name	Type	Description	Default
`fit`	`‘FitState’`	Fitted model state with coefficients, fitted values, and sigma.	required
`bundle`	`‘DataBundle’`	Data bundle with design matrix and RE metadata.	required
`spec`	`‘ModelSpec’`	Model specification with family and link information.	required
`nsim`	`int`	Number of simulations to generate.	required
`seed`	`int \| None`	Random seed for reproducibility, or None.	required
`varying`	`str`	How to handle random effects. `"fitted"` uses estimated BLUPs, `"sample"` draws new random effects each simulation.	`‘fitted’`

Returns:

Type	Description
`DataFrame`	DataFrame with columns `sim_1`, `sim_2`, ..., `sim_nsim`.

Modules¶

dgp¶

Data generating process (DGP) utilities for simulation studies.

Modules:

Name	Description
`generate`	Predictor generation utilities for DGP functions.
`glm`	Data generating process for generalized linear models.
`glmer`	Data generating process for generalized linear mixed models.
`lm`	Data generating process for linear models.
`lmer`	Data generating process for linear mixed models.

Functions:

Name	Description
`generate_glm_data`	Generate GLM data with known parameters.
`generate_glmer_data`	Generate GLMM data with known parameters.
`generate_lm_data`	Generate linear model data with known parameters.
`generate_lmer_data`	Generate linear mixed model data with known parameters.

Functions¶

generate_glm_data¶

generate_glm_data(n: int, beta: ArrayLike, family: FamilyType = 'gaussian', link: LinkType = None, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate GLM data with known parameters.

g(E[y]) = X @ beta g(E[y]) = X @ beta

where g is the link function determined by family/link.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of observations to generate.	required
`beta`	`ArrayLike`	True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1).	required
`family`	`FamilyType`	Response distribution: - “gaussian”: Normal distribution (default) - “binomial”: Binary outcomes (0/1) - “poisson”: Count data	`‘gaussian’`
`link`	`LinkType`	Link function. If None, uses canonical link for family: - gaussian: “identity” - binomial: “logit” - poisson: “log”	`None`
`sigma`	`float`	Residual SD (only used for gaussian family, default 1.0).	`1.0`
`x_type`	`Literal[‘gaussian’, ‘uniform’, ‘binary’]`	Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1}	`‘gaussian’`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix).	`None`
`seed`	`int \| None`	Random seed for reproducibility. If None, uses random state.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: polars DataFrame with columns y, x1, x2, ...
`tuple[DataFrame, dict]`	- true_params: dict with keys “beta”, “family”, “link”, and “sigma” (for gaussian)

Examples:

Old API (still works)::

# Logistic regression
data, params = generate_glm_data(
    n=200, beta=[0.0, 1.5], family="binomial"
)
data["y"].mean()  # Should be ~0.5 (balanced)

# Poisson regression
data, params = generate_glm_data(
    n=100, beta=[1.0, 0.5], family="poisson"
)

New API with distribution objects::

from bossanova.distributions import normal, uniform

data, params = generate_glm_data(
    n=200,
    beta=[0.0, 1.5],
    family="binomial",
    distributions={"x1": uniform(-2, 2)},
)

generate_glmer_data¶

generate_glmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, family: FamilyType = 'binomial', re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate GLMM data with known parameters.

g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i b_i ~ N(0, G)

where g is the canonical link function for the family.

Parameters:

Name	Type	Description	Default
`n_obs`	`int`	Total number of observations.	required
`n_groups`	`int`	Number of groups/clusters.	required
`beta`	`ArrayLike`	Fixed effect coefficients [intercept, slope1, ...].	required
`theta`	`ArrayLike`	Random effect parameters (see generate_lmer_data for details).	required
`family`	`FamilyType`	Response distribution: - “binomial”: Binary outcomes (0/1) with logit link - “poisson”: Count data with log link	`‘binomial’`
`re_structure`	`Literal[‘intercept’, ‘slope’, ‘both’]`	Random effects structure: - “intercept”: Random intercepts only - “slope”: Random slopes only - “both”: Random intercepts and slopes	`‘intercept’`
`obs_per_group`	`int \| None`	Observations per group. If None, uses n_obs // n_groups.	`None`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc.	`None`
`seed`	`int \| None`	Random seed for reproducibility.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: DataFrame with columns y, x1, group
`tuple[DataFrame, dict]`	- true_params: dict with beta, theta, family, and ranef

Examples:

Old API (still works)::

# Binomial GLMM with random intercepts
data, params = generate_glmer_data(
    n_obs=400, n_groups=40,
    beta=[0.0, 1.0],
    theta=[0.5],  # SD of random intercepts
    family="binomial",
)

New API with distribution objects::

from bossanova.distributions import uniform

data, params = generate_glmer_data(
    n_obs=400, n_groups=40,
    beta=[0.0, 1.0],
    theta=[0.5],
    family="binomial",
    distributions={"x1": uniform(-1, 1)},
)

generate_lm_data¶

generate_lm_data(n: int, beta: ArrayLike, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate linear model data with known parameters.

y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon epsilon ~ N(0, sigma^2)

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of observations to generate.	required
`beta`	`ArrayLike`	True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1).	required
`sigma`	`float`	Residual standard deviation (default 1.0).	`1.0`
`x_type`	`Literal[‘gaussian’, ‘uniform’, ‘binary’]`	Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1}	`‘gaussian’`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix).	`None`
`seed`	`int \| None`	Random seed for reproducibility. If None, uses random state.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: polars DataFrame with columns y, x1, x2, ...
`tuple[DataFrame, dict]`	- true_params: dict with keys “beta” (array) and “sigma” (float)

Examples:

Old API (still works)::

data, params = generate_lm_data(n=100, beta=[1.0, 2.0], sigma=0.5)
data.shape
# (100, 2)
params["beta"]
# array([1., 2.])

New API with distribution objects::

from bossanova.distributions import normal, categorical

data, params = generate_lm_data(
    n=100,
    beta=[0.0, 0.5],
    distributions={"x1": normal(0, 2)},
)

generate_lmer_data¶

generate_lmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, sigma: float = 1.0, re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate linear mixed model data with known parameters.

y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij b_i ~ N(0, G) epsilon_ij ~ N(0, sigma^2)

Parameters:

Name	Type	Description	Default
`n_obs`	`int`	Total number of observations. If obs_per_group is None, observations are distributed evenly across groups.	required
`n_groups`	`int`	Number of groups/clusters.	required
`beta`	`ArrayLike`	Fixed effect coefficients [intercept, slope1, ...].	required
`theta`	`ArrayLike`	Random effect parameters. Interpretation depends on re_structure: - “intercept”: [sigma_intercept] - SD of random intercepts - “slope”: [sigma_slope] - SD of random slopes (no random intercept) - “both”: [sigma_intercept, sigma_slope, corr] or [sigma_int, sigma_slope] If length 2, assumes uncorrelated (corr=0).	required
`sigma`	`float`	Residual standard deviation (default 1.0).	`1.0`
`re_structure`	`Literal[‘intercept’, ‘slope’, ‘both’]`	Random effects structure: - “intercept”: Random intercepts only (1	group) - “slope”: Random slopes only (0 + x1
`obs_per_group`	`int \| None`	Observations per group. If None, uses n_obs // n_groups.	`None`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc.	`None`
`seed`	`int \| None`	Random seed for reproducibility.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: DataFrame with columns y, x1, group
`tuple[DataFrame, dict]`	- true_params: dict with beta, theta, sigma, and ranef (actual random effects)

Examples:

Old API (still works)::

# Random intercepts model
data, params = generate_lmer_data(
    n_obs=200, n_groups=20,
    beta=[1.0, 0.5],
    theta=[0.5],  # SD of random intercepts
    sigma=1.0,
)

New API with distribution objects::

from bossanova.distributions import uniform

data, params = generate_lmer_data(
    n_obs=200, n_groups=20,
    beta=[1.0, 0.5],
    theta=[0.5],
    distributions={"x1": uniform(-1, 1)},
)

Modules¶

generate¶

Predictor generation utilities for DGP functions.

Provides helper functions to generate predictor values from Distribution specifications.

Functions:

Name	Description
`generate_predictors`	Generate predictor values from distribution specifications.

Functions¶

generate_predictors¶

generate_predictors(distributions: dict[str, Any], n: int, rng: np.random.Generator) -> dict[str, np.ndarray]

Generate predictor values from distribution specifications.

Parameters:

Name	Type	Description	Default
`distributions`	`dict[str, Any]`	Mapping of variable name to Distribution object. Supported types: Normal, Uniform, Categorical, Exponential, Gamma, T.	required
`n`	`int`	Number of observations.	required
`rng`	`Generator`	NumPy random number generator.	required

Returns:

Type	Description
`dict[str, ndarray]`	Dict mapping variable names to arrays.

Examples:

>>> from bossanova.distributions import normal, categorical
>>> import numpy as np
>>> rng = np.random.default_rng(42)
>>> result = generate_predictors(
...     {"x": normal(0, 2), "group": categorical(["A", "B"])},
...     n=10,
...     rng=rng,
... )
>>> result["x"].shape
(10,)
>>> set(result["group"]).issubset({"A", "B"})
True

glm¶

Data generating process for generalized linear models.

Functions:

Name	Description
`generate_glm_data`	Generate GLM data with known parameters.

Attributes¶

Classes¶

Functions¶

generate_glm_data¶

generate_glm_data(n: int, beta: ArrayLike, family: FamilyType = 'gaussian', link: LinkType = None, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate GLM data with known parameters.

g(E[y]) = X @ beta g(E[y]) = X @ beta

where g is the link function determined by family/link.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of observations to generate.	required
`beta`	`ArrayLike`	True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1).	required
`family`	`FamilyType`	Response distribution: - “gaussian”: Normal distribution (default) - “binomial”: Binary outcomes (0/1) - “poisson”: Count data	`‘gaussian’`
`link`	`LinkType`	Link function. If None, uses canonical link for family: - gaussian: “identity” - binomial: “logit” - poisson: “log”	`None`
`sigma`	`float`	Residual SD (only used for gaussian family, default 1.0).	`1.0`
`x_type`	`Literal[‘gaussian’, ‘uniform’, ‘binary’]`	Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1}	`‘gaussian’`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix).	`None`
`seed`	`int \| None`	Random seed for reproducibility. If None, uses random state.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: polars DataFrame with columns y, x1, x2, ...
`tuple[DataFrame, dict]`	- true_params: dict with keys “beta”, “family”, “link”, and “sigma” (for gaussian)

Examples:

Old API (still works)::

# Logistic regression
data, params = generate_glm_data(
    n=200, beta=[0.0, 1.5], family="binomial"
)
data["y"].mean()  # Should be ~0.5 (balanced)

# Poisson regression
data, params = generate_glm_data(
    n=100, beta=[1.0, 0.5], family="poisson"
)

New API with distribution objects::

from bossanova.distributions import normal, uniform

data, params = generate_glm_data(
    n=200,
    beta=[0.0, 1.5],
    family="binomial",
    distributions={"x1": uniform(-2, 2)},
)

glmer¶

Data generating process for generalized linear mixed models.

Functions:

Name	Description
`generate_glmer_data`	Generate GLMM data with known parameters.

Attributes¶

Classes¶

Functions¶

generate_glmer_data¶

generate_glmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, family: FamilyType = 'binomial', re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate GLMM data with known parameters.

g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i b_i ~ N(0, G)

where g is the canonical link function for the family.

Parameters:

Name	Type	Description	Default
`n_obs`	`int`	Total number of observations.	required
`n_groups`	`int`	Number of groups/clusters.	required
`beta`	`ArrayLike`	Fixed effect coefficients [intercept, slope1, ...].	required
`theta`	`ArrayLike`	Random effect parameters (see generate_lmer_data for details).	required
`family`	`FamilyType`	Response distribution: - “binomial”: Binary outcomes (0/1) with logit link - “poisson”: Count data with log link	`‘binomial’`
`re_structure`	`Literal[‘intercept’, ‘slope’, ‘both’]`	Random effects structure: - “intercept”: Random intercepts only - “slope”: Random slopes only - “both”: Random intercepts and slopes	`‘intercept’`
`obs_per_group`	`int \| None`	Observations per group. If None, uses n_obs // n_groups.	`None`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc.	`None`
`seed`	`int \| None`	Random seed for reproducibility.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: DataFrame with columns y, x1, group
`tuple[DataFrame, dict]`	- true_params: dict with beta, theta, family, and ranef

Examples:

Old API (still works)::

# Binomial GLMM with random intercepts
data, params = generate_glmer_data(
    n_obs=400, n_groups=40,
    beta=[0.0, 1.0],
    theta=[0.5],  # SD of random intercepts
    family="binomial",
)

New API with distribution objects::

from bossanova.distributions import uniform

data, params = generate_glmer_data(
    n_obs=400, n_groups=40,
    beta=[0.0, 1.0],
    theta=[0.5],
    family="binomial",
    distributions={"x1": uniform(-1, 1)},
)

lm¶

Data generating process for linear models.

Functions:

Name	Description
`generate_lm_data`	Generate linear model data with known parameters.

Attributes¶

Classes¶

Functions¶

generate_lm_data¶

generate_lm_data(n: int, beta: ArrayLike, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate linear model data with known parameters.

y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon epsilon ~ N(0, sigma^2)

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of observations to generate.	required
`beta`	`ArrayLike`	True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1).	required
`sigma`	`float`	Residual standard deviation (default 1.0).	`1.0`
`x_type`	`Literal[‘gaussian’, ‘uniform’, ‘binary’]`	Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1}	`‘gaussian’`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix).	`None`
`seed`	`int \| None`	Random seed for reproducibility. If None, uses random state.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: polars DataFrame with columns y, x1, x2, ...
`tuple[DataFrame, dict]`	- true_params: dict with keys “beta” (array) and “sigma” (float)

Examples:

Old API (still works)::

data, params = generate_lm_data(n=100, beta=[1.0, 2.0], sigma=0.5)
data.shape
# (100, 2)
params["beta"]
# array([1., 2.])

New API with distribution objects::

from bossanova.distributions import normal, categorical

data, params = generate_lm_data(
    n=100,
    beta=[0.0, 0.5],
    distributions={"x1": normal(0, 2)},
)

lmer¶

Data generating process for linear mixed models.

Functions:

Name	Description
`generate_lmer_data`	Generate linear mixed model data with known parameters.

Attributes¶

Classes¶

Functions¶

generate_lmer_data¶

generate_lmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, sigma: float = 1.0, re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]

Generate linear mixed model data with known parameters.

y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij b_i ~ N(0, G) epsilon_ij ~ N(0, sigma^2)

Parameters:

Name	Type	Description	Default
`n_obs`	`int`	Total number of observations. If obs_per_group is None, observations are distributed evenly across groups.	required
`n_groups`	`int`	Number of groups/clusters.	required
`beta`	`ArrayLike`	Fixed effect coefficients [intercept, slope1, ...].	required
`theta`	`ArrayLike`	Random effect parameters. Interpretation depends on re_structure: - “intercept”: [sigma_intercept] - SD of random intercepts - “slope”: [sigma_slope] - SD of random slopes (no random intercept) - “both”: [sigma_intercept, sigma_slope, corr] or [sigma_int, sigma_slope] If length 2, assumes uncorrelated (corr=0).	required
`sigma`	`float`	Residual standard deviation (default 1.0).	`1.0`
`re_structure`	`Literal[‘intercept’, ‘slope’, ‘both’]`	Random effects structure: - “intercept”: Random intercepts only (1	group) - “slope”: Random slopes only (0 + x1
`obs_per_group`	`int \| None`	Observations per group. If None, uses n_obs // n_groups.	`None`
`distributions`	`dict[str, Distribution] \| None`	Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc.	`None`
`seed`	`int \| None`	Random seed for reproducibility.	`None`

Returns:

Type	Description
`DataFrame`	A tuple of (data, true_params) where:
`dict`	- data: DataFrame with columns y, x1, group
`tuple[DataFrame, dict]`	- true_params: dict with beta, theta, sigma, and ranef (actual random effects)

Examples:

Old API (still works)::

# Random intercepts model
data, params = generate_lmer_data(
    n_obs=200, n_groups=20,
    beta=[1.0, 0.5],
    theta=[0.5],  # SD of random intercepts
    sigma=1.0,
)

New API with distribution objects::

from bossanova.distributions import uniform

data, params = generate_lmer_data(
    n_obs=200, n_groups=20,
    beta=[1.0, 0.5],
    theta=[0.5],
    distributions={"x1": uniform(-1, 1)},
)

harness¶

Monte Carlo simulation harness for parameter recovery studies.

Provides infrastructure for running simulation studies and collecting results for bias, coverage, and power analysis.

Classes:

Name	Description
`MonteCarloResult`	Results from a Monte Carlo simulation study.

Functions:

Name	Description
`compute_mc_iteration`	Execute a single Monte Carlo iteration.
`run_monte_carlo`	Run a Monte Carlo simulation study.

Classes¶

MonteCarloResult¶

Results from a Monte Carlo simulation study.

Stores estimates, standard errors, confidence intervals, and p-values from repeated simulations, along with ground truth parameters.

Created by: run_monte_carlo() Consumed by: User analysis code (bias, coverage, power methods) Augmented by: Never (immutable result container)

Attributes:

Name	Type	Description
`estimates`	`ndarray`	Array of shape (n_sims, n_params) with coefficient estimates.
`std_errors`	`ndarray`	Array of shape (n_sims, n_params) with standard errors.
`ci_lower`	`ndarray`	Array of shape (n_sims, n_params) with CI lower bounds.
`ci_upper`	`ndarray`	Array of shape (n_sims, n_params) with CI upper bounds.
`p_values`	`ndarray`	Array of shape (n_sims, n_params) with p-values.
`true_values`	`dict[str, float]`	Named mapping of true parameter values (e.g., {“Intercept”: 0.0, “x”: 0.5}).
`param_names`	`tuple[str, ...]`	Tuple of parameter names matching array columns.
`n_sims`	`int`	Number of successful simulations.
`n_failed`	`int`	Number of simulations that failed to converge.

Examples:

result = run_monte_carlo(
    dgp_fn=generate_lm_data,
    dgp_params={"n": 100, "beta": [1.0, 2.0], "sigma": 1.0},
    fit_fn=lambda d: model("y ~ x1", data=d).fit(),
    n_sims=500,
)
result.bias("x1")  # Should be close to 0
result.coverage("x1")  # Should be close to 0.95

Functions:

Name	Description
`bias`	Compute bias for a parameter: E[beta_hat] - beta_true.
`coverage`	Compute coverage probability for a parameter.
`empirical_se`	Compute empirical SE (SD of estimates) for a parameter.
`get_idx`	Get array index for a parameter.
`get_true_value`	Get true value for a parameter.
`mean_se`	Compute mean estimated standard error for a parameter.
`power`	Compute power for a parameter.
`rmse`	Compute RMSE for a parameter.
`summary`	Return summary DataFrame with all metrics for all parameters.
`type1_rate`	Compute Type I error rate for a parameter.

Attributes¶

ci_lower¶

ci_lower: np.ndarray = field(validator=is_ndarray)

ci_upper¶

ci_upper: np.ndarray = field(validator=is_ndarray)

estimates¶

estimates: np.ndarray = field(validator=is_ndarray)

n_failed¶

n_failed: int = field(default=0)

n_sims¶

n_sims: int = field(default=0)

p_values¶

p_values: np.ndarray = field(validator=is_ndarray)

param_names¶

param_names: tuple[str, ...] = field(converter=tuple, factory=tuple)

std_errors¶

std_errors: np.ndarray = field(validator=is_ndarray)

true_values¶

true_values: dict[str, float] = field(factory=dict)

Functions¶

bias¶

bias(param: str | int) -> float

Compute bias for a parameter: E[beta_hat] - beta_true.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	Bias value.

coverage¶

coverage(param: str | int, level: float = 0.95) -> float

Compute coverage probability for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required
`level`	`float`	Confidence level (default 0.95). Currently only used for documentation; actual CIs come from the model.	`0.95`

Returns:

Type	Description
`float`	Coverage probability (0 to 1).

empirical_se¶

empirical_se(param: str | int) -> float

Compute empirical SE (SD of estimates) for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	Empirical SE.

get_idx¶

get_idx(param: str | int) -> int

Get array index for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`int`	Integer index into the parameter arrays.

get_true_value¶

get_true_value(param: str | int) -> float

Get true value for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	True parameter value.

mean_se¶

mean_se(param: str | int) -> float

Compute mean estimated standard error for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	Mean SE across simulations.

power¶

power(param: str | int, alpha: float = 0.05) -> float

Compute power for a parameter.

This is the rejection rate when the true effect is non-zero. Use this for parameters where true_value != 0.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required
`alpha`	`float`	Significance level (default 0.05).	`0.05`

Returns:

Type	Description
`float`	Power (probability of correctly rejecting H0).

rmse¶

rmse(param: str | int) -> float

Compute RMSE for a parameter.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required

Returns:

Type	Description
`float`	Root mean squared error.

summary¶

summary() -> pl.DataFrame

Return summary DataFrame with all metrics for all parameters.

Returns:

Type	Description
`DataFrame`	DataFrame with columns: param, true_value, mean_est, bias,
`DataFrame`	rmse, mean_se, empirical_se, coverage, rejection_rate.

type1_rate¶

type1_rate(param: str | int, alpha: float = 0.05) -> float

Compute Type I error rate for a parameter.

This is the rejection rate when the true effect is zero. Use this for parameters where true_value = 0.

Parameters:

Name	Type	Description	Default
`param`	`str \| int`	Parameter name or index.	required
`alpha`	`float`	Significance level (default 0.05).	`0.05`

Returns:

Type	Description
`float`	Type I error rate (should be ~ alpha if correctly calibrated).

Functions¶

compute_mc_iteration¶

compute_mc_iteration(seed: int, dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], dgp_params: dict[str, Any], fit_fn: Callable[[pl.DataFrame], Any]) -> dict[str, Any] | None

Execute a single Monte Carlo iteration.

Standalone function (not a closure) for joblib serialization compatibility. Returns None if fitting fails.

Parameters:

Name	Type	Description	Default
`seed`	`int`	Random seed for this iteration.	required
`dgp_fn`	`Callable..., [tuple[DataFrame, dict[str, Any]]]`	Data generating function returning (data, true_params).	required
`dgp_params`	`dict[str, Any]`	Parameters to pass to dgp_fn (excluding seed).	required
`fit_fn`	`Callable[[DataFrame], Any]`	Function that takes a DataFrame and returns a fitted model.	required

Returns:

Type	Description
`dict[str, Any] \| None`	Dictionary with ‘params_df’ and ‘true_params’ on success, or
`dict[str, Any] \| None`	None if the model fit fails.

run_monte_carlo¶

run_monte_carlo(dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], dgp_params: dict[str, Any], fit_fn: Callable[[pl.DataFrame], Any], n_sims: int = 1000, seed: int = 42, n_jobs: int = 1, verbose: bool = False) -> MonteCarloResult

Run a Monte Carlo simulation study.

Generates data from a DGP, fits models, and collects results for parameter recovery analysis.

Parameters:

Name	Type	Description	Default
`dgp_fn`	`Callable..., [tuple[DataFrame, dict[str, Any]]]`	Data generating function that returns (data, true_params). Must accept a `seed` parameter.	required
`dgp_params`	`dict[str, Any]`	Parameters to pass to dgp_fn (excluding seed).	required
`fit_fn`	`Callable[[DataFrame], Any]`	Function that takes a DataFrame and returns a fitted model. The model must expose a `.params` property returning a Polars DataFrame with columns matching `Col.TERM`, `Col.ESTIMATE`, etc.	required
`n_sims`	`int`	Number of simulations to run (default 1000).	`1000`
`seed`	`int`	Base random seed (default 42). Each simulation uses seed + i.	`42`
`n_jobs`	`int`	Number of parallel jobs. 1 = sequential, >1 = parallel via joblib (default 1).	`1`
`verbose`	`bool`	If True, show tqdm progress bar (default False).	`False`

Returns:

Type	Description
`MonteCarloResult`	MonteCarloResult with collected estimates and metrics.

Examples:

from bossanova import model
from simulation import generate_lm_data
result = run_monte_carlo(
    dgp_fn=generate_lm_data,
    dgp_params={"n": 100, "beta": [1.0, 2.0], "sigma": 1.0},
    fit_fn=lambda d: model("y ~ x1", d).fit().infer(),
    n_sims=500,
)
abs(result.bias("x1")) < 0.1
# True

Modules¶

lifecycle¶

Simulate lifecycle orchestration.

Owns the three-branch simulate dispatch: power analysis, post-fit simulation, and pre-fit data generation. Called by model.simulate() so the model class stays a thin facade.

Classes:

Name	Description
`SimulateResult`	Immutable result of the simulate lifecycle.

Functions:

Name	Description
`execute_simulate`	Execute simulation: power analysis, post-fit sampling, or pre-fit generation.

Classes¶

SimulateResult¶

Immutable result of the simulate lifecycle.

Attributes:

Name	Type	Description
`mode`	`str`	One of `"power"`, `"simulate"`, or `"generate"`.
`simulations`	`DataFrame \| None`	Simulated response DataFrame (post-fit mode).
`generated_data`	`DataFrame \| None`	Generated DataFrame (pre-fit mode).
`power`	`MonteCarloResult \| None`	Monte Carlo power result (power mode).

Attributes¶

generated_data¶

generated_data: pl.DataFrame | None = None

mode¶

mode: str

power¶

power: MonteCarloResult | None = None

simulations¶

simulations: pl.DataFrame | None = None

Functions¶

execute_simulate¶

execute_simulate(spec: ModelSpec, bundle: DataBundle | None, fit: FitState | None, data: pl.DataFrame | None, formula: str, is_mixed: bool, n: int | None, nsim: int | None, seed: int | None, coef: dict[str, float] | None, sigma: float, varying: str, power: int | dict[str, Any] | None, var_specs: dict) -> SimulateResult

Execute simulation: power analysis, post-fit sampling, or pre-fit generation.

Parameters:

Name	Type	Description	Default
`spec`	`ModelSpec`	Model specification.	required
`bundle`	`DataBundle \| None`	Data bundle (None if pre-fit).	required
`fit`	`FitState \| None`	Fitted state (None if pre-fit).	required
`data`	`DataFrame \| None`	Current data (None if simulation-first).	required
`formula`	`str`	Formula string.	required
`is_mixed`	`bool`	Whether this is a mixed-effects model.	required
`n`	`int \| None`	Sample size for pre-fit generation or power analysis.	required
`nsim`	`int \| None`	Number of simulated response vectors (post-fit).	required
`seed`	`int \| None`	Random seed.	required
`coef`	`dict[str, float] \| None`	True coefficient values (pre-fit mode).	required
`sigma`	`float`	Residual SD (pre-fit gaussian).	required
`varying`	`str`	RE handling in post-fit (`"fitted"` or `"sample"`).	required
`power`	`int \| dict[str, Any] \| None`	Power analysis config (int or dict), or None.	required
`var_specs`	`dict`	Distribution specs for predictors.	required

Returns:

Type	Description
`SimulateResult`	SimulateResult with populated fields for the chosen mode.

metrics¶

Metrics for Monte Carlo simulation studies.

Functions:

Name	Description
`bias`	Compute bias: E[beta_hat] - beta_true.
`compute_wilson_ci`	Wilson score confidence interval for a binomial proportion.
`coverage`	Compute coverage probability.
`empirical_se`	Compute empirical standard error (SD of estimates).
`mean_se`	Compute mean of standard errors across simulations.
`rejection_rate`	Compute rejection rate (proportion of p-values < alpha).
`rmse`	Compute root mean squared error.

Functions¶

bias¶

bias(estimates: ArrayLike, true_value: float) -> float

Compute bias: E[beta_hat] - beta_true.

Parameters:

Name	Type	Description	Default
`estimates`	`ArrayLike`	Array of parameter estimates from simulations.	required
`true_value`	`float`	True parameter value.	required

Returns:

Type	Description
`float`	Bias (positive = overestimate, negative = underestimate).

Examples:

estimates = np.array([1.02, 0.98, 1.01, 0.99, 1.00])
bias(estimates, true_value=1.0)
# 0.0

compute_wilson_ci¶

compute_wilson_ci(p_hat: float, n: int, level: float = 0.95) -> tuple[float, float]

Wilson score confidence interval for a binomial proportion.

Better coverage than the Wald interval near p=0 or p=1.

Parameters:

Name	Type	Description	Default
`p_hat`	`float`	Observed proportion (0 to 1).	required
`n`	`int`	Number of trials.	required
`level`	`float`	Confidence level (default 0.95).	`0.95`

Returns:

Type	Description
`tuple[float, float]`	Tuple of (lower, upper) bounds.

Examples:

compute_wilson_ci(0.8, n=100, level=0.95)
# (0.711..., 0.868...)

coverage¶

coverage(ci_lower: ArrayLike, ci_upper: ArrayLike, true_value: float) -> float

Compute coverage probability.

Coverage = P(CI_lower <= beta_true <= CI_upper)

For a correctly calibrated 95% CI, coverage should be ~0.95.

Parameters:

Name	Type	Description	Default
`ci_lower`	`ArrayLike`	Array of CI lower bounds from simulations.	required
`ci_upper`	`ArrayLike`	Array of CI upper bounds from simulations.	required
`true_value`	`float`	True parameter value.	required

Returns:

Type	Description
`float`	Coverage probability (0 to 1).

Examples:

ci_lower = np.array([0.8, 0.9, 0.85, 0.95, 0.88])
ci_upper = np.array([1.2, 1.1, 1.15, 1.05, 1.12])
coverage(ci_lower, ci_upper, true_value=1.0)
# 1.0

empirical_se¶

empirical_se(estimates: ArrayLike) -> float

Compute empirical standard error (SD of estimates).

This is the “true” standard error as estimated from the simulation distribution.

Parameters:

Name	Type	Description	Default
`estimates`	`ArrayLike`	Array of parameter estimates from simulations.	required

Returns:

Type	Description
`float`	Empirical SE (standard deviation of estimates).

mean_se¶

mean_se(std_errors: ArrayLike) -> float

Compute mean of standard errors across simulations.

Parameters:

Name	Type	Description	Default
`std_errors`	`ArrayLike`	Array of SE estimates from simulations.	required

Returns:

Type	Description
`float`	Mean SE.

rejection_rate¶

rejection_rate(p_values: ArrayLike, alpha: float = 0.05) -> float

Compute rejection rate (proportion of p-values < alpha).

For null effects (beta=0), this estimates Type I error rate. For non-null effects (beta!=0), this estimates power.

Parameters:

Name	Type	Description	Default
`p_values`	`ArrayLike`	Array of p-values from simulations.	required
`alpha`	`float`	Significance level (default 0.05).	`0.05`

Returns:

Type	Description
`float`	Rejection rate (0 to 1).

Examples:

p_values = np.array([0.01, 0.06, 0.03, 0.12, 0.04])
rejection_rate(p_values, alpha=0.05)
# 0.6

rmse¶

rmse(estimates: ArrayLike, true_value: float) -> float

Compute root mean squared error.

RMSE = sqrt(E[(beta_hat - beta)^2]) = sqrt(Var(beta_hat) + Bias^2)

Parameters:

Name	Type	Description	Default
`estimates`	`ArrayLike`	Array of parameter estimates from simulations.	required
`true_value`	`float`	True parameter value.	required

Returns:

Type	Description
`float`	RMSE value.

Examples:

estimates = np.array([1.1, 0.9, 1.0, 1.0, 1.0])
rmse(estimates, true_value=1.0)
# 0.063...

model_sim¶

Model-level simulation operations.

Provides pre-fit data generation (generate_data_from_spec) and post-fit response simulation (simulate_responses_from_fit) as pure functions on containers. Extracted from model/core.py to enforce the “model is glue” boundary.

Functions:

Name	Description
`compute_mu_with_new_re`	Compute conditional mean with newly sampled random effects.
`generate_data_from_spec`	Generate a synthetic dataset from a simulation specification.
`simulate_responses_from_fit`	Simulate new responses from a fitted model.

Classes¶

Functions¶

compute_mu_with_new_re¶

compute_mu_with_new_re(bundle: 'DataBundle', fit: 'FitState', spec: 'ModelSpec', rng: np.random.Generator) -> np.ndarray

Compute conditional mean with newly sampled random effects.

Draws u_new ~ N(0, I), transforms via the relative covariance factor Lambda to get new BLUPs, then computes mu = X @ coef + Z @ (Lambda @ u_new * sigma).

Parameters:

Name	Type	Description	Default
`bundle`	`‘DataBundle’`	Data bundle with X, Z matrices and RE metadata.	required
`fit`	`‘FitState’`	Fit state with theta, sigma, and coefficients.	required
`spec`	`‘ModelSpec’`	Model specification with family and link info.	required
`rng`	`Generator`	NumPy random generator for reproducibility.	required

Returns:

Type	Description
`ndarray`	New mu values on the response scale, shape `(n,)`.

generate_data_from_spec¶

generate_data_from_spec(sim_spec: 'SimulationSpec', family: str, response_var: str) -> pl.DataFrame

Generate a synthetic dataset from a simulation specification.

Parameters:

Name	Type	Description	Default
`sim_spec`	`‘SimulationSpec’`	Simulation specification with sample size, distributions, coefficients, and random effect structure.	required
`family`	`str`	GLM family name (e.g. “gaussian”, “binomial”, “poisson”).	required
`response_var`	`str`	Name for the response column in the output DataFrame.	required

Returns:

Type	Description
`DataFrame`	Polars DataFrame with generated predictor and response columns.

simulate_responses_from_fit¶

simulate_responses_from_fit(fit: 'FitState', bundle: 'DataBundle', spec: 'ModelSpec', nsim: int, seed: int | None, varying: str = 'fitted') -> pl.DataFrame

Simulate new responses from a fitted model.

Parameters:

Name	Type	Description	Default
`fit`	`‘FitState’`	Fitted model state with coefficients, fitted values, and sigma.	required
`bundle`	`‘DataBundle’`	Data bundle with design matrix and RE metadata.	required
`spec`	`‘ModelSpec’`	Model specification with family and link information.	required
`nsim`	`int`	Number of simulations to generate.	required
`seed`	`int \| None`	Random seed for reproducibility, or None.	required
`varying`	`str`	How to handle random effects. `"fitted"` uses estimated BLUPs, `"sample"` draws new random effects each simulation.	`‘fitted’`

Returns:

Type	Description
`DataFrame`	DataFrame with columns `sim_1`, `sim_2`, ..., `sim_nsim`.

power¶

Power analysis via simulation sweep grids.

Provides utilities for constructing full-factorial parameter grids and running simulation-based power analyses across those grids. Each grid point triggers a Monte Carlo study via run_monte_carlo(), and the results are aggregated into a Polars DataFrame with per-parameter power, coverage, bias, RMSE, and other diagnostics.

Functions:

Name	Description
`expand_sweep_grid`	Full factorial grid from base DGP + power sweep overrides.
`run_power_analysis`	Run simulation-based power analysis for a model formula.
`run_power_study`	Run power analysis across a sweep grid.

Classes¶

Functions¶

expand_sweep_grid¶

expand_sweep_grid(base_n: int, base_coef: dict[str, float], base_sigma: float, base_varying: dict[str, dict[str, Any]], power_config: dict[str, Any]) -> list[dict[str, Any]]

Full factorial grid from base DGP + power sweep overrides.

Uses itertools.product for Cartesian expansion. Handles nested coef dict merging (e.g., sweeping one coefficient while keeping others fixed).

Parameters:

Name	Type	Description	Default
`base_n`	`int`	Base sample size.	required
`base_coef`	`dict[str, float]`	Base coefficient values (e.g., `{"Intercept": 0.0, "x": 0.5}`).	required
`base_sigma`	`float`	Base residual standard deviation.	required
`base_varying`	`dict[str, dict[str, Any]]`	Base random effect specs (empty dict for fixed-effects models).	required
`power_config`	`dict[str, Any]`	Sweep configuration dict. Keys are sweep axes (`"n"`, `"coef"`, `"sigma"`, `"varying"`). Non-sweep control keys (`"n_sims"`, `"alpha"`, `"conf_level"`, `"n_jobs"`, `"verbose"`, `"seed"`) are ignored.	required

Returns:

Type	Description
`list[dict[str, Any]]`	List of grid point dicts, each with keys: `"n"`, `"coef"`,
`list[dict[str, Any]]`	`"sigma"`, `"varying"`.

Examples:

>>> grid = expand_sweep_grid(
...     base_n=100, base_coef={"x": 0.5}, base_sigma=1.0,
...     base_varying={},
...     power_config={"n": [50, 100, 200]},
... )
>>> len(grid)
3
>>> grid[0]["n"]
50

run_power_analysis¶

run_power_analysis(*, formula: str, family: str, response_var: str, power: int | dict[str, Any], n: int, seed: int | None, coef: dict[str, float] | None, sigma: float, var_specs: dict[str, object]) -> pl.DataFrame

Run simulation-based power analysis for a model formula.

Pure-function extraction of the power analysis logic from the model class. Builds a sweep grid, constructs DGP and fit closures, and delegates to run_power_study().

Parameters:

Name	Type	Description	Default
`formula`	`str`	R-style model formula string.	required
`family`	`str`	Response distribution family name.	required
`response_var`	`str`	Name of the response variable.	required
`power`	`int \| dict[str, Any]`	Power config — int (n_sims) or dict with sweep axes and control params (`n_sims`, `alpha`, `conf_level`, `n_jobs`, `verbose`, `seed`).	required
`n`	`int`	Base sample size.	required
`seed`	`int \| None`	Random seed (used as fallback if power config has no seed).	required
`coef`	`dict[str, float] \| None`	Base coefficient values (or None for defaults).	required
`sigma`	`float`	Base residual standard deviation.	required
`var_specs`	`dict[str, object]`	Distribution specs for predictors (may include `Varying` instances for random effects).	required

Returns:

Type	Description
`DataFrame`	Polars DataFrame with one row per grid point per parameter.
`DataFrame`	Columns include sweep axes, power metrics, and diagnostics.

run_power_study¶

run_power_study(sweep_grid: list[dict[str, Any]], dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], fit_fn: Callable[[pl.DataFrame], Any], n_sims: int, seed: int, alpha: float = 0.05, ci_level: float = 0.95, n_jobs: int = 1, verbose: bool = False) -> pl.DataFrame

Run power analysis across a sweep grid.

Parameters:

Name	Type	Description	Default
`sweep_grid`	`list[dict[str, Any]]`	List of grid point dicts from `expand_sweep_grid()`. Each dict has keys `"n"`, `"coef"`, `"sigma"`, `"varying"`.	required
`dgp_fn`	`Callable..., [tuple[DataFrame, dict[str, Any]]]`	Data generating function that returns `(data, true_params)`. Must accept a `seed` parameter.	required
`fit_fn`	`Callable[[DataFrame], Any]`	Function that takes a DataFrame and returns a fitted model with a `.params` property.	required
`n_sims`	`int`	Number of simulations per grid point.	required
`seed`	`int`	Base random seed. Each grid point offsets by `i * n_sims`.	required
`alpha`	`float`	Significance level for power (default 0.05).	`0.05`
`ci_level`	`float`	Confidence level for Wilson CIs on power (default 0.95).	`0.95`
`n_jobs`	`int`	Number of parallel jobs for `run_monte_carlo()` (default 1).	`1`
`verbose`	`bool`	Show progress (default False).	`False`

Returns:

Type	Description
`DataFrame`	Polars DataFrame with one row per grid point per parameter.
`DataFrame`	Columns include sweep axes (`n`, `sigma`), parameter info
`DataFrame`	(`term`, `true_value`), power metrics (`power`,
`DataFrame`	`power_ci_lower`, `power_ci_upper`), and diagnostics
`DataFrame`	(`coverage`, `bias`, `rmse`, `mean_se`, `empirical_se`,
`DataFrame`	`n_sims`, `n_failed`).