Simulation utilities for parameter recovery and Monte Carlo studies.
Call chain:
model.simulate(n=...) -> generate_{lm,glm,lmer,glmer}_data() (pre-fit DGP)
model.simulate(nsim=...) -> simulate_responses_from_fit() (post-fit parametric bootstrap)
run_power_study(formula, ...) -> run_power_analysis() -> expand_sweep_grid()Classes:
| Name | Description |
|---|---|
MonteCarloResult | Results from a Monte Carlo simulation study. |
SimulateResult | Immutable result of the simulate lifecycle. |
Functions:
| Name | Description |
|---|---|
bias | Compute bias: E[beta_hat] - beta_true. |
compute_mc_iteration | Execute a single Monte Carlo iteration. |
compute_mu_with_new_re | Compute conditional mean with newly sampled random effects. |
compute_wilson_ci | Wilson score confidence interval for a binomial proportion. |
coverage | Compute coverage probability. |
empirical_se | Compute empirical standard error (SD of estimates). |
execute_simulate | Execute simulation: power analysis, post-fit sampling, or pre-fit generation. |
expand_sweep_grid | Full factorial grid from base DGP + power sweep overrides. |
generate_data_from_spec | Generate a synthetic dataset from a simulation specification. |
generate_glm_data | Generate GLM data with known parameters. |
generate_glmer_data | Generate GLMM data with known parameters. |
generate_lm_data | Generate linear model data with known parameters. |
generate_lmer_data | Generate linear mixed model data with known parameters. |
mean_se | Compute mean of standard errors across simulations. |
rejection_rate | Compute rejection rate (proportion of p-values < alpha). |
rmse | Compute root mean squared error. |
run_monte_carlo | Run a Monte Carlo simulation study. |
run_power_analysis | Run simulation-based power analysis for a model formula. |
run_power_study | Run power analysis across a sweep grid. |
simulate_responses_from_fit | Simulate new responses from a fitted model. |
Modules:
| Name | Description |
|---|---|
dgp | Data generating process (DGP) utilities for simulation studies. |
harness | Monte Carlo simulation harness for parameter recovery studies. |
lifecycle | Simulate lifecycle orchestration. |
metrics | Metrics for Monte Carlo simulation studies. |
model_sim | Model-level simulation operations. |
power | Power analysis via simulation sweep grids. |
Classes¶
MonteCarloResult¶
Results from a Monte Carlo simulation study.
Stores estimates, standard errors, confidence intervals, and p-values from repeated simulations, along with ground truth parameters.
Created by: run_monte_carlo() Consumed by: User analysis code (bias, coverage, power methods) Augmented by: Never (immutable result container)
Attributes:
| Name | Type | Description |
|---|---|---|
estimates | ndarray | Array of shape (n_sims, n_params) with coefficient estimates. |
std_errors | ndarray | Array of shape (n_sims, n_params) with standard errors. |
ci_lower | ndarray | Array of shape (n_sims, n_params) with CI lower bounds. |
ci_upper | ndarray | Array of shape (n_sims, n_params) with CI upper bounds. |
p_values | ndarray | Array of shape (n_sims, n_params) with p-values. |
true_values | dict[str, float] | Named mapping of true parameter values (e.g., {“Intercept”: 0.0, “x”: 0.5}). |
param_names | tuple[str, ...] | Tuple of parameter names matching array columns. |
n_sims | int | Number of successful simulations. |
n_failed | int | Number of simulations that failed to converge. |
Examples:
result = run_monte_carlo(
dgp_fn=generate_lm_data,
dgp_params={"n": 100, "beta": [1.0, 2.0], "sigma": 1.0},
fit_fn=lambda d: model("y ~ x1", data=d).fit(),
n_sims=500,
)
result.bias("x1") # Should be close to 0
result.coverage("x1") # Should be close to 0.95Functions:
| Name | Description |
|---|---|
bias | Compute bias for a parameter: E[beta_hat] - beta_true. |
coverage | Compute coverage probability for a parameter. |
empirical_se | Compute empirical SE (SD of estimates) for a parameter. |
get_idx | Get array index for a parameter. |
get_true_value | Get true value for a parameter. |
mean_se | Compute mean estimated standard error for a parameter. |
power | Compute power for a parameter. |
rmse | Compute RMSE for a parameter. |
summary | Return summary DataFrame with all metrics for all parameters. |
type1_rate | Compute Type I error rate for a parameter. |
Attributes¶
ci_lower¶
ci_lower: np.ndarray = field(validator=is_ndarray)ci_upper¶
ci_upper: np.ndarray = field(validator=is_ndarray)estimates¶
estimates: np.ndarray = field(validator=is_ndarray)n_failed¶
n_failed: int = field(default=0)n_sims¶
n_sims: int = field(default=0)p_values¶
p_values: np.ndarray = field(validator=is_ndarray)param_names¶
param_names: tuple[str, ...] = field(converter=tuple, factory=tuple)std_errors¶
std_errors: np.ndarray = field(validator=is_ndarray)true_values¶
true_values: dict[str, float] = field(factory=dict)Functions¶
bias¶
bias(param: str | int) -> floatCompute bias for a parameter: E[beta_hat] - beta_true.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | Bias value. |
coverage¶
coverage(param: str | int, level: float = 0.95) -> floatCompute coverage probability for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
level | float | Confidence level (default 0.95). Currently only used for documentation; actual CIs come from the model. | 0.95 |
Returns:
| Type | Description |
|---|---|
float | Coverage probability (0 to 1). |
empirical_se¶
empirical_se(param: str | int) -> floatCompute empirical SE (SD of estimates) for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | Empirical SE. |
get_idx¶
get_idx(param: str | int) -> intGet array index for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
int | Integer index into the parameter arrays. |
get_true_value¶
get_true_value(param: str | int) -> floatGet true value for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | True parameter value. |
mean_se¶
mean_se(param: str | int) -> floatCompute mean estimated standard error for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | Mean SE across simulations. |
power¶
power(param: str | int, alpha: float = 0.05) -> floatCompute power for a parameter.
This is the rejection rate when the true effect is non-zero. Use this for parameters where true_value != 0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
alpha | float | Significance level (default 0.05). | 0.05 |
Returns:
| Type | Description |
|---|---|
float | Power (probability of correctly rejecting H0). |
rmse¶
rmse(param: str | int) -> floatCompute RMSE for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | Root mean squared error. |
summary¶
summary() -> pl.DataFrameReturn summary DataFrame with all metrics for all parameters.
Returns:
| Type | Description |
|---|---|
DataFrame | DataFrame with columns: param, true_value, mean_est, bias, |
DataFrame | rmse, mean_se, empirical_se, coverage, rejection_rate. |
type1_rate¶
type1_rate(param: str | int, alpha: float = 0.05) -> floatCompute Type I error rate for a parameter.
This is the rejection rate when the true effect is zero. Use this for parameters where true_value = 0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
alpha | float | Significance level (default 0.05). | 0.05 |
Returns:
| Type | Description |
|---|---|
float | Type I error rate (should be ~ alpha if correctly calibrated). |
SimulateResult¶
Immutable result of the simulate lifecycle.
Attributes:
| Name | Type | Description |
|---|---|---|
mode | str | One of "power", "simulate", or "generate". |
simulations | DataFrame | None | Simulated response DataFrame (post-fit mode). |
generated_data | DataFrame | None | Generated DataFrame (pre-fit mode). |
power | MonteCarloResult | None | Monte Carlo power result (power mode). |
Attributes¶
generated_data¶
generated_data: pl.DataFrame | None = Nonemode¶
mode: strpower¶
power: MonteCarloResult | None = Nonesimulations¶
simulations: pl.DataFrame | None = NoneFunctions¶
bias¶
bias(estimates: ArrayLike, true_value: float) -> floatCompute bias: E[beta_hat] - beta_true.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
estimates | ArrayLike | Array of parameter estimates from simulations. | required |
true_value | float | True parameter value. | required |
Returns:
| Type | Description |
|---|---|
float | Bias (positive = overestimate, negative = underestimate). |
Examples:
estimates = np.array([1.02, 0.98, 1.01, 0.99, 1.00])
bias(estimates, true_value=1.0)
# 0.0compute_mc_iteration¶
compute_mc_iteration(seed: int, dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], dgp_params: dict[str, Any], fit_fn: Callable[[pl.DataFrame], Any]) -> dict[str, Any] | NoneExecute a single Monte Carlo iteration.
Standalone function (not a closure) for joblib serialization compatibility. Returns None if fitting fails.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
seed | int | Random seed for this iteration. | required |
dgp_fn | Callable..., [tuple[DataFrame, dict[str, Any]]] | Data generating function returning (data, true_params). | required |
dgp_params | dict[str, Any] | Parameters to pass to dgp_fn (excluding seed). | required |
fit_fn | Callable[[DataFrame], Any] | Function that takes a DataFrame and returns a fitted model. | required |
Returns:
| Type | Description |
|---|---|
dict[str, Any] | None | Dictionary with ‘params_df’ and ‘true_params’ on success, or |
dict[str, Any] | None | None if the model fit fails. |
compute_mu_with_new_re¶
compute_mu_with_new_re(bundle: 'DataBundle', fit: 'FitState', spec: 'ModelSpec', rng: np.random.Generator) -> np.ndarrayCompute conditional mean with newly sampled random effects.
Draws u_new ~ N(0, I), transforms via the relative covariance factor
Lambda to get new BLUPs, then computes mu = X @ coef + Z @ (Lambda @ u_new * sigma).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bundle | ‘DataBundle’ | Data bundle with X, Z matrices and RE metadata. | required |
fit | ‘FitState’ | Fit state with theta, sigma, and coefficients. | required |
spec | ‘ModelSpec’ | Model specification with family and link info. | required |
rng | Generator | NumPy random generator for reproducibility. | required |
Returns:
| Type | Description |
|---|---|
ndarray | New mu values on the response scale, shape (n,). |
compute_wilson_ci¶
compute_wilson_ci(p_hat: float, n: int, level: float = 0.95) -> tuple[float, float]Wilson score confidence interval for a binomial proportion.
Better coverage than the Wald interval near p=0 or p=1.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
p_hat | float | Observed proportion (0 to 1). | required |
n | int | Number of trials. | required |
level | float | Confidence level (default 0.95). | 0.95 |
Returns:
| Type | Description |
|---|---|
tuple[float, float] | Tuple of (lower, upper) bounds. |
Examples:
compute_wilson_ci(0.8, n=100, level=0.95)
# (0.711..., 0.868...)coverage¶
coverage(ci_lower: ArrayLike, ci_upper: ArrayLike, true_value: float) -> floatCompute coverage probability.
Coverage = P(CI_lower <= beta_true <= CI_upper)
For a correctly calibrated 95% CI, coverage should be ~0.95.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ci_lower | ArrayLike | Array of CI lower bounds from simulations. | required |
ci_upper | ArrayLike | Array of CI upper bounds from simulations. | required |
true_value | float | True parameter value. | required |
Returns:
| Type | Description |
|---|---|
float | Coverage probability (0 to 1). |
Examples:
ci_lower = np.array([0.8, 0.9, 0.85, 0.95, 0.88])
ci_upper = np.array([1.2, 1.1, 1.15, 1.05, 1.12])
coverage(ci_lower, ci_upper, true_value=1.0)
# 1.0empirical_se¶
empirical_se(estimates: ArrayLike) -> floatCompute empirical standard error (SD of estimates).
This is the “true” standard error as estimated from the simulation distribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
estimates | ArrayLike | Array of parameter estimates from simulations. | required |
Returns:
| Type | Description |
|---|---|
float | Empirical SE (standard deviation of estimates). |
execute_simulate¶
execute_simulate(spec: ModelSpec, bundle: DataBundle | None, fit: FitState | None, data: pl.DataFrame | None, formula: str, is_mixed: bool, n: int | None, nsim: int | None, seed: int | None, coef: dict[str, float] | None, sigma: float, varying: str, power: int | dict[str, Any] | None, var_specs: dict) -> SimulateResultExecute simulation: power analysis, post-fit sampling, or pre-fit generation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
spec | ModelSpec | Model specification. | required |
bundle | DataBundle | None | Data bundle (None if pre-fit). | required |
fit | FitState | None | Fitted state (None if pre-fit). | required |
data | DataFrame | None | Current data (None if simulation-first). | required |
formula | str | Formula string. | required |
is_mixed | bool | Whether this is a mixed-effects model. | required |
n | int | None | Sample size for pre-fit generation or power analysis. | required |
nsim | int | None | Number of simulated response vectors (post-fit). | required |
seed | int | None | Random seed. | required |
coef | dict[str, float] | None | True coefficient values (pre-fit mode). | required |
sigma | float | Residual SD (pre-fit gaussian). | required |
varying | str | RE handling in post-fit ("fitted" or "sample"). | required |
power | int | dict[str, Any] | None | Power analysis config (int or dict), or None. | required |
var_specs | dict | Distribution specs for predictors. | required |
Returns:
| Type | Description |
|---|---|
SimulateResult | SimulateResult with populated fields for the chosen mode. |
expand_sweep_grid¶
expand_sweep_grid(base_n: int, base_coef: dict[str, float], base_sigma: float, base_varying: dict[str, dict[str, Any]], power_config: dict[str, Any]) -> list[dict[str, Any]]Full factorial grid from base DGP + power sweep overrides.
Uses itertools.product for Cartesian expansion. Handles nested
coef dict merging (e.g., sweeping one coefficient while keeping
others fixed).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_n | int | Base sample size. | required |
base_coef | dict[str, float] | Base coefficient values (e.g., {"Intercept": 0.0, "x": 0.5}). | required |
base_sigma | float | Base residual standard deviation. | required |
base_varying | dict[str, dict[str, Any]] | Base random effect specs (empty dict for fixed-effects models). | required |
power_config | dict[str, Any] | Sweep configuration dict. Keys are sweep axes ("n", "coef", "sigma", "varying"). Non-sweep control keys ("n_sims", "alpha", "conf_level", "n_jobs", "verbose", "seed") are ignored. | required |
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]] | List of grid point dicts, each with keys: "n", "coef", |
list[dict[str, Any]] | "sigma", "varying". |
Examples:
>>> grid = expand_sweep_grid(
... base_n=100, base_coef={"x": 0.5}, base_sigma=1.0,
... base_varying={},
... power_config={"n": [50, 100, 200]},
... )
>>> len(grid)
3
>>> grid[0]["n"]
50generate_data_from_spec¶
generate_data_from_spec(sim_spec: 'SimulationSpec', family: str, response_var: str) -> pl.DataFrameGenerate a synthetic dataset from a simulation specification.
Implements the data-generating process (DGP): samples predictors from their distributions, constructs the design matrix, computes the linear predictor, and draws response values from the appropriate family.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sim_spec | ‘SimulationSpec’ | Simulation specification with sample size, distributions, coefficients, and random effect structure. | required |
family | str | GLM family name (e.g. “gaussian”, “binomial”, “poisson”). | required |
response_var | str | Name for the response column in the output DataFrame. | required |
Returns:
| Type | Description |
|---|---|
DataFrame | Polars DataFrame with generated predictor and response columns. |
generate_glm_data¶
generate_glm_data(n: int, beta: ArrayLike, family: FamilyType = 'gaussian', link: LinkType = None, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate GLM data with known parameters.
g(E[y]) = X @ beta g(E[y]) = X @ beta
where g is the link function determined by family/link.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n | int | Number of observations to generate. | required |
beta | ArrayLike | True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1). | required |
family | FamilyType | Response distribution: - “gaussian”: Normal distribution (default) - “binomial”: Binary outcomes (0/1) - “poisson”: Count data | ‘gaussian’ |
link | LinkType | Link function. If None, uses canonical link for family: - gaussian: “identity” - binomial: “logit” - poisson: “log” | None |
sigma | float | Residual SD (only used for gaussian family, default 1.0). | 1.0 |
x_type | Literal[‘gaussian’, ‘uniform’, ‘binary’] | Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1} | ‘gaussian’ |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix). | None |
seed | int | None | Random seed for reproducibility. If None, uses random state. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: polars DataFrame with columns y, x1, x2, ... |
tuple[DataFrame, dict] | - true_params: dict with keys “beta”, “family”, “link”, and “sigma” (for gaussian) |
Examples:
Old API (still works)::
# Logistic regression
data, params = generate_glm_data(
n=200, beta=[0.0, 1.5], family="binomial"
)
data["y"].mean() # Should be ~0.5 (balanced)
# Poisson regression
data, params = generate_glm_data(
n=100, beta=[1.0, 0.5], family="poisson"
)New API with distribution objects::
from bossanova.distributions import normal, uniform
data, params = generate_glm_data(
n=200,
beta=[0.0, 1.5],
family="binomial",
distributions={"x1": uniform(-2, 2)},
)generate_glmer_data¶
generate_glmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, family: FamilyType = 'binomial', re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate GLMM data with known parameters.
g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i b_i ~ N(0, G)
where g is the canonical link function for the family.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_obs | int | Total number of observations. | required |
n_groups | int | Number of groups/clusters. | required |
beta | ArrayLike | Fixed effect coefficients [intercept, slope1, ...]. | required |
theta | ArrayLike | Random effect parameters (see generate_lmer_data for details). | required |
family | FamilyType | Response distribution: - “binomial”: Binary outcomes (0/1) with logit link - “poisson”: Count data with log link | ‘binomial’ |
re_structure | Literal[‘intercept’, ‘slope’, ‘both’] | Random effects structure: - “intercept”: Random intercepts only - “slope”: Random slopes only - “both”: Random intercepts and slopes | ‘intercept’ |
obs_per_group | int | None | Observations per group. If None, uses n_obs // n_groups. | None |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc. | None |
seed | int | None | Random seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: DataFrame with columns y, x1, group |
tuple[DataFrame, dict] | - true_params: dict with beta, theta, family, and ranef |
Examples:
Old API (still works)::
# Binomial GLMM with random intercepts
data, params = generate_glmer_data(
n_obs=400, n_groups=40,
beta=[0.0, 1.0],
theta=[0.5], # SD of random intercepts
family="binomial",
)New API with distribution objects::
from bossanova.distributions import uniform
data, params = generate_glmer_data(
n_obs=400, n_groups=40,
beta=[0.0, 1.0],
theta=[0.5],
family="binomial",
distributions={"x1": uniform(-1, 1)},
)generate_lm_data¶
generate_lm_data(n: int, beta: ArrayLike, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate linear model data with known parameters.
y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon epsilon ~ N(0, sigma^2)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n | int | Number of observations to generate. | required |
beta | ArrayLike | True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1). | required |
sigma | float | Residual standard deviation (default 1.0). | 1.0 |
x_type | Literal[‘gaussian’, ‘uniform’, ‘binary’] | Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1} | ‘gaussian’ |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix). | None |
seed | int | None | Random seed for reproducibility. If None, uses random state. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: polars DataFrame with columns y, x1, x2, ... |
tuple[DataFrame, dict] | - true_params: dict with keys “beta” (array) and “sigma” (float) |
Examples:
Old API (still works)::
data, params = generate_lm_data(n=100, beta=[1.0, 2.0], sigma=0.5)
data.shape
# (100, 2)
params["beta"]
# array([1., 2.])New API with distribution objects::
from bossanova.distributions import normal, categorical
data, params = generate_lm_data(
n=100,
beta=[0.0, 0.5],
distributions={"x1": normal(0, 2)},
)generate_lmer_data¶
generate_lmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, sigma: float = 1.0, re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate linear mixed model data with known parameters.
y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij b_i ~ N(0, G) epsilon_ij ~ N(0, sigma^2)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_obs | int | Total number of observations. If obs_per_group is None, observations are distributed evenly across groups. | required |
n_groups | int | Number of groups/clusters. | required |
beta | ArrayLike | Fixed effect coefficients [intercept, slope1, ...]. | required |
theta | ArrayLike | Random effect parameters. Interpretation depends on re_structure: - “intercept”: [sigma_intercept] - SD of random intercepts - “slope”: [sigma_slope] - SD of random slopes (no random intercept) - “both”: [sigma_intercept, sigma_slope, corr] or [sigma_int, sigma_slope] If length 2, assumes uncorrelated (corr=0). | required |
sigma | float | Residual standard deviation (default 1.0). | 1.0 |
re_structure | Literal[‘intercept’, ‘slope’, ‘both’] | Random effects structure: - “intercept”: Random intercepts only (1 | group) - “slope”: Random slopes only (0 + x1 |
obs_per_group | int | None | Observations per group. If None, uses n_obs // n_groups. | None |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc. | None |
seed | int | None | Random seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: DataFrame with columns y, x1, group |
tuple[DataFrame, dict] | - true_params: dict with beta, theta, sigma, and ranef (actual random effects) |
Examples:
Old API (still works)::
# Random intercepts model
data, params = generate_lmer_data(
n_obs=200, n_groups=20,
beta=[1.0, 0.5],
theta=[0.5], # SD of random intercepts
sigma=1.0,
)New API with distribution objects::
from bossanova.distributions import uniform
data, params = generate_lmer_data(
n_obs=200, n_groups=20,
beta=[1.0, 0.5],
theta=[0.5],
distributions={"x1": uniform(-1, 1)},
)mean_se¶
mean_se(std_errors: ArrayLike) -> floatCompute mean of standard errors across simulations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
std_errors | ArrayLike | Array of SE estimates from simulations. | required |
Returns:
| Type | Description |
|---|---|
float | Mean SE. |
rejection_rate¶
rejection_rate(p_values: ArrayLike, alpha: float = 0.05) -> floatCompute rejection rate (proportion of p-values < alpha).
For null effects (beta=0), this estimates Type I error rate. For non-null effects (beta!=0), this estimates power.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
p_values | ArrayLike | Array of p-values from simulations. | required |
alpha | float | Significance level (default 0.05). | 0.05 |
Returns:
| Type | Description |
|---|---|
float | Rejection rate (0 to 1). |
Examples:
p_values = np.array([0.01, 0.06, 0.03, 0.12, 0.04])
rejection_rate(p_values, alpha=0.05)
# 0.6rmse¶
rmse(estimates: ArrayLike, true_value: float) -> floatCompute root mean squared error.
RMSE = sqrt(E[(beta_hat - beta)^2]) = sqrt(Var(beta_hat) + Bias^2)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
estimates | ArrayLike | Array of parameter estimates from simulations. | required |
true_value | float | True parameter value. | required |
Returns:
| Type | Description |
|---|---|
float | RMSE value. |
Examples:
estimates = np.array([1.1, 0.9, 1.0, 1.0, 1.0])
rmse(estimates, true_value=1.0)
# 0.063...run_monte_carlo¶
run_monte_carlo(dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], dgp_params: dict[str, Any], fit_fn: Callable[[pl.DataFrame], Any], n_sims: int = 1000, seed: int = 42, n_jobs: int = 1, verbose: bool = False) -> MonteCarloResultRun a Monte Carlo simulation study.
Generates data from a DGP, fits models, and collects results for parameter recovery analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dgp_fn | Callable..., [tuple[DataFrame, dict[str, Any]]] | Data generating function that returns (data, true_params). Must accept a seed parameter. | required |
dgp_params | dict[str, Any] | Parameters to pass to dgp_fn (excluding seed). | required |
fit_fn | Callable[[DataFrame], Any] | Function that takes a DataFrame and returns a fitted model. The model must expose a .params property returning a Polars DataFrame with columns matching Col.TERM, Col.ESTIMATE, etc. | required |
n_sims | int | Number of simulations to run (default 1000). | 1000 |
seed | int | Base random seed (default 42). Each simulation uses seed + i. | 42 |
n_jobs | int | Number of parallel jobs. 1 = sequential, >1 = parallel via joblib (default 1). | 1 |
verbose | bool | If True, show tqdm progress bar (default False). | False |
Returns:
| Type | Description |
|---|---|
MonteCarloResult | MonteCarloResult with collected estimates and metrics. |
Examples:
from bossanova import model
from simulation import generate_lm_data
result = run_monte_carlo(
dgp_fn=generate_lm_data,
dgp_params={"n": 100, "beta": [1.0, 2.0], "sigma": 1.0},
fit_fn=lambda d: model("y ~ x1", d).fit().infer(),
n_sims=500,
)
abs(result.bias("x1")) < 0.1
# Truerun_power_analysis¶
run_power_analysis(*, formula: str, family: str, response_var: str, power: int | dict[str, Any], n: int, seed: int | None, coef: dict[str, float] | None, sigma: float, var_specs: dict[str, object]) -> pl.DataFrameRun simulation-based power analysis for a model formula.
Pure-function extraction of the power analysis logic from the model class.
Builds a sweep grid, constructs DGP and fit closures, and delegates to
run_power_study().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
formula | str | R-style model formula string. | required |
family | str | Response distribution family name. | required |
response_var | str | Name of the response variable. | required |
power | int | dict[str, Any] | Power config — int (n_sims) or dict with sweep axes and control params (n_sims, alpha, conf_level, n_jobs, verbose, seed). | required |
n | int | Base sample size. | required |
seed | int | None | Random seed (used as fallback if power config has no seed). | required |
coef | dict[str, float] | None | Base coefficient values (or None for defaults). | required |
sigma | float | Base residual standard deviation. | required |
var_specs | dict[str, object] | Distribution specs for predictors (may include Varying instances for random effects). | required |
Returns:
| Type | Description |
|---|---|
DataFrame | Polars DataFrame with one row per grid point per parameter. |
DataFrame | Columns include sweep axes, power metrics, and diagnostics. |
run_power_study¶
run_power_study(sweep_grid: list[dict[str, Any]], dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], fit_fn: Callable[[pl.DataFrame], Any], n_sims: int, seed: int, alpha: float = 0.05, ci_level: float = 0.95, n_jobs: int = 1, verbose: bool = False) -> pl.DataFrameRun power analysis across a sweep grid.
For each grid point, calls run_monte_carlo() to run n_sims
simulations, then extracts per-parameter metrics (power, coverage,
bias, RMSE, etc.) and adds Wilson confidence intervals on power
estimates. Returns a Polars DataFrame with one row per grid point
per parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sweep_grid | list[dict[str, Any]] | List of grid point dicts from expand_sweep_grid(). Each dict has keys "n", "coef", "sigma", "varying". | required |
dgp_fn | Callable..., [tuple[DataFrame, dict[str, Any]]] | Data generating function that returns (data, true_params). Must accept a seed parameter. | required |
fit_fn | Callable[[DataFrame], Any] | Function that takes a DataFrame and returns a fitted model with a .params property. | required |
n_sims | int | Number of simulations per grid point. | required |
seed | int | Base random seed. Each grid point offsets by i * n_sims. | required |
alpha | float | Significance level for power (default 0.05). | 0.05 |
ci_level | float | Confidence level for Wilson CIs on power (default 0.95). | 0.95 |
n_jobs | int | Number of parallel jobs for run_monte_carlo() (default 1). | 1 |
verbose | bool | Show progress (default False). | False |
Returns:
| Type | Description |
|---|---|
DataFrame | Polars DataFrame with one row per grid point per parameter. |
DataFrame | Columns include sweep axes (n, sigma), parameter info |
DataFrame | (term, true_value), power metrics (power, |
DataFrame | power_ci_lower, power_ci_upper), and diagnostics |
DataFrame | (coverage, bias, rmse, mean_se, empirical_se, |
DataFrame | n_sims, n_failed). |
simulate_responses_from_fit¶
simulate_responses_from_fit(fit: 'FitState', bundle: 'DataBundle', spec: 'ModelSpec', nsim: int, seed: int | None, varying: str = 'fitted') -> pl.DataFrameSimulate new responses from a fitted model.
Uses the fitted coefficients and estimated residual variance to generate new response values. For gaussian models, adds normally distributed errors. For GLMs, generates from the appropriate distribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
fit | ‘FitState’ | Fitted model state with coefficients, fitted values, and sigma. | required |
bundle | ‘DataBundle’ | Data bundle with design matrix and RE metadata. | required |
spec | ‘ModelSpec’ | Model specification with family and link information. | required |
nsim | int | Number of simulations to generate. | required |
seed | int | None | Random seed for reproducibility, or None. | required |
varying | str | How to handle random effects. "fitted" uses estimated BLUPs, "sample" draws new random effects each simulation. | ‘fitted’ |
Returns:
| Type | Description |
|---|---|
DataFrame | DataFrame with columns sim_1, sim_2, ..., sim_nsim. |
Modules¶
dgp¶
Data generating process (DGP) utilities for simulation studies.
Modules:
| Name | Description |
|---|---|
generate | Predictor generation utilities for DGP functions. |
glm | Data generating process for generalized linear models. |
glmer | Data generating process for generalized linear mixed models. |
lm | Data generating process for linear models. |
lmer | Data generating process for linear mixed models. |
Functions:
| Name | Description |
|---|---|
generate_glm_data | Generate GLM data with known parameters. |
generate_glmer_data | Generate GLMM data with known parameters. |
generate_lm_data | Generate linear model data with known parameters. |
generate_lmer_data | Generate linear mixed model data with known parameters. |
Functions¶
generate_glm_data¶
generate_glm_data(n: int, beta: ArrayLike, family: FamilyType = 'gaussian', link: LinkType = None, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate GLM data with known parameters.
g(E[y]) = X @ beta g(E[y]) = X @ beta
where g is the link function determined by family/link.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n | int | Number of observations to generate. | required |
beta | ArrayLike | True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1). | required |
family | FamilyType | Response distribution: - “gaussian”: Normal distribution (default) - “binomial”: Binary outcomes (0/1) - “poisson”: Count data | ‘gaussian’ |
link | LinkType | Link function. If None, uses canonical link for family: - gaussian: “identity” - binomial: “logit” - poisson: “log” | None |
sigma | float | Residual SD (only used for gaussian family, default 1.0). | 1.0 |
x_type | Literal[‘gaussian’, ‘uniform’, ‘binary’] | Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1} | ‘gaussian’ |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix). | None |
seed | int | None | Random seed for reproducibility. If None, uses random state. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: polars DataFrame with columns y, x1, x2, ... |
tuple[DataFrame, dict] | - true_params: dict with keys “beta”, “family”, “link”, and “sigma” (for gaussian) |
Examples:
Old API (still works)::
# Logistic regression
data, params = generate_glm_data(
n=200, beta=[0.0, 1.5], family="binomial"
)
data["y"].mean() # Should be ~0.5 (balanced)
# Poisson regression
data, params = generate_glm_data(
n=100, beta=[1.0, 0.5], family="poisson"
)New API with distribution objects::
from bossanova.distributions import normal, uniform
data, params = generate_glm_data(
n=200,
beta=[0.0, 1.5],
family="binomial",
distributions={"x1": uniform(-2, 2)},
)generate_glmer_data¶
generate_glmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, family: FamilyType = 'binomial', re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate GLMM data with known parameters.
g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i b_i ~ N(0, G)
where g is the canonical link function for the family.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_obs | int | Total number of observations. | required |
n_groups | int | Number of groups/clusters. | required |
beta | ArrayLike | Fixed effect coefficients [intercept, slope1, ...]. | required |
theta | ArrayLike | Random effect parameters (see generate_lmer_data for details). | required |
family | FamilyType | Response distribution: - “binomial”: Binary outcomes (0/1) with logit link - “poisson”: Count data with log link | ‘binomial’ |
re_structure | Literal[‘intercept’, ‘slope’, ‘both’] | Random effects structure: - “intercept”: Random intercepts only - “slope”: Random slopes only - “both”: Random intercepts and slopes | ‘intercept’ |
obs_per_group | int | None | Observations per group. If None, uses n_obs // n_groups. | None |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc. | None |
seed | int | None | Random seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: DataFrame with columns y, x1, group |
tuple[DataFrame, dict] | - true_params: dict with beta, theta, family, and ranef |
Examples:
Old API (still works)::
# Binomial GLMM with random intercepts
data, params = generate_glmer_data(
n_obs=400, n_groups=40,
beta=[0.0, 1.0],
theta=[0.5], # SD of random intercepts
family="binomial",
)New API with distribution objects::
from bossanova.distributions import uniform
data, params = generate_glmer_data(
n_obs=400, n_groups=40,
beta=[0.0, 1.0],
theta=[0.5],
family="binomial",
distributions={"x1": uniform(-1, 1)},
)generate_lm_data¶
generate_lm_data(n: int, beta: ArrayLike, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate linear model data with known parameters.
y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon epsilon ~ N(0, sigma^2)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n | int | Number of observations to generate. | required |
beta | ArrayLike | True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1). | required |
sigma | float | Residual standard deviation (default 1.0). | 1.0 |
x_type | Literal[‘gaussian’, ‘uniform’, ‘binary’] | Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1} | ‘gaussian’ |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix). | None |
seed | int | None | Random seed for reproducibility. If None, uses random state. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: polars DataFrame with columns y, x1, x2, ... |
tuple[DataFrame, dict] | - true_params: dict with keys “beta” (array) and “sigma” (float) |
Examples:
Old API (still works)::
data, params = generate_lm_data(n=100, beta=[1.0, 2.0], sigma=0.5)
data.shape
# (100, 2)
params["beta"]
# array([1., 2.])New API with distribution objects::
from bossanova.distributions import normal, categorical
data, params = generate_lm_data(
n=100,
beta=[0.0, 0.5],
distributions={"x1": normal(0, 2)},
)generate_lmer_data¶
generate_lmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, sigma: float = 1.0, re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate linear mixed model data with known parameters.
y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij b_i ~ N(0, G) epsilon_ij ~ N(0, sigma^2)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_obs | int | Total number of observations. If obs_per_group is None, observations are distributed evenly across groups. | required |
n_groups | int | Number of groups/clusters. | required |
beta | ArrayLike | Fixed effect coefficients [intercept, slope1, ...]. | required |
theta | ArrayLike | Random effect parameters. Interpretation depends on re_structure: - “intercept”: [sigma_intercept] - SD of random intercepts - “slope”: [sigma_slope] - SD of random slopes (no random intercept) - “both”: [sigma_intercept, sigma_slope, corr] or [sigma_int, sigma_slope] If length 2, assumes uncorrelated (corr=0). | required |
sigma | float | Residual standard deviation (default 1.0). | 1.0 |
re_structure | Literal[‘intercept’, ‘slope’, ‘both’] | Random effects structure: - “intercept”: Random intercepts only (1 | group) - “slope”: Random slopes only (0 + x1 |
obs_per_group | int | None | Observations per group. If None, uses n_obs // n_groups. | None |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc. | None |
seed | int | None | Random seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: DataFrame with columns y, x1, group |
tuple[DataFrame, dict] | - true_params: dict with beta, theta, sigma, and ranef (actual random effects) |
Examples:
Old API (still works)::
# Random intercepts model
data, params = generate_lmer_data(
n_obs=200, n_groups=20,
beta=[1.0, 0.5],
theta=[0.5], # SD of random intercepts
sigma=1.0,
)New API with distribution objects::
from bossanova.distributions import uniform
data, params = generate_lmer_data(
n_obs=200, n_groups=20,
beta=[1.0, 0.5],
theta=[0.5],
distributions={"x1": uniform(-1, 1)},
)Modules¶
generate¶
Predictor generation utilities for DGP functions.
Provides helper functions to generate predictor values from Distribution specifications.
Functions:
| Name | Description |
|---|---|
generate_predictors | Generate predictor values from distribution specifications. |
Functions¶
generate_predictors¶
generate_predictors(distributions: dict[str, Any], n: int, rng: np.random.Generator) -> dict[str, np.ndarray]Generate predictor values from distribution specifications.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
distributions | dict[str, Any] | Mapping of variable name to Distribution object. Supported types: Normal, Uniform, Categorical, Exponential, Gamma, T. | required |
n | int | Number of observations. | required |
rng | Generator | NumPy random number generator. | required |
Returns:
| Type | Description |
|---|---|
dict[str, ndarray] | Dict mapping variable names to arrays. |
Examples:
>>> from bossanova.distributions import normal, categorical
>>> import numpy as np
>>> rng = np.random.default_rng(42)
>>> result = generate_predictors(
... {"x": normal(0, 2), "group": categorical(["A", "B"])},
... n=10,
... rng=rng,
... )
>>> result["x"].shape
(10,)
>>> set(result["group"]).issubset({"A", "B"})
Trueglm¶
Data generating process for generalized linear models.
Functions:
| Name | Description |
|---|---|
generate_glm_data | Generate GLM data with known parameters. |
Attributes¶
Classes¶
Functions¶
generate_glm_data¶
generate_glm_data(n: int, beta: ArrayLike, family: FamilyType = 'gaussian', link: LinkType = None, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate GLM data with known parameters.
g(E[y]) = X @ beta g(E[y]) = X @ beta
where g is the link function determined by family/link.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n | int | Number of observations to generate. | required |
beta | ArrayLike | True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1). | required |
family | FamilyType | Response distribution: - “gaussian”: Normal distribution (default) - “binomial”: Binary outcomes (0/1) - “poisson”: Count data | ‘gaussian’ |
link | LinkType | Link function. If None, uses canonical link for family: - gaussian: “identity” - binomial: “logit” - poisson: “log” | None |
sigma | float | Residual SD (only used for gaussian family, default 1.0). | 1.0 |
x_type | Literal[‘gaussian’, ‘uniform’, ‘binary’] | Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1} | ‘gaussian’ |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix). | None |
seed | int | None | Random seed for reproducibility. If None, uses random state. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: polars DataFrame with columns y, x1, x2, ... |
tuple[DataFrame, dict] | - true_params: dict with keys “beta”, “family”, “link”, and “sigma” (for gaussian) |
Examples:
Old API (still works)::
# Logistic regression
data, params = generate_glm_data(
n=200, beta=[0.0, 1.5], family="binomial"
)
data["y"].mean() # Should be ~0.5 (balanced)
# Poisson regression
data, params = generate_glm_data(
n=100, beta=[1.0, 0.5], family="poisson"
)New API with distribution objects::
from bossanova.distributions import normal, uniform
data, params = generate_glm_data(
n=200,
beta=[0.0, 1.5],
family="binomial",
distributions={"x1": uniform(-2, 2)},
)glmer¶
Data generating process for generalized linear mixed models.
Functions:
| Name | Description |
|---|---|
generate_glmer_data | Generate GLMM data with known parameters. |
Attributes¶
Classes¶
Functions¶
generate_glmer_data¶
generate_glmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, family: FamilyType = 'binomial', re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate GLMM data with known parameters.
g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i g(E[y_ij|b_i]) = X_ij @ beta + Z_ij @ b_i b_i ~ N(0, G)
where g is the canonical link function for the family.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_obs | int | Total number of observations. | required |
n_groups | int | Number of groups/clusters. | required |
beta | ArrayLike | Fixed effect coefficients [intercept, slope1, ...]. | required |
theta | ArrayLike | Random effect parameters (see generate_lmer_data for details). | required |
family | FamilyType | Response distribution: - “binomial”: Binary outcomes (0/1) with logit link - “poisson”: Count data with log link | ‘binomial’ |
re_structure | Literal[‘intercept’, ‘slope’, ‘both’] | Random effects structure: - “intercept”: Random intercepts only - “slope”: Random slopes only - “both”: Random intercepts and slopes | ‘intercept’ |
obs_per_group | int | None | Observations per group. If None, uses n_obs // n_groups. | None |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc. | None |
seed | int | None | Random seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: DataFrame with columns y, x1, group |
tuple[DataFrame, dict] | - true_params: dict with beta, theta, family, and ranef |
Examples:
Old API (still works)::
# Binomial GLMM with random intercepts
data, params = generate_glmer_data(
n_obs=400, n_groups=40,
beta=[0.0, 1.0],
theta=[0.5], # SD of random intercepts
family="binomial",
)New API with distribution objects::
from bossanova.distributions import uniform
data, params = generate_glmer_data(
n_obs=400, n_groups=40,
beta=[0.0, 1.0],
theta=[0.5],
family="binomial",
distributions={"x1": uniform(-1, 1)},
)lm¶
Data generating process for linear models.
Functions:
| Name | Description |
|---|---|
generate_lm_data | Generate linear model data with known parameters. |
Attributes¶
Classes¶
Functions¶
generate_lm_data¶
generate_lm_data(n: int, beta: ArrayLike, sigma: float = 1.0, x_type: Literal['gaussian', 'uniform', 'binary'] = 'gaussian', distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate linear model data with known parameters.
y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon y = intercept + beta[1]*x1 + beta[2]*x2 + ... + epsilon epsilon ~ N(0, sigma^2)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n | int | Number of observations to generate. | required |
beta | ArrayLike | True coefficients as [intercept, slope1, slope2, ...]. Length determines number of predictors (len(beta) - 1). | required |
sigma | float | Residual standard deviation (default 1.0). | 1.0 |
x_type | Literal[‘gaussian’, ‘uniform’, ‘binary’] | Distribution for predictors: - “gaussian”: Standard normal N(0, 1) - “uniform”: Uniform on [0, 1] - “binary”: Bernoulli(0.5), values in {0, 1} | ‘gaussian’ |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override x_type for the specified variables. Variable names should match x1, x2, etc. Categorical distributions are stored as separate columns (not in X matrix). | None |
seed | int | None | Random seed for reproducibility. If None, uses random state. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: polars DataFrame with columns y, x1, x2, ... |
tuple[DataFrame, dict] | - true_params: dict with keys “beta” (array) and “sigma” (float) |
Examples:
Old API (still works)::
data, params = generate_lm_data(n=100, beta=[1.0, 2.0], sigma=0.5)
data.shape
# (100, 2)
params["beta"]
# array([1., 2.])New API with distribution objects::
from bossanova.distributions import normal, categorical
data, params = generate_lm_data(
n=100,
beta=[0.0, 0.5],
distributions={"x1": normal(0, 2)},
)lmer¶
Data generating process for linear mixed models.
Functions:
| Name | Description |
|---|---|
generate_lmer_data | Generate linear mixed model data with known parameters. |
Attributes¶
Classes¶
Functions¶
generate_lmer_data¶
generate_lmer_data(n_obs: int, n_groups: int, beta: ArrayLike, theta: ArrayLike, sigma: float = 1.0, re_structure: Literal['intercept', 'slope', 'both'] = 'intercept', obs_per_group: int | None = None, distributions: dict[str, Distribution] | None = None, seed: int | None = None) -> tuple[pl.DataFrame, dict]Generate linear mixed model data with known parameters.
y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij y_ij = X_ij @ beta + Z_ij @ b_i + epsilon_ij b_i ~ N(0, G) epsilon_ij ~ N(0, sigma^2)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_obs | int | Total number of observations. If obs_per_group is None, observations are distributed evenly across groups. | required |
n_groups | int | Number of groups/clusters. | required |
beta | ArrayLike | Fixed effect coefficients [intercept, slope1, ...]. | required |
theta | ArrayLike | Random effect parameters. Interpretation depends on re_structure: - “intercept”: [sigma_intercept] - SD of random intercepts - “slope”: [sigma_slope] - SD of random slopes (no random intercept) - “both”: [sigma_intercept, sigma_slope, corr] or [sigma_int, sigma_slope] If length 2, assumes uncorrelated (corr=0). | required |
sigma | float | Residual standard deviation (default 1.0). | 1.0 |
re_structure | Literal[‘intercept’, ‘slope’, ‘both’] | Random effects structure: - “intercept”: Random intercepts only (1 | group) - “slope”: Random slopes only (0 + x1 |
obs_per_group | int | None | Observations per group. If None, uses n_obs // n_groups. | None |
distributions | dict[str, Distribution] | None | Mapping of variable names to Distribution objects. When provided, these override standard normal for the specified variables. Variable names should match x1, x2, etc. | None |
seed | int | None | Random seed for reproducibility. | None |
Returns:
| Type | Description |
|---|---|
DataFrame | A tuple of (data, true_params) where: |
dict | - data: DataFrame with columns y, x1, group |
tuple[DataFrame, dict] | - true_params: dict with beta, theta, sigma, and ranef (actual random effects) |
Examples:
Old API (still works)::
# Random intercepts model
data, params = generate_lmer_data(
n_obs=200, n_groups=20,
beta=[1.0, 0.5],
theta=[0.5], # SD of random intercepts
sigma=1.0,
)New API with distribution objects::
from bossanova.distributions import uniform
data, params = generate_lmer_data(
n_obs=200, n_groups=20,
beta=[1.0, 0.5],
theta=[0.5],
distributions={"x1": uniform(-1, 1)},
)harness¶
Monte Carlo simulation harness for parameter recovery studies.
Provides infrastructure for running simulation studies and collecting results for bias, coverage, and power analysis.
Classes:
| Name | Description |
|---|---|
MonteCarloResult | Results from a Monte Carlo simulation study. |
Functions:
| Name | Description |
|---|---|
compute_mc_iteration | Execute a single Monte Carlo iteration. |
run_monte_carlo | Run a Monte Carlo simulation study. |
Classes¶
MonteCarloResult¶
Results from a Monte Carlo simulation study.
Stores estimates, standard errors, confidence intervals, and p-values from repeated simulations, along with ground truth parameters.
Created by: run_monte_carlo() Consumed by: User analysis code (bias, coverage, power methods) Augmented by: Never (immutable result container)
Attributes:
| Name | Type | Description |
|---|---|---|
estimates | ndarray | Array of shape (n_sims, n_params) with coefficient estimates. |
std_errors | ndarray | Array of shape (n_sims, n_params) with standard errors. |
ci_lower | ndarray | Array of shape (n_sims, n_params) with CI lower bounds. |
ci_upper | ndarray | Array of shape (n_sims, n_params) with CI upper bounds. |
p_values | ndarray | Array of shape (n_sims, n_params) with p-values. |
true_values | dict[str, float] | Named mapping of true parameter values (e.g., {“Intercept”: 0.0, “x”: 0.5}). |
param_names | tuple[str, ...] | Tuple of parameter names matching array columns. |
n_sims | int | Number of successful simulations. |
n_failed | int | Number of simulations that failed to converge. |
Examples:
result = run_monte_carlo(
dgp_fn=generate_lm_data,
dgp_params={"n": 100, "beta": [1.0, 2.0], "sigma": 1.0},
fit_fn=lambda d: model("y ~ x1", data=d).fit(),
n_sims=500,
)
result.bias("x1") # Should be close to 0
result.coverage("x1") # Should be close to 0.95Functions:
| Name | Description |
|---|---|
bias | Compute bias for a parameter: E[beta_hat] - beta_true. |
coverage | Compute coverage probability for a parameter. |
empirical_se | Compute empirical SE (SD of estimates) for a parameter. |
get_idx | Get array index for a parameter. |
get_true_value | Get true value for a parameter. |
mean_se | Compute mean estimated standard error for a parameter. |
power | Compute power for a parameter. |
rmse | Compute RMSE for a parameter. |
summary | Return summary DataFrame with all metrics for all parameters. |
type1_rate | Compute Type I error rate for a parameter. |
Attributes¶
ci_lower¶
ci_lower: np.ndarray = field(validator=is_ndarray)ci_upper¶
ci_upper: np.ndarray = field(validator=is_ndarray)estimates¶
estimates: np.ndarray = field(validator=is_ndarray)n_failed¶
n_failed: int = field(default=0)n_sims¶
n_sims: int = field(default=0)p_values¶
p_values: np.ndarray = field(validator=is_ndarray)param_names¶
param_names: tuple[str, ...] = field(converter=tuple, factory=tuple)std_errors¶
std_errors: np.ndarray = field(validator=is_ndarray)true_values¶
true_values: dict[str, float] = field(factory=dict)Functions¶
bias¶
bias(param: str | int) -> floatCompute bias for a parameter: E[beta_hat] - beta_true.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | Bias value. |
coverage¶
coverage(param: str | int, level: float = 0.95) -> floatCompute coverage probability for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
level | float | Confidence level (default 0.95). Currently only used for documentation; actual CIs come from the model. | 0.95 |
Returns:
| Type | Description |
|---|---|
float | Coverage probability (0 to 1). |
empirical_se¶
empirical_se(param: str | int) -> floatCompute empirical SE (SD of estimates) for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | Empirical SE. |
get_idx¶
get_idx(param: str | int) -> intGet array index for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
int | Integer index into the parameter arrays. |
get_true_value¶
get_true_value(param: str | int) -> floatGet true value for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | True parameter value. |
mean_se¶
mean_se(param: str | int) -> floatCompute mean estimated standard error for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | Mean SE across simulations. |
power¶
power(param: str | int, alpha: float = 0.05) -> floatCompute power for a parameter.
This is the rejection rate when the true effect is non-zero. Use this for parameters where true_value != 0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
alpha | float | Significance level (default 0.05). | 0.05 |
Returns:
| Type | Description |
|---|---|
float | Power (probability of correctly rejecting H0). |
rmse¶
rmse(param: str | int) -> floatCompute RMSE for a parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
Returns:
| Type | Description |
|---|---|
float | Root mean squared error. |
summary¶
summary() -> pl.DataFrameReturn summary DataFrame with all metrics for all parameters.
Returns:
| Type | Description |
|---|---|
DataFrame | DataFrame with columns: param, true_value, mean_est, bias, |
DataFrame | rmse, mean_se, empirical_se, coverage, rejection_rate. |
type1_rate¶
type1_rate(param: str | int, alpha: float = 0.05) -> floatCompute Type I error rate for a parameter.
This is the rejection rate when the true effect is zero. Use this for parameters where true_value = 0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
param | str | int | Parameter name or index. | required |
alpha | float | Significance level (default 0.05). | 0.05 |
Returns:
| Type | Description |
|---|---|
float | Type I error rate (should be ~ alpha if correctly calibrated). |
Functions¶
compute_mc_iteration¶
compute_mc_iteration(seed: int, dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], dgp_params: dict[str, Any], fit_fn: Callable[[pl.DataFrame], Any]) -> dict[str, Any] | NoneExecute a single Monte Carlo iteration.
Standalone function (not a closure) for joblib serialization compatibility. Returns None if fitting fails.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
seed | int | Random seed for this iteration. | required |
dgp_fn | Callable..., [tuple[DataFrame, dict[str, Any]]] | Data generating function returning (data, true_params). | required |
dgp_params | dict[str, Any] | Parameters to pass to dgp_fn (excluding seed). | required |
fit_fn | Callable[[DataFrame], Any] | Function that takes a DataFrame and returns a fitted model. | required |
Returns:
| Type | Description |
|---|---|
dict[str, Any] | None | Dictionary with ‘params_df’ and ‘true_params’ on success, or |
dict[str, Any] | None | None if the model fit fails. |
run_monte_carlo¶
run_monte_carlo(dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], dgp_params: dict[str, Any], fit_fn: Callable[[pl.DataFrame], Any], n_sims: int = 1000, seed: int = 42, n_jobs: int = 1, verbose: bool = False) -> MonteCarloResultRun a Monte Carlo simulation study.
Generates data from a DGP, fits models, and collects results for parameter recovery analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dgp_fn | Callable..., [tuple[DataFrame, dict[str, Any]]] | Data generating function that returns (data, true_params). Must accept a seed parameter. | required |
dgp_params | dict[str, Any] | Parameters to pass to dgp_fn (excluding seed). | required |
fit_fn | Callable[[DataFrame], Any] | Function that takes a DataFrame and returns a fitted model. The model must expose a .params property returning a Polars DataFrame with columns matching Col.TERM, Col.ESTIMATE, etc. | required |
n_sims | int | Number of simulations to run (default 1000). | 1000 |
seed | int | Base random seed (default 42). Each simulation uses seed + i. | 42 |
n_jobs | int | Number of parallel jobs. 1 = sequential, >1 = parallel via joblib (default 1). | 1 |
verbose | bool | If True, show tqdm progress bar (default False). | False |
Returns:
| Type | Description |
|---|---|
MonteCarloResult | MonteCarloResult with collected estimates and metrics. |
Examples:
from bossanova import model
from simulation import generate_lm_data
result = run_monte_carlo(
dgp_fn=generate_lm_data,
dgp_params={"n": 100, "beta": [1.0, 2.0], "sigma": 1.0},
fit_fn=lambda d: model("y ~ x1", d).fit().infer(),
n_sims=500,
)
abs(result.bias("x1")) < 0.1
# TrueModules¶
lifecycle¶
Simulate lifecycle orchestration.
Owns the three-branch simulate dispatch: power analysis, post-fit simulation,
and pre-fit data generation. Called by model.simulate() so the model class
stays a thin facade.
Classes:
| Name | Description |
|---|---|
SimulateResult | Immutable result of the simulate lifecycle. |
Functions:
| Name | Description |
|---|---|
execute_simulate | Execute simulation: power analysis, post-fit sampling, or pre-fit generation. |
Classes¶
SimulateResult¶
Immutable result of the simulate lifecycle.
Attributes:
| Name | Type | Description |
|---|---|---|
mode | str | One of "power", "simulate", or "generate". |
simulations | DataFrame | None | Simulated response DataFrame (post-fit mode). |
generated_data | DataFrame | None | Generated DataFrame (pre-fit mode). |
power | MonteCarloResult | None | Monte Carlo power result (power mode). |
Attributes¶
generated_data¶
generated_data: pl.DataFrame | None = Nonemode¶
mode: strpower¶
power: MonteCarloResult | None = Nonesimulations¶
simulations: pl.DataFrame | None = NoneFunctions¶
execute_simulate¶
execute_simulate(spec: ModelSpec, bundle: DataBundle | None, fit: FitState | None, data: pl.DataFrame | None, formula: str, is_mixed: bool, n: int | None, nsim: int | None, seed: int | None, coef: dict[str, float] | None, sigma: float, varying: str, power: int | dict[str, Any] | None, var_specs: dict) -> SimulateResultExecute simulation: power analysis, post-fit sampling, or pre-fit generation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
spec | ModelSpec | Model specification. | required |
bundle | DataBundle | None | Data bundle (None if pre-fit). | required |
fit | FitState | None | Fitted state (None if pre-fit). | required |
data | DataFrame | None | Current data (None if simulation-first). | required |
formula | str | Formula string. | required |
is_mixed | bool | Whether this is a mixed-effects model. | required |
n | int | None | Sample size for pre-fit generation or power analysis. | required |
nsim | int | None | Number of simulated response vectors (post-fit). | required |
seed | int | None | Random seed. | required |
coef | dict[str, float] | None | True coefficient values (pre-fit mode). | required |
sigma | float | Residual SD (pre-fit gaussian). | required |
varying | str | RE handling in post-fit ("fitted" or "sample"). | required |
power | int | dict[str, Any] | None | Power analysis config (int or dict), or None. | required |
var_specs | dict | Distribution specs for predictors. | required |
Returns:
| Type | Description |
|---|---|
SimulateResult | SimulateResult with populated fields for the chosen mode. |
metrics¶
Metrics for Monte Carlo simulation studies.
Functions:
| Name | Description |
|---|---|
bias | Compute bias: E[beta_hat] - beta_true. |
compute_wilson_ci | Wilson score confidence interval for a binomial proportion. |
coverage | Compute coverage probability. |
empirical_se | Compute empirical standard error (SD of estimates). |
mean_se | Compute mean of standard errors across simulations. |
rejection_rate | Compute rejection rate (proportion of p-values < alpha). |
rmse | Compute root mean squared error. |
Functions¶
bias¶
bias(estimates: ArrayLike, true_value: float) -> floatCompute bias: E[beta_hat] - beta_true.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
estimates | ArrayLike | Array of parameter estimates from simulations. | required |
true_value | float | True parameter value. | required |
Returns:
| Type | Description |
|---|---|
float | Bias (positive = overestimate, negative = underestimate). |
Examples:
estimates = np.array([1.02, 0.98, 1.01, 0.99, 1.00])
bias(estimates, true_value=1.0)
# 0.0compute_wilson_ci¶
compute_wilson_ci(p_hat: float, n: int, level: float = 0.95) -> tuple[float, float]Wilson score confidence interval for a binomial proportion.
Better coverage than the Wald interval near p=0 or p=1.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
p_hat | float | Observed proportion (0 to 1). | required |
n | int | Number of trials. | required |
level | float | Confidence level (default 0.95). | 0.95 |
Returns:
| Type | Description |
|---|---|
tuple[float, float] | Tuple of (lower, upper) bounds. |
Examples:
compute_wilson_ci(0.8, n=100, level=0.95)
# (0.711..., 0.868...)coverage¶
coverage(ci_lower: ArrayLike, ci_upper: ArrayLike, true_value: float) -> floatCompute coverage probability.
Coverage = P(CI_lower <= beta_true <= CI_upper)
For a correctly calibrated 95% CI, coverage should be ~0.95.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ci_lower | ArrayLike | Array of CI lower bounds from simulations. | required |
ci_upper | ArrayLike | Array of CI upper bounds from simulations. | required |
true_value | float | True parameter value. | required |
Returns:
| Type | Description |
|---|---|
float | Coverage probability (0 to 1). |
Examples:
ci_lower = np.array([0.8, 0.9, 0.85, 0.95, 0.88])
ci_upper = np.array([1.2, 1.1, 1.15, 1.05, 1.12])
coverage(ci_lower, ci_upper, true_value=1.0)
# 1.0empirical_se¶
empirical_se(estimates: ArrayLike) -> floatCompute empirical standard error (SD of estimates).
This is the “true” standard error as estimated from the simulation distribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
estimates | ArrayLike | Array of parameter estimates from simulations. | required |
Returns:
| Type | Description |
|---|---|
float | Empirical SE (standard deviation of estimates). |
mean_se¶
mean_se(std_errors: ArrayLike) -> floatCompute mean of standard errors across simulations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
std_errors | ArrayLike | Array of SE estimates from simulations. | required |
Returns:
| Type | Description |
|---|---|
float | Mean SE. |
rejection_rate¶
rejection_rate(p_values: ArrayLike, alpha: float = 0.05) -> floatCompute rejection rate (proportion of p-values < alpha).
For null effects (beta=0), this estimates Type I error rate. For non-null effects (beta!=0), this estimates power.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
p_values | ArrayLike | Array of p-values from simulations. | required |
alpha | float | Significance level (default 0.05). | 0.05 |
Returns:
| Type | Description |
|---|---|
float | Rejection rate (0 to 1). |
Examples:
p_values = np.array([0.01, 0.06, 0.03, 0.12, 0.04])
rejection_rate(p_values, alpha=0.05)
# 0.6rmse¶
rmse(estimates: ArrayLike, true_value: float) -> floatCompute root mean squared error.
RMSE = sqrt(E[(beta_hat - beta)^2]) = sqrt(Var(beta_hat) + Bias^2)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
estimates | ArrayLike | Array of parameter estimates from simulations. | required |
true_value | float | True parameter value. | required |
Returns:
| Type | Description |
|---|---|
float | RMSE value. |
Examples:
estimates = np.array([1.1, 0.9, 1.0, 1.0, 1.0])
rmse(estimates, true_value=1.0)
# 0.063...model_sim¶
Model-level simulation operations.
Provides pre-fit data generation (generate_data_from_spec) and post-fit
response simulation (simulate_responses_from_fit) as pure functions on
containers. Extracted from model/core.py to enforce the “model is glue”
boundary.
Functions:
| Name | Description |
|---|---|
compute_mu_with_new_re | Compute conditional mean with newly sampled random effects. |
generate_data_from_spec | Generate a synthetic dataset from a simulation specification. |
simulate_responses_from_fit | Simulate new responses from a fitted model. |
Classes¶
Functions¶
compute_mu_with_new_re¶
compute_mu_with_new_re(bundle: 'DataBundle', fit: 'FitState', spec: 'ModelSpec', rng: np.random.Generator) -> np.ndarrayCompute conditional mean with newly sampled random effects.
Draws u_new ~ N(0, I), transforms via the relative covariance factor
Lambda to get new BLUPs, then computes mu = X @ coef + Z @ (Lambda @ u_new * sigma).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bundle | ‘DataBundle’ | Data bundle with X, Z matrices and RE metadata. | required |
fit | ‘FitState’ | Fit state with theta, sigma, and coefficients. | required |
spec | ‘ModelSpec’ | Model specification with family and link info. | required |
rng | Generator | NumPy random generator for reproducibility. | required |
Returns:
| Type | Description |
|---|---|
ndarray | New mu values on the response scale, shape (n,). |
generate_data_from_spec¶
generate_data_from_spec(sim_spec: 'SimulationSpec', family: str, response_var: str) -> pl.DataFrameGenerate a synthetic dataset from a simulation specification.
Implements the data-generating process (DGP): samples predictors from their distributions, constructs the design matrix, computes the linear predictor, and draws response values from the appropriate family.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sim_spec | ‘SimulationSpec’ | Simulation specification with sample size, distributions, coefficients, and random effect structure. | required |
family | str | GLM family name (e.g. “gaussian”, “binomial”, “poisson”). | required |
response_var | str | Name for the response column in the output DataFrame. | required |
Returns:
| Type | Description |
|---|---|
DataFrame | Polars DataFrame with generated predictor and response columns. |
simulate_responses_from_fit¶
simulate_responses_from_fit(fit: 'FitState', bundle: 'DataBundle', spec: 'ModelSpec', nsim: int, seed: int | None, varying: str = 'fitted') -> pl.DataFrameSimulate new responses from a fitted model.
Uses the fitted coefficients and estimated residual variance to generate new response values. For gaussian models, adds normally distributed errors. For GLMs, generates from the appropriate distribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
fit | ‘FitState’ | Fitted model state with coefficients, fitted values, and sigma. | required |
bundle | ‘DataBundle’ | Data bundle with design matrix and RE metadata. | required |
spec | ‘ModelSpec’ | Model specification with family and link information. | required |
nsim | int | Number of simulations to generate. | required |
seed | int | None | Random seed for reproducibility, or None. | required |
varying | str | How to handle random effects. "fitted" uses estimated BLUPs, "sample" draws new random effects each simulation. | ‘fitted’ |
Returns:
| Type | Description |
|---|---|
DataFrame | DataFrame with columns sim_1, sim_2, ..., sim_nsim. |
power¶
Power analysis via simulation sweep grids.
Provides utilities for constructing full-factorial parameter grids and
running simulation-based power analyses across those grids. Each grid
point triggers a Monte Carlo study via run_monte_carlo(), and the
results are aggregated into a Polars DataFrame with per-parameter
power, coverage, bias, RMSE, and other diagnostics.
Functions:
| Name | Description |
|---|---|
expand_sweep_grid | Full factorial grid from base DGP + power sweep overrides. |
run_power_analysis | Run simulation-based power analysis for a model formula. |
run_power_study | Run power analysis across a sweep grid. |
Classes¶
Functions¶
expand_sweep_grid¶
expand_sweep_grid(base_n: int, base_coef: dict[str, float], base_sigma: float, base_varying: dict[str, dict[str, Any]], power_config: dict[str, Any]) -> list[dict[str, Any]]Full factorial grid from base DGP + power sweep overrides.
Uses itertools.product for Cartesian expansion. Handles nested
coef dict merging (e.g., sweeping one coefficient while keeping
others fixed).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_n | int | Base sample size. | required |
base_coef | dict[str, float] | Base coefficient values (e.g., {"Intercept": 0.0, "x": 0.5}). | required |
base_sigma | float | Base residual standard deviation. | required |
base_varying | dict[str, dict[str, Any]] | Base random effect specs (empty dict for fixed-effects models). | required |
power_config | dict[str, Any] | Sweep configuration dict. Keys are sweep axes ("n", "coef", "sigma", "varying"). Non-sweep control keys ("n_sims", "alpha", "conf_level", "n_jobs", "verbose", "seed") are ignored. | required |
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]] | List of grid point dicts, each with keys: "n", "coef", |
list[dict[str, Any]] | "sigma", "varying". |
Examples:
>>> grid = expand_sweep_grid(
... base_n=100, base_coef={"x": 0.5}, base_sigma=1.0,
... base_varying={},
... power_config={"n": [50, 100, 200]},
... )
>>> len(grid)
3
>>> grid[0]["n"]
50run_power_analysis¶
run_power_analysis(*, formula: str, family: str, response_var: str, power: int | dict[str, Any], n: int, seed: int | None, coef: dict[str, float] | None, sigma: float, var_specs: dict[str, object]) -> pl.DataFrameRun simulation-based power analysis for a model formula.
Pure-function extraction of the power analysis logic from the model class.
Builds a sweep grid, constructs DGP and fit closures, and delegates to
run_power_study().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
formula | str | R-style model formula string. | required |
family | str | Response distribution family name. | required |
response_var | str | Name of the response variable. | required |
power | int | dict[str, Any] | Power config — int (n_sims) or dict with sweep axes and control params (n_sims, alpha, conf_level, n_jobs, verbose, seed). | required |
n | int | Base sample size. | required |
seed | int | None | Random seed (used as fallback if power config has no seed). | required |
coef | dict[str, float] | None | Base coefficient values (or None for defaults). | required |
sigma | float | Base residual standard deviation. | required |
var_specs | dict[str, object] | Distribution specs for predictors (may include Varying instances for random effects). | required |
Returns:
| Type | Description |
|---|---|
DataFrame | Polars DataFrame with one row per grid point per parameter. |
DataFrame | Columns include sweep axes, power metrics, and diagnostics. |
run_power_study¶
run_power_study(sweep_grid: list[dict[str, Any]], dgp_fn: Callable[..., tuple[pl.DataFrame, dict[str, Any]]], fit_fn: Callable[[pl.DataFrame], Any], n_sims: int, seed: int, alpha: float = 0.05, ci_level: float = 0.95, n_jobs: int = 1, verbose: bool = False) -> pl.DataFrameRun power analysis across a sweep grid.
For each grid point, calls run_monte_carlo() to run n_sims
simulations, then extracts per-parameter metrics (power, coverage,
bias, RMSE, etc.) and adds Wilson confidence intervals on power
estimates. Returns a Polars DataFrame with one row per grid point
per parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sweep_grid | list[dict[str, Any]] | List of grid point dicts from expand_sweep_grid(). Each dict has keys "n", "coef", "sigma", "varying". | required |
dgp_fn | Callable..., [tuple[DataFrame, dict[str, Any]]] | Data generating function that returns (data, true_params). Must accept a seed parameter. | required |
fit_fn | Callable[[DataFrame], Any] | Function that takes a DataFrame and returns a fitted model with a .params property. | required |
n_sims | int | Number of simulations per grid point. | required |
seed | int | Base random seed. Each grid point offsets by i * n_sims. | required |
alpha | float | Significance level for power (default 0.05). | 0.05 |
ci_level | float | Confidence level for Wilson CIs on power (default 0.95). | 0.95 |
n_jobs | int | Number of parallel jobs for run_monte_carlo() (default 1). | 1 |
verbose | bool | Show progress (default False). | False |
Returns:
| Type | Description |
|---|---|
DataFrame | Polars DataFrame with one row per grid point per parameter. |
DataFrame | Columns include sweep axes (n, sigma), parameter info |
DataFrame | (term, true_value), power metrics (power, |
DataFrame | power_ci_lower, power_ci_upper), and diagnostics |
DataFrame | (coverage, bias, rmse, mean_se, empirical_se, |
DataFrame | n_sims, n_failed). |