design - bossanova

Design matrix construction — coding, naming, reference grids, random effects.

Call chain:

formula.build_design_matrices() -> treatment_coding() / sum_coding() / ... (categorical columns)
marginal.build_reference_grid() -> build_reference_row() (EMM reference grids)
formula.build_random_effects_from_spec() -> build_z_simple() / build_z_nested() / build_z_crossed()

Classes:

Name	Description
`DesignColumnInfo`	Parsed design matrix column metadata.
`RandomEffectsInfo`	Complete random effects specification for lmer/glmer.

Functions:

Name	Description
`array_to_coding_matrix`	Convert user-specified contrasts to a coding matrix for design matrices.
`build_random_effects`	Build complete random effects specification.
`build_reference_design_matrix`	Build design matrix for reference grid points.
`build_reference_row`	Build a single row of the reference design matrix.
`build_slope_reference_matrix`	Build reference matrices for computing marginal slopes.
`build_z_crossed`	Build Z matrix for crossed random effects.
`build_z_nested`	Build Z matrix for nested random effects.
`build_z_simple`	Build Z matrix for single grouping factor.
`convert_coding_to_hypothesis`	Convert a coding matrix back to interpretable hypothesis contrasts.
`extract_base_term`	Extract base term name from column name.
`extract_categorical_variables`	Find all categorical base variable names from design matrix columns.
`extract_level_from_column`	Extract level value for a specific focal variable from column name.
`helmert_coding`	Build Helmert contrast matrix.
`helmert_coding_labels`	Get column labels for Helmert contrast.
`identify_column_type`	Identify column type from name (simplified version).
`parse_design_column_name`	Parse design matrix column name into components.
`poly_coding`	Build orthogonal polynomial contrast matrix.
`poly_coding_labels`	Get column labels for polynomial contrast.
`sequential_coding`	Build sequential (successive differences) contrast matrix.
`sequential_coding_labels`	Get column labels for sequential contrast.
`sum_coding`	Build sum (effects) contrast matrix.
`sum_coding_labels`	Get column labels for sum contrast.
`treatment_coding`	Build treatment (dummy) contrast matrix.
`treatment_coding_labels`	Get column labels for treatment contrast.

Modules:

Name	Description
`coding`	Contrast matrix builders for categorical variable encoding.
`names`	Design matrix column name parsing and variable type detection.
`reference`	Reference design matrix (X_ref) construction for marginal effects.
`z_matrix`	Sparse Z matrix (random effects design matrix) construction.

Classes¶

DesignColumnInfo¶

DesignColumnInfo(raw_name: str, base_term: str, level: str | None, column_type: Literal['intercept', 'continuous', 'categorical'], is_interaction: bool = False) -> None

Parsed design matrix column metadata.

Attributes:

Name	Type	Description
`raw_name`	`str`	Original column name (e.g., “treatment[A]”).
`base_term`	`str`	Base variable name without level (e.g., “treatment”).
`level`	`str \| None`	Level value for categorical, None for continuous (e.g., “A”).
`column_type`	`Literal['intercept', 'continuous', 'categorical']`	Type classification.
`is_interaction`	`bool`	Whether this is an interaction term.

Attributes¶

base_term¶

base_term: str

column_type¶

column_type: Literal['intercept', 'continuous', 'categorical']

is_interaction¶

is_interaction: bool = False

level¶

level: str | None

raw_name¶

raw_name: str

RandomEffectsInfo¶

RandomEffectsInfo(Z: sp.csc_matrix, group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, re_structures_list: list[str] | None = None, re_dims_list: list[int] | None = None, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, column_labels: list[str] = list(), term_permutation: NDArray[np.intp] | None = None) -> None

Complete random effects specification for lmer/glmer.

This container holds the Z matrix and all metadata needed for downstream operations (Lambda building, initialization, results).

Attributes:

Name	Type	Description
`Z`	`csc_matrix`	Sparse random effects design matrix, shape (n, q).
`group_ids_list`	`list[NDArray[intp]]`	Group ID arrays for each factor.
`n_groups_list`	`list[int]`	Number of groups per factor.
`group_names`	`list[str]`	Names of grouping factors.
`random_names`	`list[str]`	Names of random effect terms.
`re_structure`	`str`	Overall structure type (intercept/slope/diagonal/nested/crossed).
`re_structures_list`	`list[str] \| None`	Per-factor structure types (for mixed).
`X_re`	`NDArray[float64] \| list[NDArray[float64]] \| None`	Random effects covariates (for slopes).
`column_labels`	`list[str]`	Z column names for output.
`term_permutation`	`NDArray[intp] \| None`	Block ordering permutation indices.

Attributes¶

X_re¶

X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None

Z¶

Z: sp.csc_matrix

column_labels¶

column_labels: list[str] = field(default_factory=list)

group_ids_list¶

group_ids_list: list[NDArray[np.intp]]

group_names¶

group_names: list[str]

n_groups_list¶

n_groups_list: list[int]

random_names¶

random_names: list[str]

re_dims_list¶

re_dims_list: list[int] | None = None

re_structure¶

re_structure: str

re_structures_list¶

re_structures_list: list[str] | None = None

term_permutation¶

term_permutation: NDArray[np.intp] | None = None

Functions¶

array_to_coding_matrix¶

array_to_coding_matrix(contrasts: NDArray[np.floating] | list[float] | list[list[float]], n_levels: int, *, normalize: bool = True) -> NDArray[np.float64]

Convert user-specified contrasts to a coding matrix for design matrices.

This function converts “human-readable” contrast specifications (where each row represents a hypothesis like “A vs average(B, C)”) into a coding matrix suitable for use in regression design matrices.

The algorithm uses QR decomposition to auto-complete under-specified contrasts with orthogonal contrasts, following the approach from R’s gmodels::make.contrasts() and pymer4’s con2R().

Parameters:

Name	Type	Description	Default
`contrasts`	`NDArray[floating] \| list[float] \| list[list[float]]`	User-specified contrasts as: - 1D array/list: Single contrast vector of length n_levels - 2D array/list: Multiple contrasts, shape (n_contrasts, n_levels) Each row sums to zero for valid contrasts.	required
`n_levels`	`int`	Number of factor levels. Must match contrast dimensions.	required
`normalize`	`bool`	If True, normalize each contrast vector by its L2 norm before conversion. This puts contrasts in standard-deviation units similar to orthogonal polynomial contrasts.	`True`

Returns:

Type	Description
`NDArray[float64]`	Coding matrix of shape (n_levels, n_levels - 1). Each row corresponds to a factor level, each column to a design matrix column.

Examples:

>>> # Single contrast: A vs average(B, C)
>>> array_to_coding_matrix([-1, 0.5, 0.5], n_levels=3)
array([[-0.81649658,  0.        ],
       [ 0.40824829, -0.70710678],
       [ 0.40824829,  0.70710678]])

>>> # Multiple contrasts: A vs B, and (A,B) vs C
>>> array_to_coding_matrix([[-1, 1, 0], [-0.5, -0.5, 1]], n_levels=3)
array([[-0.5       , -0.28867513],
       [ 0.5       , -0.28867513],
       [ 0.        ,  0.57735027]])

Note: The returned matrix has n_levels-1 columns because one degree of freedom is absorbed by the intercept. If you specify fewer than n_levels-1 contrasts, the remaining columns are auto-completed with orthogonal contrasts via QR decomposition.
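The auto-completion step mentioned in the note can be illustrated with a small NumPy sketch. This is not the library's implementation: `complete_contrasts` is a hypothetical helper that shows only how a full QR decomposition can supply extra rows orthogonal to both the intercept direction and the user's contrasts. It does not normalize the input or convert the result into a coding matrix, so its output is not expected to match the examples above.

```python
import numpy as np

def complete_contrasts(contrasts, n_levels):
    """Pad user contrasts with orthonormal rows (illustrative, not bossanova API)."""
    C = np.atleast_2d(np.asarray(contrasts, dtype=float))
    if C.shape[1] != n_levels:
        raise ValueError("each contrast needs n_levels entries")
    # Columns: the intercept direction first, then the user contrasts.
    A = np.column_stack([np.ones(n_levels), C.T])
    # A full QR gives an orthonormal basis of R^n; the trailing columns are
    # orthogonal to every column of A (assumes A has full column rank).
    Q, _ = np.linalg.qr(A, mode="complete")
    extra = Q[:, A.shape[1]:]
    return np.vstack([C, extra.T])

# One user contrast for three levels leaves one row to be filled in.
H = complete_contrasts([-1, 0.5, 0.5], n_levels=3)
```

The added row sums to zero and is orthogonal to the user contrast; its sign depends on the QR routine and is arbitrary.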

build_random_effects¶

build_random_effects(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, re_structures_list: list[str] | None = None, group_levels_list: list[list[str]] | None = None, term_permutation: NDArray[np.intp] | None = None) -> RandomEffectsInfo

Build complete random effects specification.

High-level function that constructs the Z matrix and packages all metadata into a RandomEffectsInfo container ready for lmer/glmer consumption.

Parameters:

Name	Type	Description	Default
`group_ids_list`	`list[NDArray[intp]]`	Group ID arrays for each factor.	required
`n_groups_list`	`list[int]`	Number of groups per factor.	required
`group_names`	`list[str]`	Names of grouping factors.	required
`random_names`	`list[str]`	Names of random effect terms.	required
`re_structure`	`str`	Overall structure type: - “intercept”: random intercept only - “slope”: correlated intercept + slopes - “diagonal”: uncorrelated intercept + slopes - “nested”: nested hierarchy - “crossed”: crossed factors	required
`X_re`	`NDArray[float64] \| list[NDArray[float64]] \| None`	Random effects covariates (for slopes).	`None`
`re_structures_list`	`list[str] \| None`	Per-factor structure (for mixed).	`None`
`group_levels_list`	`list[list[str]] \| None`	Level names per factor (for labels).	`None`
`term_permutation`	`NDArray[intp] \| None`	Block ordering permutation.	`None`

Returns:

Type	Description
`RandomEffectsInfo`	RandomEffectsInfo with Z matrix and all metadata.

Examples:

>>> # (Days|Subject) with 18 subjects
>>> group_ids = np.arange(180) // 10
>>> n_groups = 18
>>> X_re = np.column_stack([np.ones(180), np.tile(np.arange(10), 18)])
>>> info = build_random_effects(
...     group_ids_list=[group_ids],
...     n_groups_list=[n_groups],
...     group_names=["Subject"],
...     random_names=["Intercept", "Days"],
...     re_structure="slope",
...     X_re=X_re,
... )
>>> info.Z.shape
(180, 36)  # 18 subjects * 2 RE

build_reference_design_matrix¶

build_reference_design_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, levels: list[str], X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build design matrix for reference grid points.

Creates an X_ref matrix with one row per focal variable level. Each row represents a reference point where the focal variable is set to that level and all other covariates are set to their reference values.

Reference value conventions:

Intercept: 1.0
Focal variable dummies: 1.0 if matching level, 0.0 otherwise
Continuous covariates: column mean from X_means
Non-focal categorical dummies: column mean (marginalize over observed proportions)

Parameters:

Name	Type	Description	Default
`X_names`	`tuple[str, ...] \| list[str]`	Column names from the design matrix, in order.	required
`focal_var`	`str`	Name of the categorical variable to vary across levels.	required
`levels`	`list[str]`	List of levels for the focal variable, defining row order.	required
`X_means`	`ndarray`	Column means of the original design matrix, shape (p,). Used for continuous covariate reference values.	required
`set_categoricals`	`dict[str, str] \| None`	Optional dict mapping non-focal categorical variable names to specific levels to pin them at (instead of marginalizing at X_means). E.g. `{"Ethnicity": "Asian"}` sets the Ethnicity dummies to indicator values for “Asian”.	`None`

Returns:

Type	Description
`ndarray`	Reference design matrix X_ref, shape (n_levels, p).

Examples:

Compute X_ref for treatment EMMs:

X_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
X_means = np.array([1.0, 2.5, 0.33, 0.33])  # means from data
levels = ["ref", "A", "B"]

X_ref = build_reference_design_matrix(X_names, "treatment", levels, X_means)
# X_ref[0] = [1.0, 2.5, 0.0, 0.0]  # reference level
# X_ref[1] = [1.0, 2.5, 1.0, 0.0]  # level A
# X_ref[2] = [1.0, 2.5, 0.0, 1.0]  # level B

Note: The first level is typically the reference level (omitted from dummy coding), so its row has all 0s for the focal variable dummies.
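The reference value conventions listed above can be sketched in a few lines of NumPy. The function below is a hypothetical stand-in, not the library's code: it assumes column names follow the `var[level]` pattern, ignores interaction columns, and does not support `set_categoricals`.

```python
import re
import numpy as np

# Matches a plain dummy column such as "treatment[A]" (no interactions).
DUMMY = re.compile(r"^(?P<var>[^\[\]:]+)\[(?P<level>[^\]]+)\]$")

def sketch_reference_matrix(X_names, focal_var, levels, X_means):
    """One row per focal level; other columns stay at their means (illustrative)."""
    # Start every row at the column means (continuous and non-focal dummies).
    X_ref = np.tile(np.asarray(X_means, dtype=float), (len(levels), 1))
    for j, name in enumerate(X_names):
        if name == "Intercept":
            X_ref[:, j] = 1.0
            continue
        m = DUMMY.match(name)
        if m and m.group("var") == focal_var:
            # Focal dummies become 0/1 indicators of the row's level.
            X_ref[:, j] = [1.0 if lvl == m.group("level") else 0.0 for lvl in levels]
    return X_ref

X_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
X_means = np.array([1.0, 2.5, 0.33, 0.33])
X_ref = sketch_reference_matrix(X_names, "treatment", ["ref", "A", "B"], X_means)
```

Multiplying `X_ref` by a fitted coefficient vector would then give one predicted value per focal level.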

build_reference_row¶

build_reference_row(X_names: tuple[str, ...] | list[str], focal_var: str, focal_level: str, X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build a single row of the reference design matrix.

Creates one reference point where the focal variable is set to the specified level and other covariates are at reference values.

For interaction columns involving the focal variable (e.g., Income:Student[Yes] when focal_var="Student"), the value is computed as the product of component values rather than using the empirical mean of the interaction column.

Parameters:

Name	Type	Description	Default
`X_names`	`tuple[str, ...] \| list[str]`	Column names from the design matrix.	required
`focal_var`	`str`	Name of the focal categorical variable.	required
`focal_level`	`str`	Level value to set for the focal variable.	required
`X_means`	`ndarray`	Column means for continuous covariate reference values.	required
`set_categoricals`	`dict[str, str] \| None`	Optional dict mapping non-focal categorical variable names to specific levels for indicator encoding. When a non-focal categorical’s `base_term` matches a key, the dummy is set to `1.0` if the level matches, `0.0` otherwise (instead of using the column mean for marginalization).	`None`

Returns:

Type	Description
`ndarray`	Reference row, shape (p,).

build_slope_reference_matrix¶

build_slope_reference_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, X_means: np.ndarray, *, delta: float = 1.0) -> tuple[np.ndarray, np.ndarray]

Build reference matrices for computing marginal slopes.

Creates two reference rows: one at the mean and one at mean + delta for the focal continuous variable. The slope is then (y1 - y0) / delta.

For interaction columns involving the focal variable (e.g., x:z when focal_var="x"), the interaction value is properly computed as a product of component values, so the perturbed row reflects the interaction contribution to the slope.

Parameters:

Name	Type	Description	Default
`X_names`	`tuple[str, ...] \| list[str]`	Column names from the design matrix.	required
`focal_var`	`str`	Name of the continuous variable for slope computation.	required
`X_means`	`ndarray`	Column means of the original design matrix.	required
`delta`	`float`	Step size for numerical differentiation (default 1.0).	`1.0`

Returns:

Type	Description
`tuple[ndarray, ndarray]`	Tuple of (X_ref_0, X_ref_1), where X_ref_0 is the reference point at the mean of focal_var and X_ref_1 is the reference point at mean + delta of focal_var.
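To make the finite-difference idea concrete, here is a minimal sketch under simplifying assumptions: every component of an interaction also appears as its own column, and names are split on `:`. `sketch_slope_rows` and the coefficient vector `beta` are hypothetical and only illustrate why the interaction column must be recomputed as a product.

```python
import numpy as np

def sketch_slope_rows(X_names, focal_var, X_means, delta=1.0):
    """Rows at mean and mean + delta of focal_var (illustrative, not bossanova API)."""
    means = dict(zip(X_names, np.asarray(X_means, dtype=float)))

    def row(focal_value):
        values = dict(means)
        values[focal_var] = focal_value
        out = []
        for name in X_names:
            parts = name.split(":")
            if len(parts) > 1 and focal_var in parts:
                # Interaction with the focal variable: product of components.
                out.append(np.prod([values[p] for p in parts]))
            else:
                out.append(values[name])
        return np.array(out)

    base = means[focal_var]
    return row(base), row(base + delta)

X_names = ("Intercept", "x", "z", "x:z")
X_means = np.array([1.0, 2.0, 3.0, 6.5])
beta = np.array([0.5, 1.0, -2.0, 0.25])  # made-up coefficients

x0, x1 = sketch_slope_rows(X_names, "x", X_means, delta=1.0)
slope = (x1 - x0) @ beta / 1.0  # beta_x + beta_xz * mean(z)
```

With these made-up numbers the slope is 1.0 + 0.25 * 3.0 = 1.75, which is the main effect of `x` plus the interaction evaluated at the mean of `z`.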

build_z_crossed¶

build_z_crossed(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None, layouts: list[str] | None = None) -> sp.csc_matrix

Build Z matrix for crossed random effects.

Crossed effects like (1|subject) + (1|item) create independent random effects for each factor. The Z matrix is a horizontal concatenation of Z matrices for each factor.

Parameters:

Name	Type	Description	Default
`group_ids_list`	`list[NDArray[intp]]`	List of group ID arrays, one per factor.	required
`n_groups_list`	`list[int]`	Number of groups per factor.	required
`X_re_list`	`list[NDArray[float64] \| None] \| None`	Random effects design per factor, or None for intercepts.	`None`
`layouts`	`list[str] \| None`	Layout per factor. Default: interleaved for all.	`None`

Returns:

Type	Description
`csc_matrix`	Sparse Z matrix, shape (n, sum(n_groups_i * n_re_i)).

Examples:

>>> # (1|subject) + (1|item) with 3 subjects, 4 items
>>> subj_ids = np.array([0, 1, 2, 0, 1, 2])
>>> item_ids = np.array([0, 0, 0, 1, 1, 1])
>>> Z = build_z_crossed(
...     group_ids_list=[subj_ids, item_ids],
...     n_groups_list=[3, 4]
... )
>>> Z.shape
(6, 7)  # 3 subject + 4 item columns
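The horizontal concatenation can be pictured with dense indicator blocks. This is only an illustration for the intercept-only case: the library returns a sparse matrix, and `indicator_block` is a hypothetical helper.

```python
import numpy as np

def indicator_block(group_ids, n_groups):
    """Dense one-hot block for one grouping factor (illustrative only)."""
    return np.eye(n_groups)[np.asarray(group_ids)]

subj_ids = np.array([0, 1, 2, 0, 1, 2])
item_ids = np.array([0, 0, 0, 1, 1, 1])

# One block per factor, placed side by side.
Z_dense = np.hstack([indicator_block(subj_ids, 3), indicator_block(item_ids, 4)])
```

Each row has exactly two nonzero entries, one per factor. The columns for items 2 and 3 are all zero because those items never occur in this toy data.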

build_z_nested¶

build_z_nested(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None) -> sp.csc_matrix

Build Z matrix for nested random effects.

Nested effects like (1|school/class) create separate random intercepts for each level of the hierarchy. The Z matrix is a horizontal concatenation of Z matrices for each level.

Parameters:

Name	Type	Description	Default
`group_ids_list`	`list[NDArray[intp]]`	List of group ID arrays, ordered [inner, ..., outer]. For (1\|school/class): [class_ids, school_ids].	required
`n_groups_list`	`list[int]`	Number of groups at each level.	required
`X_re_list`	`list[NDArray[float64] \| None] \| None`	Random effects design per level, or None for intercepts.	`None`

Returns:

Type	Description
`csc_matrix`	Sparse Z matrix, shape (n, sum(n_groups_i * n_re_i)).

Examples:

>>> # (1|school/class) with 2 schools, 4 classes total
>>> class_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
>>> school_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
>>> Z = build_z_nested(
...     group_ids_list=[class_ids, school_ids],
...     n_groups_list=[4, 2]
... )
>>> Z.shape
(8, 6)  # 4 class columns + 2 school columns

build_z_simple¶

build_z_simple(group_ids: NDArray[np.intp], n_groups: int, X_re: NDArray[np.float64] | None = None, layout: Literal['interleaved', 'blocked'] = 'interleaved') -> sp.csc_matrix

Build Z matrix for single grouping factor.

Constructs Z directly in sparse COO format without dense intermediates. For large-scale data (e.g., InstEval with 73k obs x 4k groups), this uses O(n x n_re) memory instead of O(n x n_groups x n_re).

Handles intercept-only, correlated slopes, and uncorrelated slopes by varying the X_re input and layout parameter.

Parameters:

Name	Type	Description	Default
`group_ids`	`NDArray[intp]`	Array of group assignments, shape (n,), values 0..n_groups-1.	required
`n_groups`	`int`	Total number of groups.	required
`X_re`	`NDArray[float64] \| None`	Random effects design matrix, shape (n, n_re). - None or column of 1s: intercept only - Multiple columns: intercept + slopes	`None`
`layout`	`Literal['interleaved', 'blocked']`	Column ordering. - “interleaved”: [g1_int, g1_slope, g2_int, g2_slope, ...] - “blocked”: [g1_int, g2_int, ..., g1_slope, g2_slope, ...]	`'interleaved'`

Returns:

Type	Description
`csc_matrix`	Sparse Z matrix in CSC format, shape (n, n_groups * n_re).

Examples:

>>> # Random intercept only
>>> group_ids = np.array([0, 0, 1, 1])
>>> Z = build_z_simple(group_ids, n_groups=2)
>>> Z.shape
(4, 2)

>>> # Random intercept + slope (correlated)
>>> X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="interleaved")
>>> Z.shape
(4, 4)

>>> # Uncorrelated random effects
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="blocked")
>>> Z.shape
(4, 4)
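The two layouts differ only in how a (group, effect) pair maps to a column index. The sketch below computes COO-style triplets with NumPy alone. It is an assumption about how such a construction could look, not the library's code. The triplets could be passed to `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=shape)`; here they are expanded to a dense array purely for inspection.

```python
import numpy as np

def z_triplets(group_ids, n_groups, X_re=None, layout="interleaved"):
    """Return (rows, cols, vals, shape) for a single-factor Z (illustrative)."""
    group_ids = np.asarray(group_ids)
    n = group_ids.shape[0]
    X_re = np.ones((n, 1)) if X_re is None else np.asarray(X_re, dtype=float)
    n_re = X_re.shape[1]
    rows = np.repeat(np.arange(n), n_re)   # each observation fills n_re entries
    k = np.tile(np.arange(n_re), n)        # effect index within the group
    g = np.repeat(group_ids, n_re)         # group index per entry
    if layout == "interleaved":
        cols = g * n_re + k
    elif layout == "blocked":
        cols = k * n_groups + g
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    return rows, cols, X_re.ravel(), (n, n_groups * n_re)

def to_dense(rows, cols, vals, shape):
    Z = np.zeros(shape)
    Z[rows, cols] = vals
    return Z

group_ids = np.array([0, 0, 1, 1])
X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
Z_interleaved = to_dense(*z_triplets(group_ids, 2, X_re, "interleaved"))
Z_blocked = to_dense(*z_triplets(group_ids, 2, X_re, "blocked"))
```

Only n * n_re values are ever stored, which is the memory saving described above.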

convert_coding_to_hypothesis¶

convert_coding_to_hypothesis(coding_matrix: NDArray[np.float64]) -> NDArray[np.float64]

Convert a coding matrix back to interpretable hypothesis contrasts.

This is the inverse of array_to_coding_matrix. Given a coding matrix (n_levels, n_levels-1), returns the hypothesis matrix where each row represents the linear combination of factor levels being compared.

Parameters:

Name	Type	Description	Default
`coding_matrix`	`NDArray[float64]`	Coding matrix of shape (n_levels, n_levels - 1).	required

Returns:

Type	Description
`NDArray[float64]`	Hypothesis matrix of shape (n_levels - 1, n_levels). Each row represents a contrast hypothesis (coefficients for factor levels).

Examples:

>>> cm = treatment_coding(['A', 'B', 'C'])
>>> convert_coding_to_hypothesis(cm)
array([[-1.,  1.,  0.],
       [-1.,  0.,  1.]])
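One standard way to recover hypothesis weights from a coding matrix is to prepend an intercept column, take the (pseudo-)inverse, and drop the intercept row. The sketch below assumes that approach and happens to reproduce the treatment-coding output above. Whether the library computes it this way is not confirmed by this page, and `sketch_hypothesis` is a hypothetical name.

```python
import numpy as np

def sketch_hypothesis(coding_matrix):
    """Hypothesis rows via the pseudo-inverse of [1, C] (illustrative)."""
    C = np.asarray(coding_matrix, dtype=float)
    X = np.column_stack([np.ones(C.shape[0]), C])
    # Row 0 of the inverse is the intercept's hypothesis; the rest are contrasts.
    return np.linalg.pinv(X)[1:, :]

# Treatment coding for three levels with the first level as reference.
treatment = np.eye(3)[:, 1:]
H = sketch_hypothesis(treatment)
```

Each row of `H` gives the weights on the level means that the corresponding coefficient estimates, here B minus A and C minus A.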

extract_base_term¶

extract_base_term(name: str) -> str

Extract base term name from column name.

For categorical variables, strips the level suffix. For interactions, extracts base terms without levels.

Parameters:

Name	Type	Description	Default
`name`	`str`	Column name from design matrix.	required

Returns:

Type	Description
`str`	Base term name.

Examples:

>>> extract_base_term("treatment[A]")
'treatment'
>>> extract_base_term("x")
'x'
>>> extract_base_term("treatment[A]:x")
'treatment:x'

extract_categorical_variables¶

extract_categorical_variables(X_names: tuple[str, ...] | list[str]) -> set[str]

Find all categorical base variable names from design matrix columns.

Scans column names for bracket patterns and extracts unique base terms.

Parameters:

Name	Type	Description	Default
`X_names`	`tuple[str, ...] \| list[str]`	Column names from design matrix.	required

Returns:

Type	Description
`set[str]`	Set of categorical variable names (base terms, no levels).

Examples:

>>> extract_categorical_variables(["Intercept", "x", "treatment[A]", "treatment[B]"])
{'treatment'}

extract_level_from_column¶

extract_level_from_column(name: str, focal_var: str) -> str | None

Extract level value for a specific focal variable from column name.

Used when building reference grids to identify which column corresponds to which level of the focal variable.

Parameters:

Name	Type	Description	Default
`name`	`str`	Column name (e.g., “treatment[A]” or “treatment[A]:x”).	required
`focal_var`	`str`	The focal variable name (e.g., “treatment”).	required

Returns:

Type	Description
`str \| None`	Level value if column is for focal_var, else None.

Examples:

>>> extract_level_from_column("treatment[A]", "treatment")
'A'
>>> extract_level_from_column("treatment[A]:x", "treatment")
'A'
>>> extract_level_from_column("x", "treatment")
None
>>> extract_level_from_column("group[1]", "treatment")
None

helmert_coding¶

helmert_coding(levels: list[str]) -> NDArray[np.float64]

Build Helmert contrast matrix.

Helmert coding compares each level to the mean of all previous levels. Column j contrasts level j+1 against the average of levels 0..j.

This is equivalent to R’s contr.helmert() (scaled to unit contrasts).

Matrix structure for 4 levels:

Contrast   | A       | B       | C       | D
-----------|---------|---------|---------|--------
B vs A     | -1/2    |  1/2    |  0      |  0
C vs A,B   | -1/3    | -1/3    |  2/3    |  0
D vs A,B,C | -1/4    | -1/4    | -1/4    |  3/4

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names. Must have >= 2 elements.	required

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> helmert_coding(['A', 'B', 'C'])
array([[-0.5       , -0.33333333],
       [ 0.5       , -0.33333333],
       [ 0.        ,  0.66666667]])
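The pattern in the table and the example can be generated directly. The following sketch rebuilds it from the description (column j puts -1/(j+2) on levels 0..j and (j+1)/(j+2) on level j+1). `sketch_helmert_coding` is a hypothetical name that takes the level count instead of the level names.

```python
import numpy as np

def sketch_helmert_coding(n_levels):
    """Scaled Helmert matrix matching the documented pattern (illustrative)."""
    C = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        C[: j + 1, j] = -1.0 / (j + 2)      # earlier levels share the negative weight
        C[j + 1, j] = (j + 1.0) / (j + 2)   # the level being compared
    return C

C3 = sketch_helmert_coding(3)
C4 = sketch_helmert_coding(4)
```

Every column sums to zero, so each contrast is orthogonal to the intercept.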

helmert_coding_labels¶

helmert_coding_labels(levels: list[str]) -> list[str]

Get column labels for Helmert contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required

Returns:

Type	Description
`list[str]`	List of labels like [‘B vs prev’, ‘C vs prev’, ...].

Examples:

>>> helmert_coding_labels(['A', 'B', 'C'])
['B vs prev', 'C vs prev']

identify_column_type¶

identify_column_type(name: str) -> Literal['intercept', 'continuous', 'categorical']

Identify column type from name (simplified version).

This is a lightweight alternative to parse_design_column_name() when only the type is needed.

Parameters:

Name	Type	Description	Default
`name`	`str`	Column name from design matrix.	required

Returns:

Type	Description
`Literal['intercept', 'continuous', 'categorical']`	Column type as string literal.

parse_design_column_name¶

parse_design_column_name(name: str) -> DesignColumnInfo

Parse design matrix column name into components.

Handles standard R/formula naming conventions:

“Intercept” → intercept type
“x” → continuous variable
“treatment[A]” → categorical dummy for level A
“x:z” → continuous interaction
“treatment[A]:x” → categorical × continuous interaction

Parameters:

Name	Type	Description	Default
`name`	`str`	Column name from design matrix.	required

Returns:

Type	Description
`DesignColumnInfo`	DesignColumnInfo with parsed components.

Examples:

>>> parse_design_column_name("Intercept")
DesignColumnInfo(raw_name='Intercept', base_term='Intercept',
                 level=None, column_type='intercept', is_interaction=False)

>>> parse_design_column_name("x")
DesignColumnInfo(raw_name='x', base_term='x',
                 level=None, column_type='continuous', is_interaction=False)

>>> parse_design_column_name("treatment[A]")
DesignColumnInfo(raw_name='treatment[A]', base_term='treatment',
                 level='A', column_type='categorical', is_interaction=False)

>>> parse_design_column_name("treatment[A]:x")
DesignColumnInfo(raw_name='treatment[A]:x', base_term='treatment:x',
                 level='A', column_type='categorical', is_interaction=True)
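The naming conventions above are regular enough to parse with a short regular expression. The sketch below is a hypothetical parser that returns a plain dict rather than a `DesignColumnInfo`. It assumes level names contain neither `:` nor `]`, and it reports only the first level found in an interaction.

```python
import re

LEVEL = re.compile(r"\[([^\]]*)\]")  # captures the text inside [...]

def sketch_parse(name):
    """Split a design column name into components (illustrative only)."""
    if name == "Intercept":
        return {"raw_name": name, "base_term": name, "level": None,
                "column_type": "intercept", "is_interaction": False}
    parts = name.split(":")
    levels = LEVEL.findall(name)
    # Strip the [level] suffix from each component, then rejoin with ':'.
    base_term = ":".join(LEVEL.sub("", part) for part in parts)
    return {
        "raw_name": name,
        "base_term": base_term,
        "level": levels[0] if levels else None,
        "column_type": "categorical" if levels else "continuous",
        "is_interaction": len(parts) > 1,
    }

info = sketch_parse("treatment[A]:x")
```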

poly_coding¶

poly_coding(levels: list[str]) -> NDArray[np.float64]

Build orthogonal polynomial contrast matrix.

Polynomial coding creates orthogonal contrasts representing linear, quadratic, cubic, etc. trends across ordered factor levels. This is equivalent to R’s contr.poly() function.

The contrasts are orthonormal (orthogonal and unit length), making them suitable for testing polynomial trends in ordered categorical variables.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names. The order determines the polynomial evaluation points.	required

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Column 0 is linear (.L), column 1 is quadratic (.Q), etc.

Examples:

>>> poly_coding(['low', 'medium', 'high'])
array([[-0.70710678,  0.40824829],
       [ 0.        , -0.81649658],
       [ 0.70710678,  0.40824829]])

>>> poly_coding(['A', 'B', 'C', 'D'])
array([[-0.67082039,  0.5       , -0.2236068 ],
       [-0.2236068 , -0.5       ,  0.67082039],
       [ 0.2236068 , -0.5       , -0.67082039],
       [ 0.67082039,  0.5       ,  0.2236068 ]])

Note: Level names are not used in computation - only the count and order matter. The polynomial is evaluated at equally-spaced points 1, 2, ..., n.
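Orthonormal polynomial contrasts of this kind can be obtained by orthogonalizing a Vandermonde matrix of the centered evaluation points. The sketch below uses a QR decomposition for that. It reproduces the values shown above but is offered as an illustration, not as the library's implementation, and `sketch_poly_coding` is a hypothetical name.

```python
import numpy as np

def sketch_poly_coding(n_levels):
    """Orthonormal polynomial contrasts via QR of a Vandermonde matrix (illustrative)."""
    x = np.arange(1, n_levels + 1, dtype=float)
    x -= x.mean()                                   # center the points 1..n
    V = np.vander(x, n_levels, increasing=True)     # columns: x^0, x^1, x^2, ...
    Q, R = np.linalg.qr(V)
    # Fix signs so every column points the same way as Gram-Schmidt would.
    Q = Q * np.sign(np.diag(R))
    return Q[:, 1:]                                 # drop the constant column

P3 = sketch_poly_coding(3)
P4 = sketch_poly_coding(4)
```

The columns are mutually orthogonal, have unit length, and sum to zero, which is what makes the linear, quadratic, and cubic trends separately testable.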

poly_coding_labels¶

poly_coding_labels(levels: list[str]) -> list[str]

Get column labels for polynomial contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required

Returns:

Type	Description
`list[str]`	List of polynomial degree labels: `['.L', '.Q', '.C', '^4', '^5', ...]`.

sequential_coding¶

sequential_coding(levels: list[str]) -> NDArray[np.float64]

Build sequential (successive differences) contrast matrix.

Sequential coding compares each level to the previous level, producing contrasts that capture successive differences. This is equivalent to R’s MASS::contr.sdif() function.

The matrix is constructed so that each column j represents the difference between level j+1 and level j. The resulting coefficients in a regression model estimate these successive differences directly.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Column j represents the contrast: level[j+1] - level[j].

Examples:

>>> sequential_coding(['A', 'B', 'C'])
array([[-0.66666667, -0.33333333],
       [ 0.33333333, -0.33333333],
       [ 0.33333333,  0.66666667]])

>>> sequential_coding(['low', 'medium', 'high', 'very_high'])
array([[-0.75, -0.5 , -0.25],
       [ 0.25, -0.5 , -0.25],
       [ 0.25,  0.5 , -0.25],
       [ 0.25,  0.5 ,  0.75]])

Note: This coding is most meaningful for ordered factors where you want to estimate the “step” from one level to the next. Unlike polynomial contrasts, it does not assume equally-spaced levels.

The matrix structure ensures that multiplying by the coefficient vector gives interpretable successive differences.
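A short numerical check shows why these coefficients read as successive differences. The sketch rebuilds the matrix from the pattern in the examples above and solves for the coefficients that reproduce a made-up vector of level means. `sketch_sequential_coding` is a hypothetical name that takes the level count.

```python
import numpy as np

def sketch_sequential_coding(n_levels):
    """Successive-difference coding matching the documented values (illustrative)."""
    k = n_levels
    C = np.empty((k, k - 1))
    for j in range(k - 1):
        C[: j + 1, j] = -(k - j - 1) / k   # levels up to j
        C[j + 1 :, j] = (j + 1) / k        # levels after j
    return C

C = sketch_sequential_coding(3)
level_means = np.array([10.0, 12.0, 17.0])   # made-up means for A, B, C
X = np.column_stack([np.ones(3), C])
coef = np.linalg.solve(X, level_means)
```

The intercept comes out as the grand mean (13), and the two remaining coefficients are B - A = 2 and C - B = 5.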

sequential_coding_labels¶

sequential_coding_labels(levels: list[str]) -> list[str]

Get column labels for sequential contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required

Returns:

Type	Description
`list[str]`	List of successive difference labels like [‘B-A’, ‘C-B’, ...].

Examples:

>>> sequential_coding_labels(['A', 'B', 'C'])
['B-A', 'C-B']

>>> sequential_coding_labels(['low', 'medium', 'high'])
['medium-low', 'high-medium']

sum_coding¶

sum_coding(levels: list[str], omit: str | None = None) -> NDArray[np.float64]

Build sum (effects) contrast matrix.

Sum coding sets the omitted level to all -1s, and each other level gets a one-hot encoded row. This centers the effects around zero, making coefficients interpretable as deviations from the grand mean.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required
`omit`	`str \| None`	Level to omit (gets -1s). Defaults to last level.	`None`

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> sum_coding(['A', 'B', 'C'])
array([[ 1.,  0.],
       [ 0.,  1.],
       [-1., -1.]])

>>> sum_coding(['A', 'B', 'C'], omit='A')
array([[-1., -1.],
       [ 1.,  0.],
       [ 0.,  1.]])
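The grand-mean interpretation can be verified numerically. The sketch below rebuilds the sum-coding matrix from the description and solves for the coefficients that reproduce a made-up vector of level means. `sketch_sum_coding` is a hypothetical re-implementation for illustration.

```python
import numpy as np

def sketch_sum_coding(levels, omit=None):
    """Sum coding: one-hot rows, with the omitted level set to -1 (illustrative)."""
    omit = levels[-1] if omit is None else omit
    idx = levels.index(omit)
    C = np.delete(np.eye(len(levels)), idx, axis=1)  # drop the omitted level's column
    C[idx, :] = -1.0
    return C

levels = ["A", "B", "C"]
level_means = np.array([10.0, 12.0, 17.0])   # made-up means for A, B, C
X = np.column_stack([np.ones(3), sketch_sum_coding(levels)])
coef = np.linalg.solve(X, level_means)
```

Here the intercept equals the grand mean (13), and the coefficients for A and B are their deviations from it (-3 and -1). The deviation for the omitted level C is the negative of their sum (+4).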

sum_coding_labels¶

sum_coding_labels(levels: list[str], omit: str | None = None) -> list[str]

Get column labels for sum contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required
`omit`	`str \| None`	Level to omit. Defaults to last level.	`None`

Returns:

Type	Description
`list[str]`	List of non-omitted level names (column labels).

treatment_coding¶

treatment_coding(levels: list[str], reference: str | None = None) -> NDArray[np.float64]

Build treatment (dummy) contrast matrix.

Treatment coding sets the reference level to all zeros, and each other level gets a one-hot encoded row. This is the most common coding for regression models with an intercept.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required
`reference`	`str \| None`	Reference level name. Defaults to first level.	`None`

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> treatment_coding(['A', 'B', 'C'])
array([[0., 0.],
       [1., 0.],
       [0., 1.]])

>>> treatment_coding(['A', 'B', 'C'], reference='B')
array([[1., 0.],
       [0., 0.],
       [0., 1.]])

treatment_coding_labels¶

treatment_coding_labels(levels: list[str], reference: str | None = None) -> list[str]

Get column labels for treatment contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required
`reference`	`str \| None`	Reference level name. Defaults to first level.	`None`

Returns:

Type	Description
`list[str]`	List of non-reference level names (column labels).

Modules¶

coding¶

Contrast matrix builders for categorical variable encoding.

This module provides functions to create contrast matrices for encoding categorical variables in design matrices. These are distinct from the EMM contrast matrices.

Key concept: A contrast matrix maps k categorical levels to k-1 columns in the design matrix (assuming an intercept absorbs one degree of freedom).

Treatment (dummy coding): Reference level = 0, others = one-hot
Sum (effects coding): Omitted level = -1s, others = one-hot
Poly (orthogonal polynomial): Linear, quadratic, cubic, etc. trends
Custom: User-specified contrast vectors converted via array_to_coding_matrix

Examples:

>>> from coding import array_to_coding_matrix, treatment_coding, sum_coding, poly_coding
>>> treatment_coding(['A', 'B', 'C'])
array([[0., 0.],
       [1., 0.],
       [0., 1.]])
>>> sum_coding(['A', 'B', 'C'])
array([[ 1.,  0.],
       [ 0.,  1.],
       [-1., -1.]])
>>> poly_coding(['A', 'B', 'C'])  # Linear and quadratic trends
array([[-0.707...,  0.408...],
       [ 0.   ..., -0.816...],
       [ 0.707...,  0.408...]])
>>> # Custom contrast: A vs average(B, C)
>>> array_to_coding_matrix([[-1, 0.5, 0.5]], n_levels=3)
array([[-0.816..., ...],
       [ 0.408..., ...],
       [ 0.408..., ...]])

Functions:

Name	Description
`array_to_coding_matrix`	Convert user-specified contrasts to a coding matrix for design matrices.
`convert_coding_to_hypothesis`	Convert a coding matrix back to interpretable hypothesis contrasts.
`helmert_coding`	Build Helmert contrast matrix.
`helmert_coding_labels`	Get column labels for Helmert contrast.
`poly_coding`	Build orthogonal polynomial contrast matrix.
`poly_coding_labels`	Get column labels for polynomial contrast.
`sequential_coding`	Build sequential (successive differences) contrast matrix.
`sequential_coding_labels`	Get column labels for sequential contrast.
`sum_coding`	Build sum (effects) contrast matrix.
`sum_coding_labels`	Get column labels for sum contrast.
`treatment_coding`	Build treatment (dummy) contrast matrix.
`treatment_coding_labels`	Get column labels for treatment contrast.

Functions¶

array_to_coding_matrix¶

array_to_coding_matrix(contrasts: NDArray[np.floating] | list[float] | list[list[float]], n_levels: int, *, normalize: bool = True) -> NDArray[np.float64]

Convert user-specified contrasts to a coding matrix for design matrices.

The algorithm uses QR decomposition to auto-complete under-specified contrasts with orthogonal contrasts, following the approach from R’s gmodels::make.contrasts() and pymer4’s con2R().

Parameters:

Name	Type	Description	Default
`contrasts`	`NDArray[floating] \| list[float] \| list[list[float]]`	User-specified contrasts, given either as a 1D array/list (a single contrast vector of length n_levels) or as a 2D array/list (multiple contrasts, shape (n_contrasts, n_levels)). For valid contrasts, each row sums to zero.	required
`n_levels`	`int`	Number of factor levels. Must match contrast dimensions.	required
`normalize`	`bool`	If True, normalize each contrast vector by its L2 norm before conversion. This puts contrasts in standard-deviation units similar to orthogonal polynomial contrasts.	`True`

Returns:

Type	Description
`NDArray[float64]`	Coding matrix of shape (n_levels, n_levels - 1). Each row corresponds to a factor level, each column to a design matrix column.

Examples:

>>> # Single contrast: A vs average(B, C)
>>> array_to_coding_matrix([-1, 0.5, 0.5], n_levels=3)
array([[-0.81649658,  0.        ],
       [ 0.40824829, -0.70710678],
       [ 0.40824829,  0.70710678]])

>>> # Multiple contrasts: A vs B, and (A,B) vs C
>>> array_to_coding_matrix([[-1, 1, 0], [-0.5, -0.5, 1]], n_levels=3)
array([[-0.5       , -0.28867513],
       [ 0.5       , -0.28867513],
       [ 0.        ,  0.57735027]])
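
The QR-based auto-completion described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `complete_contrasts` is hypothetical, the contrasts are assumed to be linearly independent and to sum to zero, and the coding matrix is taken as the pseudo-inverse of the completed hypothesis matrix.

```python
import numpy as np

def complete_contrasts(contrasts, n_levels):
    """Hypothetical helper: complete user contrasts and return a coding matrix."""
    C = np.atleast_2d(np.asarray(contrasts, dtype=float))  # (m, n_levels)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)       # L2-normalize each contrast
    m = C.shape[0]
    # Complete QR of [1 | C^T]: the trailing columns of Q are orthogonal to the
    # intercept direction and to every user contrast.
    A = np.column_stack([np.ones(n_levels), C.T])
    Q, _ = np.linalg.qr(A, mode="complete")
    H = np.vstack([C, Q[:, 1 + m:].T])                     # (n_levels - 1, n_levels)
    return np.linalg.pinv(H)                               # (n_levels, n_levels - 1)

coding = complete_contrasts([-1, 0.5, 0.5], n_levels=3)
```

In this sketch the first column reproduces the normalized user contrast, while the sign of the auto-completed column depends on the QR routine.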

convert_coding_to_hypothesis¶

convert_coding_to_hypothesis(coding_matrix: NDArray[np.float64]) -> NDArray[np.float64]

Convert a coding matrix back to interpretable hypothesis contrasts.

Parameters:

Name	Type	Description	Default
`coding_matrix`	`NDArray[float64]`	Coding matrix of shape (n_levels, n_levels - 1).	required

Returns:

Type	Description
`NDArray[float64]`	Hypothesis matrix of shape (n_levels - 1, n_levels). Each row represents a contrast hypothesis (coefficients for factor levels).

Examples:

>>> cm = treatment_coding(['A', 'B', 'C'])
>>> convert_coding_to_hypothesis(cm)
array([[-1.,  1.,  0.],
       [-1.,  0.,  1.]])
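
One common way to recover hypothesis contrasts from a coding matrix is to prepend an intercept column, invert the resulting square matrix, and drop the intercept row. The sketch below assumes this approach (the source does not state which method is used) and uses a hypothetical helper name.

```python
import numpy as np

def coding_to_hypothesis(coding):
    """Hypothetical helper: invert [1 | coding] and drop the intercept row."""
    n_levels = coding.shape[0]
    full = np.column_stack([np.ones(n_levels), coding])  # (n_levels, n_levels)
    return np.linalg.inv(full)[1:]                       # (n_levels - 1, n_levels)

# Treatment coding for three levels with A as the reference.
treatment = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
H = coding_to_hypothesis(treatment)
```

For treatment coding this yields the rows B - A and C - A, in line with the documented output above.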

helmert_coding¶

helmert_coding(levels: list[str]) -> NDArray[np.float64]

Build Helmert contrast matrix.

Helmert coding compares each level to the mean of all previous levels. Column j contrasts level j+1 against the average of levels 0..j.

This is equivalent to R’s contr.helmert() (scaled to unit contrasts).

Matrix structure for 4 levels, shown transposed (rows are contrasts, columns are levels; the returned matrix has levels as rows)::

Contrast   | A       | B       | C       | D
-----------|---------|---------|---------|--------
B vs A     | -1/2    |  1/2    |  0      |  0
C vs A,B   | -1/3    | -1/3    |  2/3    |  0
D vs A,B,C | -1/4    | -1/4    | -1/4    |  3/4

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names. Must have >= 2 elements.	required

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> helmert_coding(['A', 'B', 'C'])
array([[-0.5       , -0.33333333],
       [ 0.5       , -0.33333333],
       [ 0.        ,  0.66666667]])
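
The table above translates into a short construction loop. The following is a sketch with a hypothetical function name, written only from the weights shown in the table.

```python
import numpy as np

def helmert_sketch(n_levels):
    """Hypothetical helper: level j+1 versus the mean of levels 0..j."""
    coding = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        coding[: j + 1, j] = -1.0 / (j + 2)     # earlier levels share the negative weight
        coding[j + 1, j] = (j + 1.0) / (j + 2)  # the level being compared
    return coding

helmert3 = helmert_sketch(3)
helmert4 = helmert_sketch(4)
```

Every column sums to zero, which is what lets the intercept represent the grand mean of the level means.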

helmert_coding_labels¶

helmert_coding_labels(levels: list[str]) -> list[str]

Get column labels for Helmert contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required

Returns:

Type	Description
`list[str]`	List of labels like ['B vs prev', 'C vs prev', ...].

Examples:

>>> helmert_coding_labels(['A', 'B', 'C'])
['B vs prev', 'C vs prev']

poly_coding¶

poly_coding(levels: list[str]) -> NDArray[np.float64]

Build orthogonal polynomial contrast matrix.

Polynomial coding creates orthogonal contrasts representing linear, quadratic, cubic, etc. trends across ordered factor levels. This is equivalent to R’s contr.poly() function.

The contrasts are orthonormal (orthogonal and unit length), making them suitable for testing polynomial trends in ordered categorical variables.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names. The order determines the polynomial evaluation points.	required

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Column 0 is linear (.L), column 1 is quadratic (.Q), etc.

Examples:

>>> poly_coding(['low', 'medium', 'high'])
array([[-0.70710678,  0.40824829],
       [ 0.        , -0.81649658],
       [ 0.70710678,  0.40824829]])

>>> poly_coding(['A', 'B', 'C', 'D'])
array([[-0.67082039,  0.5       , -0.2236068 ],
       [-0.2236068 , -0.5       ,  0.67082039],
       [ 0.2236068 , -0.5       , -0.67082039],
       [ 0.67082039,  0.5       ,  0.2236068 ]])

Note: Level names are not used in computation - only the count and order matter. The polynomial is evaluated at equally-spaced points 1, 2, ..., n.
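
Orthonormal polynomial contrasts of this kind are commonly obtained by orthogonalizing the powers of the centered evaluation points. The sketch below uses a QR decomposition of a Vandermonde matrix for this; the function name is hypothetical and the approach is an assumption about one reasonable construction, not a statement of how `poly_coding` is implemented.

```python
import numpy as np

def poly_sketch(n_levels):
    """Hypothetical helper: orthonormal polynomial contrasts on points 1..n."""
    x = np.arange(1, n_levels + 1, dtype=float)
    x -= x.mean()                                # center the evaluation points
    V = np.vander(x, n_levels, increasing=True)  # columns: x^0, x^1, ..., x^(n-1)
    Q, R = np.linalg.qr(V)
    Z = Q * np.diag(R)                           # undo the sign ambiguity of QR
    Z /= np.linalg.norm(Z, axis=0)               # unit-length columns
    return Z[:, 1:]                              # drop the constant column

poly3 = poly_sketch(3)
poly4 = poly_sketch(4)
```

Under these assumptions the result agrees with the 3-level and 4-level examples above.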

poly_coding_labels¶

poly_coding_labels(levels: list[str]) -> list[str]

Get column labels for polynomial contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required

Returns:

Type	Description
`list[str]`	List of polynomial degree labels: `['.L', '.Q', '.C', '^4', '^5', ...]`.

sequential_coding¶

sequential_coding(levels: list[str]) -> NDArray[np.float64]

Build sequential (successive differences) contrast matrix.

Sequential coding compares each level to the previous level, producing contrasts that capture successive differences. This is equivalent to R’s MASS::contr.sdif() function.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Column j represents the contrast: level[j+1] - level[j].

Examples:

>>> sequential_coding(['A', 'B', 'C'])
array([[-0.66666667, -0.33333333],
       [ 0.33333333, -0.33333333],
       [ 0.33333333,  0.66666667]])

>>> sequential_coding(['low', 'medium', 'high', 'very_high'])
array([[-0.75, -0.5 , -0.25],
       [ 0.25, -0.5 , -0.25],
       [ 0.25,  0.5 , -0.25],
       [ 0.25,  0.5 ,  0.75]])

With this coding, the coefficient for column j estimates the difference between the means of level j+1 and level j, so each coefficient reads directly as a successive difference.
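
A small numerical check makes the successive-difference interpretation concrete. The sketch below builds the matrix from its closed form and solves for the coefficients that reproduce a set of made-up level means; the function name and the means are hypothetical.

```python
import numpy as np

def sdif_sketch(n_levels):
    """Hypothetical helper: successive-difference coding in closed form."""
    n = n_levels
    i = np.arange(n)[:, None]      # row (level) index
    j = np.arange(n - 1)[None, :]  # column (contrast) index
    # Levels up to j get -(n - j - 1)/n, later levels get (j + 1)/n.
    return np.where(i <= j, -(n - j - 1) / n, (j + 1) / n)

coding = sdif_sketch(3)
mu = np.array([1.0, 3.0, 6.0])                  # made-up level means
design = np.column_stack([np.ones(3), coding])  # intercept + contrast columns
beta = np.linalg.solve(design, mu)              # beta[0] is the grand mean
```

Here `beta[1:]` equals `[2.0, 3.0]`, the differences B - A and C - B of the made-up means.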

sequential_coding_labels¶

sequential_coding_labels(levels: list[str]) -> list[str]

Get column labels for sequential contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required

Returns:

Type	Description
`list[str]`	List of successive difference labels like ['B-A', 'C-B', ...].

Examples:

>>> sequential_coding_labels(['A', 'B', 'C'])
['B-A', 'C-B']

>>> sequential_coding_labels(['low', 'medium', 'high'])
['medium-low', 'high-medium']

sum_coding¶

sum_coding(levels: list[str], omit: str | None = None) -> NDArray[np.float64]

Build sum (effects) contrast matrix.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required
`omit`	`str \| None`	Level to omit (gets -1s). Defaults to last level.	`None`

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> sum_coding(['A', 'B', 'C'])
array([[ 1.,  0.],
       [ 0.,  1.],
       [-1., -1.]])

>>> sum_coding(['A', 'B', 'C'], omit='A')
array([[-1., -1.],
       [ 1.,  0.],
       [ 0.,  1.]])
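
The structure of both examples (one-hot rows for kept levels, a row of -1s for the omitted level) can be sketched as follows. The function name is hypothetical and the code is only meant to show the pattern.

```python
import numpy as np

def sum_sketch(levels, omit=None):
    """Hypothetical helper: sum (effects) coding with a chosen omitted level."""
    idx = len(levels) - 1 if omit is None else levels.index(omit)
    coding = np.delete(np.eye(len(levels)), idx, axis=1)  # one-hot for kept levels
    coding[idx, :] = -1.0                                 # omitted level gets -1s
    return coding

default_coding = sum_sketch(["A", "B", "C"])
omit_a_coding = sum_sketch(["A", "B", "C"], omit="A")
```

Because each column sums to zero, the intercept in a model using this coding is the unweighted mean of the level means.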

sum_coding_labels¶

sum_coding_labels(levels: list[str], omit: str | None = None) -> list[str]

Get column labels for sum contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required
`omit`	`str \| None`	Level to omit. Defaults to last level.	`None`

Returns:

Type	Description
`list[str]`	List of non-omitted level names (column labels).

treatment_coding¶

treatment_coding(levels: list[str], reference: str | None = None) -> NDArray[np.float64]

Build treatment (dummy) contrast matrix.

Treatment coding sets the reference level to all zeros, and each other level gets a one-hot encoded row. This is the most common coding for regression models with an intercept.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required
`reference`	`str \| None`	Reference level name. Defaults to first level.	`None`

Returns:

Type	Description
`NDArray[float64]`	Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> treatment_coding(['A', 'B', 'C'])
array([[0., 0.],
       [1., 0.],
       [0., 1.]])

>>> treatment_coding(['A', 'B', 'C'], reference='B')
array([[1., 0.],
       [0., 0.],
       [0., 1.]])

treatment_coding_labels¶

treatment_coding_labels(levels: list[str], reference: str | None = None) -> list[str]

Get column labels for treatment contrast.

Parameters:

Name	Type	Description	Default
`levels`	`list[str]`	Ordered list of categorical level names.	required
`reference`	`str \| None`	Reference level name. Defaults to first level.	`None`

Returns:

Type	Description
`list[str]`	List of non-reference level names (column labels).

names¶

Design matrix column name parsing and variable type detection.

Classes:

Name	Description
`DesignColumnInfo`	Parsed design matrix column metadata.

Functions:

Name	Description
`extract_base_term`	Extract base term name from column name.
`extract_categorical_variables`	Find all categorical base variable names from design matrix columns.
`identify_column_type`	Identify column type from name (simplified version).
`parse_design_column_name`	Parse design matrix column name into components.

Classes¶

DesignColumnInfo¶

DesignColumnInfo(raw_name: str, base_term: str, level: str | None, column_type: Literal['intercept', 'continuous', 'categorical'], is_interaction: bool = False) -> None

Parsed design matrix column metadata.

Attributes:

Name	Type	Description
`raw_name`	`str`	Original column name (e.g., “treatment[A]”).
`base_term`	`str`	Base variable name without level (e.g., “treatment”).
`level`	`str \| None`	Level value for categorical, None for continuous (e.g., “A”).
`column_type`	`Literal['intercept', 'continuous', 'categorical']`	Type classification.
`is_interaction`	`bool`	Whether this is an interaction term.

Attributes¶

base_term¶

base_term: str

column_type¶

column_type: Literal['intercept', 'continuous', 'categorical']

is_interaction¶

is_interaction: bool = False

level¶

level: str | None

raw_name¶

raw_name: str

Functions¶

extract_base_term¶

extract_base_term(name: str) -> str

Extract base term name from column name.

For categorical variables, strips the level suffix. For interactions, extracts base terms without levels.

Parameters:

Name	Type	Description	Default
`name`	`str`	Column name from design matrix.	required

Returns:

Type	Description
`str`	Base term name.

Examples:

>>> extract_base_term("treatment[A]")
'treatment'
>>> extract_base_term("x")
'x'
>>> extract_base_term("treatment[A]:x")
'treatment:x'

extract_categorical_variables¶

extract_categorical_variables(X_names: tuple[str, ...] | list[str]) -> set[str]

Find all categorical base variable names from design matrix columns.

Scans column names for bracket patterns and extracts unique base terms.

Parameters:

Name	Type	Description	Default
`X_names`	`tuple[str, ...] \| list[str]`	Column names from design matrix.	required

Returns:

Type	Description
`set[str]`	Set of categorical variable names (base terms, no levels).

Examples:

>>> extract_categorical_variables(["Intercept", "x", "treatment[A]", "treatment[B]"])
{'treatment'}

identify_column_type¶

identify_column_type(name: str) -> Literal['intercept', 'continuous', 'categorical']

Identify column type from name (simplified version).

This is a lightweight alternative to parse_design_column_name() when only the type is needed.

Parameters:

Name	Type	Description	Default
`name`	`str`	Column name from design matrix.	required

Returns:

Type	Description
`Literal['intercept', 'continuous', 'categorical']`	Column type as string literal.

parse_design_column_name¶

parse_design_column_name(name: str) -> DesignColumnInfo

Parse design matrix column name into components.

Handles standard R/formula naming conventions:

“Intercept” → intercept type
“x” → continuous variable
“treatment[A]” → categorical dummy for level A
“x:z” → continuous interaction
“treatment[A]:x” → categorical × continuous interaction

Parameters:

Name	Type	Description	Default
`name`	`str`	Column name from design matrix.	required

Returns:

Type	Description
`DesignColumnInfo`	DesignColumnInfo with parsed components.

Examples:

>>> parse_design_column_name("Intercept")
DesignColumnInfo(raw_name='Intercept', base_term='Intercept',
                 level=None, column_type='intercept', is_interaction=False)

>>> parse_design_column_name("x")
DesignColumnInfo(raw_name='x', base_term='x',
                 level=None, column_type='continuous', is_interaction=False)

>>> parse_design_column_name("treatment[A]")
DesignColumnInfo(raw_name='treatment[A]', base_term='treatment',
                 level='A', column_type='categorical', is_interaction=False)

>>> parse_design_column_name("treatment[A]:x")
DesignColumnInfo(raw_name='treatment[A]:x', base_term='treatment:x',
                 level='A', column_type='categorical', is_interaction=True)
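
The naming conventions listed above are regular enough to parse with a single bracket pattern. The sketch below is an illustration only: `ColumnInfo` and `parse_column` are hypothetical stand-ins, and the handling of interactions (first bracketed level wins) is an assumption chosen to match the four documented examples.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ColumnInfo:  # hypothetical stand-in for DesignColumnInfo
    raw_name: str
    base_term: str
    level: Optional[str]
    column_type: str
    is_interaction: bool = False

_LEVEL = re.compile(r"\[([^\]]*)\]")  # matches a "[level]" suffix

def parse_column(name: str) -> ColumnInfo:
    if name == "Intercept":
        return ColumnInfo(name, name, None, "intercept")
    match = _LEVEL.search(name)          # first bracketed level, if any
    return ColumnInfo(
        raw_name=name,
        base_term=_LEVEL.sub("", name),  # strip every "[level]" suffix
        level=match.group(1) if match else None,
        column_type="categorical" if match else "continuous",
        is_interaction=":" in name,
    )

info = parse_column("treatment[A]:x")
```

A column such as "x:z" comes out as a continuous interaction with no level, matching the convention list.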

reference¶

Reference design matrix (X_ref) construction for marginal effects.

Functions:

Name	Description
`build_continuous_reference_matrix`	Build reference matrix for a continuous focal variable at specific values.
`build_counterfactual_design_matrices`	Build counterfactual design matrices for g-computation.
`build_reference_design_matrix`	Build design matrix for reference grid points.
`build_reference_row`	Build a single row of the reference design matrix.

Functions¶

build_continuous_reference_matrix¶

build_continuous_reference_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, at_values: tuple[float, ...], X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build reference matrix for a continuous focal variable at specific values.

Creates one row per value in at_values. Each row has covariates at their means except the focal variable, which is set to the specified value.

For interaction columns involving the focal variable, the interaction value is properly computed as a product of component values at each at_value.

Parameters:

Name	Type	Description	Default
`X_names`	`tuple[str, ...] \| list[str]`	Column names from the design matrix.	required
`focal_var`	`str`	Name of the continuous focal variable.	required
`at_values`	`tuple[float, ...]`	Specific values to evaluate the focal variable at.	required
`X_means`	`ndarray`	Column means of the original design matrix, shape (p,).	required
`set_categoricals`	`dict[str, str] \| None`	Optional dict mapping non-focal categorical variable names to specific levels for indicator encoding instead of marginalizing at column means.	`None`

Returns:

Type	Description
`ndarray`	Reference design matrix X_ref, shape (n_values, p).
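
The row construction described above can be sketched as follows. This simplified version is an assumption-laden illustration: the function name is hypothetical, `set_categoricals` is not handled, and every non-focal component of an interaction is assumed to appear as its own column in `X_names`.

```python
import numpy as np

def continuous_reference_sketch(X_names, focal_var, at_values, X_means):
    """Hypothetical helper: one reference row per focal value."""
    names = list(X_names)
    X_ref = np.tile(np.asarray(X_means, dtype=float), (len(at_values), 1))
    for row, value in zip(X_ref, at_values):
        for col, name in enumerate(names):
            parts = name.split(":")
            if focal_var not in parts:
                continue  # non-focal columns stay at their means
            # Product of component values: the focal value times the mean of
            # every other component column.
            row[col] = np.prod(
                [value if p == focal_var else X_means[names.index(p)] for p in parts]
            )
    return X_ref

X_names = ("Intercept", "x", "z", "x:z")
X_means = np.array([1.0, 2.0, 3.0, 6.5])
X_ref = continuous_reference_sketch(X_names, "x", (0.0, 1.0), X_means)
```

Note that the "x:z" entry is recomputed as a product rather than copied from its column mean (6.5), which is the point of the interaction handling described above.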

build_counterfactual_design_matrices¶

build_counterfactual_design_matrices(X: np.ndarray, X_names: tuple[str, ...] | list[str], focal_var: str, levels: list[str]) -> list[np.ndarray]

Build counterfactual design matrices for g-computation.

For each focal level, creates a modified copy of the full design matrix X where the focal variable’s indicator columns are set to match that level and interaction columns involving the focal variable are recomputed from component values. Non-focal columns are left unchanged, preserving the observed covariate distribution.

This is the core building block for weights="observed" (g-computation / counterfactual prediction). Each returned matrix answers: “What would the design matrix look like if every observation were assigned to this focal level?”

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Original design matrix, shape `(N, p)`.	required
`X_names`	`tuple[str, ...] \| list[str]`	Column names from the design matrix, in order.	required
`focal_var`	`str`	Name of the categorical focal variable.	required
`levels`	`list[str]`	List of levels to compute counterfactuals for.	required

Returns:

Type	Description
`list[ndarray]`	List of counterfactual design matrices (one per level), each shape `(N, p)`. Order matches `levels`.

Examples:

For a model y ~ treatment + x + treatment:x with X_names = ("Intercept", "x", "treatment[B]", "x:treatment[B]")::

mats = build_counterfactual_design_matrices(X, X_names, "treatment", ["ref", "B"])
# mats[0]: treatment set to ref for all rows (treatment[B]=0, x:treatment[B]=0)
# mats[1]: treatment set to B for all rows (treatment[B]=1, x:treatment[B]=x_i)
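
To make the counterfactual construction concrete, here is a minimal sketch that follows the description above: focal indicator columns are overwritten, interaction columns involving the focal variable are recomputed as products, and everything else is left as observed. The function name is hypothetical, and non-focal interaction components are assumed to exist as their own columns.

```python
import numpy as np

def counterfactual_sketch(X, X_names, focal_var, levels):
    """Hypothetical helper: one modified copy of X per focal level."""
    names = list(X_names)
    prefix = focal_var + "["
    mats = []
    for level in levels:
        X_cf = X.astype(float)  # astype returns a fresh copy
        for col, name in enumerate(names):
            parts = name.split(":")
            if not any(p.startswith(prefix) for p in parts):
                continue  # non-focal columns keep observed values
            value = np.ones(X.shape[0])
            for p in parts:
                if p.startswith(prefix):
                    value = value * float(p == f"{focal_var}[{level}]")  # focal indicator
                else:
                    value = value * X[:, names.index(p)]                 # observed component
            X_cf[:, col] = value
        mats.append(X_cf)
    return mats

X_names = ("Intercept", "x", "treatment[B]", "x:treatment[B]")
X = np.array([[1.0, 2.0, 0.0, 0.0], [1.0, 3.0, 1.0, 3.0]])
mats = counterfactual_sketch(X, X_names, "treatment", ["ref", "B"])
```

Averaging the model predictions over the rows of each matrix then gives one g-computation estimate per level.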

build_reference_design_matrix¶

build_reference_design_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, levels: list[str], X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build design matrix for reference grid points.

Reference value conventions:

Intercept: 1.0
Focal variable dummies: 1.0 if matching level, 0.0 otherwise
Continuous covariates: column mean from X_means
Non-focal categorical dummies: column mean (marginalize over observed proportions)

Parameters:

Name	Type	Description	Default
`X_names`	`tuple[str, ...] \| list[str]`	Column names from the design matrix, in order.	required
`focal_var`	`str`	Name of the categorical variable to vary across levels.	required
`levels`	`list[str]`	List of levels for the focal variable, defining row order.	required
`X_means`	`ndarray`	Column means of the original design matrix, shape (p,). Used for continuous covariate reference values.	required
`set_categoricals`	`dict[str, str] \| None`	Optional dict mapping non-focal categorical variable names to specific levels to pin them at (instead of marginalizing at X_means). E.g. `{"Ethnicity": "Asian"}` sets the Ethnicity dummies to indicator values for “Asian”.	`None`

Returns:

Type	Description
`ndarray`	Reference design matrix X_ref, shape (n_levels, p).

Examples:

Compute X_ref for treatment EMMs::

X_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
X_means = np.array([1.0, 2.5, 0.33, 0.33])  # means from data
levels = ["ref", "A", "B"]

X_ref = build_reference_design_matrix(X_names, "treatment", levels, X_means)
# X_ref[0] = [1.0, 2.5, 0.0, 0.0]  # reference level
# X_ref[1] = [1.0, 2.5, 1.0, 0.0]  # level A
# X_ref[2] = [1.0, 2.5, 0.0, 1.0]  # level B

Note: The first level is typically the reference level (omitted from dummy coding), so its row has all 0s for the focal variable dummies.
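
The reference value conventions can be sketched in a few lines. This simplified version uses a hypothetical function name, ignores `set_categoricals`, and leaves interaction columns at their means, so it should be read as an illustration of the conventions rather than the actual implementation.

```python
import numpy as np

def reference_matrix_sketch(X_names, focal_var, levels, X_means):
    """Hypothetical helper: one reference row per focal level."""
    # Start every row at the column means (continuous and non-focal categorical columns).
    X_ref = np.tile(np.asarray(X_means, dtype=float), (len(levels), 1))
    for col, name in enumerate(X_names):
        if name == "Intercept":
            X_ref[:, col] = 1.0
        elif name.startswith(focal_var + "[") and ":" not in name:
            # Indicator: 1.0 on the row whose level matches this dummy column.
            X_ref[:, col] = [float(name == f"{focal_var}[{lvl}]") for lvl in levels]
    return X_ref

X_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
X_means = np.array([1.0, 2.5, 0.33, 0.33])
X_ref = reference_matrix_sketch(X_names, "treatment", ["ref", "A", "B"], X_means)
```

Multiplying `X_ref` by the fitted coefficient vector would then give one estimated marginal mean per row.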

build_reference_row¶

build_reference_row(X_names: tuple[str, ...] | list[str], focal_var: str, focal_level: str, X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build a single row of the reference design matrix.

Creates one reference point where the focal variable is set to the specified level and other covariates are at reference values.

Parameters:

Name	Type	Description	Default
`X_names`	`tuple[str, ...] \| list[str]`	Column names from the design matrix.	required
`focal_var`	`str`	Name of the focal categorical variable.	required
`focal_level`	`str`	Level value to set for the focal variable.	required
`X_means`	`ndarray`	Column means for continuous covariate reference values.	required
`set_categoricals`	`dict[str, str] \| None`	Optional dict mapping non-focal categorical variable names to specific levels for indicator encoding. When a non-focal categorical’s `base_term` matches a key, the dummy is set to `1.0` if the level matches, `0.0` otherwise (instead of using the column mean for marginalization).	`None`

Returns:

Type	Description
`ndarray`	Reference row, shape (p,).

z_matrix¶

Sparse Z matrix (random effects design matrix) construction.

Classes:

Name	Description
`RandomEffectsInfo`	Complete random effects specification for lmer/glmer.

Functions:

Name	Description
`build_random_effects`	Build complete random effects specification.
`build_z_crossed`	Build Z matrix for crossed random effects.
`build_z_nested`	Build Z matrix for nested random effects.
`build_z_simple`	Build Z matrix for single grouping factor.

Classes¶

RandomEffectsInfo¶

RandomEffectsInfo(Z: sp.csc_matrix, group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, re_structures_list: list[str] | None = None, re_dims_list: list[int] | None = None, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, column_labels: list[str] = list(), term_permutation: NDArray[np.intp] | None = None) -> None

Complete random effects specification for lmer/glmer.

This container holds the Z matrix and all metadata needed for downstream operations (Lambda building, initialization, results).

Attributes:

Name	Type	Description
`Z`	`csc_matrix`	Sparse random effects design matrix, shape (n, q).
`group_ids_list`	`list[NDArray[intp]]`	Group ID arrays for each factor.
`n_groups_list`	`list[int]`	Number of groups per factor.
`group_names`	`list[str]`	Names of grouping factors.
`random_names`	`list[str]`	Names of random effect terms.
`re_structure`	`str`	Overall structure type (intercept/slope/diagonal/nested/crossed).
`re_structures_list`	`list[str] \| None`	Per-factor structure types (for mixed).
`X_re`	`NDArray[float64] \| list[NDArray[float64]] \| None`	Random effects covariates (for slopes).
`column_labels`	`list[str]`	Z column names for output.
`term_permutation`	`NDArray[intp] \| None`	Block ordering permutation indices.

Attributes¶

X_re¶

X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None

Z¶

Z: sp.csc_matrix

column_labels¶

column_labels: list[str] = field(default_factory=list)

group_ids_list¶

group_ids_list: list[NDArray[np.intp]]

group_names¶

group_names: list[str]

n_groups_list¶

n_groups_list: list[int]

random_names¶

random_names: list[str]

re_dims_list¶

re_dims_list: list[int] | None = None

re_structure¶

re_structure: str

re_structures_list¶

re_structures_list: list[str] | None = None

term_permutation¶

term_permutation: NDArray[np.intp] | None = None

Functions¶

build_random_effects¶

build_random_effects(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, re_structures_list: list[str] | None = None, group_levels_list: list[list[str]] | None = None, term_permutation: NDArray[np.intp] | None = None) -> RandomEffectsInfo

Build complete random effects specification.

High-level function that constructs the Z matrix and packages all metadata into a RandomEffectsInfo container ready for lmer/glmer consumption.

Parameters:

Name	Type	Description	Default
`group_ids_list`	`list[NDArray[intp]]`	Group ID arrays for each factor.	required
`n_groups_list`	`list[int]`	Number of groups per factor.	required
`group_names`	`list[str]`	Names of grouping factors.	required
`random_names`	`list[str]`	Names of random effect terms.	required
`re_structure`	`str`	Overall structure type, one of: “intercept” (random intercept only), “slope” (correlated intercept + slopes), “diagonal” (uncorrelated intercept + slopes), “nested” (nested hierarchy), “crossed” (crossed factors).	required
`X_re`	`NDArray[float64] \| list[NDArray[float64]] \| None`	Random effects covariates (for slopes).	`None`
`re_structures_list`	`list[str] \| None`	Per-factor structure (for mixed).	`None`
`group_levels_list`	`list[list[str]] \| None`	Level names per factor (for labels).	`None`
`term_permutation`	`NDArray[intp] \| None`	Block ordering permutation.	`None`

Returns:

Type	Description
`RandomEffectsInfo`	RandomEffectsInfo with Z matrix and all metadata.

Examples:

>>> # (Days|Subject) with 18 subjects
>>> group_ids = np.arange(180) // 10
>>> n_groups = 18
>>> X_re = np.column_stack([np.ones(180), np.tile(np.arange(10), 18)])
>>> info = build_random_effects(
...     group_ids_list=[group_ids],
...     n_groups_list=[n_groups],
...     group_names=["Subject"],
...     random_names=["Intercept", "Days"],
...     re_structure="slope",
...     X_re=X_re,
... )
>>> info.Z.shape
(180, 36)  # 18 subjects * 2 RE

build_z_crossed¶

build_z_crossed(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None, layouts: list[str] | None = None) -> sp.csc_matrix

Build Z matrix for crossed random effects.

Crossed effects like (1|subject) + (1|item) create independent random effects for each factor. The Z matrix is a horizontal concatenation of Z matrices for each factor.

Parameters:

Name	Type	Description	Default
`group_ids_list`	`list[NDArray[intp]]`	List of group ID arrays, one per factor.	required
`n_groups_list`	`list[int]`	Number of groups per factor.	required
`X_re_list`	`list[NDArray[float64] \| None] \| None`	Random effects design per factor, or None for intercepts.	`None`
`layouts`	`list[str] \| None`	Layout per factor. Default: interleaved for all.	`None`

Returns:

Type	Description
`csc_matrix`	Sparse Z matrix, shape (n, sum(n_groups_i * n_re_i)).

Examples:

>>> # (1|subject) + (1|item) with 3 subjects, 4 items
>>> subj_ids = np.array([0, 1, 2, 0, 1, 2])
>>> item_ids = np.array([0, 0, 0, 1, 1, 1])
>>> Z = build_z_crossed(
...     group_ids_list=[subj_ids, item_ids],
...     n_groups_list=[3, 4]
... )
>>> Z.shape
(6, 7)  # 3 subject + 4 item columns
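
The horizontal concatenation can be illustrated with dense NumPy arrays. This sketch uses a hypothetical `indicator` helper and dense storage for readability; the documented function returns a sparse CSC matrix instead.

```python
import numpy as np

def indicator(group_ids, n_groups):
    """Dense 0/1 indicator matrix for one grouping factor (hypothetical helper)."""
    Z = np.zeros((len(group_ids), n_groups))
    Z[np.arange(len(group_ids)), group_ids] = 1.0
    return Z

subj_ids = np.array([0, 1, 2, 0, 1, 2])
item_ids = np.array([0, 0, 0, 1, 1, 1])
# One block of columns per factor, placed side by side.
Z = np.hstack([indicator(subj_ids, 3), indicator(item_ids, 4)])
```

Each row has exactly two nonzero entries, one in the subject block and one in the item block; item columns for groups that never occur in the data stay all zero.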

build_z_nested¶

build_z_nested(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None) -> sp.csc_matrix

Build Z matrix for nested random effects.

Nested effects like (1|school/class) create separate random intercepts for each level of the hierarchy. The Z matrix is a horizontal concatenation of Z matrices for each level.

Parameters:

Name	Type	Description	Default
`group_ids_list`	`list[NDArray[intp]]`	List of group ID arrays, ordered [inner, ..., outer]. For (1\|school/class): [class_ids, school_ids].	required
`n_groups_list`	`list[int]`	Number of groups at each level.	required
`X_re_list`	`list[NDArray[float64] \| None] \| None`	Random effects design per level, or None for intercepts.	`None`

Returns:

Type	Description
`csc_matrix`	Sparse Z matrix, shape (n, sum(n_groups_i * n_re_i)).

Examples:

>>> # (1|school/class) with 2 schools, 4 classes total
>>> class_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
>>> school_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
>>> Z = build_z_nested(
...     group_ids_list=[class_ids, school_ids],
...     n_groups_list=[4, 2]
... )
>>> Z.shape
(8, 6)  # 4 class columns + 2 school columns

build_z_simple¶

build_z_simple(group_ids: NDArray[np.intp], n_groups: int, X_re: NDArray[np.float64] | None = None, layout: Literal['interleaved', 'blocked'] = 'interleaved') -> sp.csc_matrix

Build Z matrix for single grouping factor.

Constructs Z directly in sparse COO format without dense intermediates. For large-scale data (e.g., InstEval with 73k obs x 4k groups), this uses O(n x n_re) memory instead of O(n x n_groups x n_re).

Handles intercept-only, correlated slopes, and uncorrelated slopes by varying the X_re input and layout parameter.

Parameters:

Name	Type	Description	Default
`group_ids`	`NDArray[intp]`	Array of group assignments, shape (n,), values 0..n_groups-1.	required
`n_groups`	`int`	Total number of groups.	required
`X_re`	`NDArray[float64] \| None`	Random effects design matrix, shape (n, n_re). None or a single column of 1s gives an intercept only; multiple columns give intercept + slopes.	`None`
`layout`	`Literal['interleaved', 'blocked']`	Column ordering: “interleaved” is [g1_int, g1_slope, g2_int, g2_slope, ...]; “blocked” is [g1_int, g2_int, ..., g1_slope, g2_slope, ...].	`'interleaved'`

Returns:

Type	Description
`csc_matrix`	Sparse Z matrix in CSC format, shape (n, n_groups * n_re).

Examples:

>>> # Random intercept only
>>> group_ids = np.array([0, 0, 1, 1])
>>> Z = build_z_simple(group_ids, n_groups=2)
>>> Z.shape
(4, 2)

>>> # Random intercept + slope (correlated)
>>> X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="interleaved")
>>> Z.shape
(4, 4)

>>> # Uncorrelated random effects
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="blocked")
>>> Z.shape
(4, 4)
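
The two layouts differ only in how a (group, effect) pair maps to a column index. The sketch below computes the (row, column, value) triplets that a COO-style construction would need, n * n_re entries in total, and fills a dense array purely for inspection. The function name is hypothetical and the dense array is a stand-in for the sparse matrix the documented function returns.

```python
import numpy as np

def z_triplets(group_ids, n_groups, X_re, layout="interleaved"):
    """Hypothetical helper: COO-style triplets for a single grouping factor."""
    n, n_re = X_re.shape
    rows = np.repeat(np.arange(n), n_re)  # each observation contributes n_re entries
    k = np.tile(np.arange(n_re), n)       # random-effect index of each entry
    g = np.repeat(group_ids, n_re)        # group of each entry
    cols = g * n_re + k if layout == "interleaved" else k * n_groups + g
    return rows, cols, X_re.ravel()

group_ids = np.array([0, 0, 1, 1])
X_re = np.column_stack([np.ones(4), [1.0, 2.0, 1.0, 2.0]])
Z = {}
for layout in ("interleaved", "blocked"):
    rows, cols, vals = z_triplets(group_ids, 2, X_re, layout)
    Z[layout] = np.zeros((4, 4))
    Z[layout][rows, cols] = vals          # dense stand-in for a sparse COO build
```

The same triplets could be handed to a sparse constructor such as `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n, n_groups * n_re))` and converted to CSC.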