Design matrix construction — coding, naming, reference grids, random effects.
Call chain:
formula.build_design_matrices() -> treatment_coding() / sum_coding() / ... (categorical columns)
marginal.build_reference_grid() -> build_reference_row() (EMM reference grids)
formula.build_random_effects_from_spec() -> build_z_simple() / build_z_nested() / build_z_crossed()
Classes:
| Name | Description |
|---|---|
DesignColumnInfo | Parsed design matrix column metadata. |
RandomEffectsInfo | Complete random effects specification for lmer/glmer. |
Functions:
| Name | Description |
|---|---|
array_to_coding_matrix | Convert user-specified contrasts to a coding matrix for design matrices. |
build_random_effects | Build complete random effects specification. |
build_reference_design_matrix | Build design matrix for reference grid points. |
build_reference_row | Build a single row of the reference design matrix. |
build_slope_reference_matrix | Build reference matrices for computing marginal slopes. |
build_z_crossed | Build Z matrix for crossed random effects. |
build_z_nested | Build Z matrix for nested random effects. |
build_z_simple | Build Z matrix for single grouping factor. |
convert_coding_to_hypothesis | Convert a coding matrix back to interpretable hypothesis contrasts. |
extract_base_term | Extract base term name from column name. |
extract_categorical_variables | Find all categorical base variable names from design matrix columns. |
extract_level_from_column | Extract level value for a specific focal variable from column name. |
helmert_coding | Build Helmert contrast matrix. |
helmert_coding_labels | Get column labels for Helmert contrast. |
identify_column_type | Identify column type from name (simplified version). |
parse_design_column_name | Parse design matrix column name into components. |
poly_coding | Build orthogonal polynomial contrast matrix. |
poly_coding_labels | Get column labels for polynomial contrast. |
sequential_coding | Build sequential (successive differences) contrast matrix. |
sequential_coding_labels | Get column labels for sequential contrast. |
sum_coding | Build sum (effects) contrast matrix. |
sum_coding_labels | Get column labels for sum contrast. |
treatment_coding | Build treatment (dummy) contrast matrix. |
treatment_coding_labels | Get column labels for treatment contrast. |
Modules:
| Name | Description |
|---|---|
coding | Contrast matrix builders for categorical variable encoding. |
names | Design matrix column name parsing and variable type detection. |
reference | Reference design matrix (X_ref) construction for marginal effects. |
z_matrix | Sparse Z matrix (random effects design matrix) construction. |
Classes¶
DesignColumnInfo¶
DesignColumnInfo(raw_name: str, base_term: str, level: str | None, column_type: Literal['intercept', 'continuous', 'categorical'], is_interaction: bool = False) -> None
Parsed design matrix column metadata.
Attributes:
| Name | Type | Description |
|---|---|---|
raw_name | str | Original column name (e.g., “treatment[A]”). |
base_term | str | Base variable name without level (e.g., “treatment”). |
level | str | None | Level value for categorical, None for continuous (e.g., “A”). |
column_type | Literal['intercept', 'continuous', 'categorical'] | Type classification. |
is_interaction | bool | Whether this is an interaction term. |
Attributes¶
base_term¶
base_term: str
column_type¶
column_type: Literal['intercept', 'continuous', 'categorical']
is_interaction¶
is_interaction: bool = False
level¶
level: str | None
raw_name¶
raw_name: str
RandomEffectsInfo¶
RandomEffectsInfo(Z: sp.csc_matrix, group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, re_structures_list: list[str] | None = None, re_dims_list: list[int] | None = None, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, column_labels: list[str] = list(), term_permutation: NDArray[np.intp] | None = None) -> None
Complete random effects specification for lmer/glmer.
This container holds the Z matrix and all metadata needed for downstream operations (Lambda building, initialization, results).
Attributes:
| Name | Type | Description |
|---|---|---|
Z | csc_matrix | Sparse random effects design matrix, shape (n, q). |
group_ids_list | list[NDArray[intp]] | Group ID arrays for each factor. |
n_groups_list | list[int] | Number of groups per factor. |
group_names | list[str] | Names of grouping factors. |
random_names | list[str] | Names of random effect terms. |
re_structure | str | Overall structure type (intercept/slope/diagonal/nested/crossed). |
re_structures_list | list[str] | None | Per-factor structure types (for mixed). |
X_re | NDArray[float64] | list[NDArray[float64]] | None | Random effects covariates (for slopes). |
column_labels | list[str] | Z column names for output. |
term_permutation | NDArray[intp] | None | Block ordering permutation indices. |
Attributes¶
X_re¶
X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None
Z¶
Z: sp.csc_matrix
column_labels¶
column_labels: list[str] = field(default_factory=list)
group_ids_list¶
group_ids_list: list[NDArray[np.intp]]
group_names¶
group_names: list[str]
n_groups_list¶
n_groups_list: list[int]
random_names¶
random_names: list[str]
re_dims_list¶
re_dims_list: list[int] | None = None
re_structure¶
re_structure: str
re_structures_list¶
re_structures_list: list[str] | None = None
term_permutation¶
term_permutation: NDArray[np.intp] | None = None
Functions¶
array_to_coding_matrix¶
array_to_coding_matrix(contrasts: NDArray[np.floating] | list[float] | list[list[float]], n_levels: int, *, normalize: bool = True) -> NDArray[np.float64]
Convert user-specified contrasts to a coding matrix for design matrices.
This function converts “human-readable” contrast specifications (where each row represents a hypothesis like “A vs average(B, C)”) into a coding matrix suitable for use in regression design matrices.
The algorithm uses QR decomposition to auto-complete under-specified contrasts with orthogonal contrasts, following the approach from R’s gmodels::make.contrasts() and pymer4’s con2R().
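The completion step can be illustrated with plain NumPy. This is a minimal sketch of the general idea only, not the library's implementation: `complete_contrasts` is a hypothetical helper, and it assumes the supplied contrast is nonzero and sums to zero.

```python
import numpy as np


def complete_contrasts(contrast):
    """Hypothetical helper: extra contrast columns orthogonal to `contrast`."""
    c = np.asarray(contrast, dtype=float)
    n_levels = c.shape[0]
    # Stack the intercept direction and the user contrast as columns.
    basis = np.column_stack([np.ones(n_levels), c])
    # The complete QR factorization extends these two columns to an
    # orthonormal basis of the whole n_levels-dimensional space.
    q, _ = np.linalg.qr(basis, mode="complete")
    # Columns after the first two are orthogonal to both inputs, so they
    # sum to zero and are orthogonal to the user contrast.
    return q[:, 2:]


user_contrast = np.array([-1.0, 0.5, 0.5, 0.0])
extra = complete_contrasts(user_contrast)
```

With four levels and one user contrast, `extra` holds the two remaining columns needed to reach `n_levels - 1` contrasts.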
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
contrasts | NDArray[floating] | list[float] | list[list[float]] | User-specified contrasts as: - 1D array/list: Single contrast vector of length n_levels - 2D array/list: Multiple contrasts, shape (n_contrasts, n_levels) Each row sums to zero for valid contrasts. | required |
n_levels | int | Number of factor levels. Must match contrast dimensions. | required |
normalize | bool | If True, normalize each contrast vector by its L2 norm before conversion. This puts contrasts in standard-deviation units similar to orthogonal polynomial contrasts. | True |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Coding matrix of shape (n_levels, n_levels - 1). Each row corresponds to a factor level, each column to a design matrix column. |
Examples:
>>> # Single contrast: A vs average(B, C)
>>> array_to_coding_matrix([-1, 0.5, 0.5], n_levels=3)
array([[-0.81649658, 0. ],
[ 0.40824829, -0.70710678],
[ 0.40824829, 0.70710678]])
>>> # Multiple contrasts: A vs B, and (A,B) vs C
>>> array_to_coding_matrix([[-1, 1, 0], [-0.5, -0.5, 1]], n_levels=3)
array([[-0.5 , -0.28867513],
[ 0.5 , -0.28867513],
[ 0. , 0.57735027]])
Note: The returned matrix has n_levels-1 columns because one degree of freedom is absorbed by the intercept. If you specify fewer than n_levels-1 contrasts, the remaining columns are auto-completed with orthogonal contrasts via QR decomposition.
build_random_effects¶
build_random_effects(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, re_structures_list: list[str] | None = None, group_levels_list: list[list[str]] | None = None, term_permutation: NDArray[np.intp] | None = None) -> RandomEffectsInfo
Build complete random effects specification.
High-level function that constructs the Z matrix and packages all metadata into a RandomEffectsInfo container ready for lmer/glmer consumption.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
group_ids_list | list[NDArray[intp]] | Group ID arrays for each factor. | required |
n_groups_list | list[int] | Number of groups per factor. | required |
group_names | list[str] | Names of grouping factors. | required |
random_names | list[str] | Names of random effect terms. | required |
re_structure | str | Overall structure type: - “intercept”: random intercept only - “slope”: correlated intercept + slopes - “diagonal”: uncorrelated intercept + slopes - “nested”: nested hierarchy - “crossed”: crossed factors | required |
X_re | NDArray[float64] | list[NDArray[float64]] | None | Random effects covariates (for slopes). | None |
re_structures_list | list[str] | None | Per-factor structure (for mixed). | None |
group_levels_list | list[list[str]] | None | Level names per factor (for labels). | None |
term_permutation | NDArray[intp] | None | Block ordering permutation. | None |
Returns:
| Type | Description |
|---|---|
RandomEffectsInfo | RandomEffectsInfo with Z matrix and all metadata. |
Examples:
>>> # (Days|Subject) with 18 subjects
>>> group_ids = np.arange(180) // 10
>>> n_groups = 18
>>> X_re = np.column_stack([np.ones(180), np.tile(np.arange(10), 18)])
>>> info = build_random_effects(
... group_ids_list=[group_ids],
... n_groups_list=[n_groups],
... group_names=["Subject"],
... random_names=["Intercept", "Days"],
... re_structure="slope",
... X_re=X_re,
... )
>>> info.Z.shape
(180, 36) # 18 subjects * 2 RE
build_reference_design_matrix¶
build_reference_design_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, levels: list[str], X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray
Build design matrix for reference grid points.
Creates an X_ref matrix with one row per focal variable level. Each row represents a reference point where the focal variable is set to that level and all other covariates are set to their reference values.
Reference value conventions:
Intercept: 1.0
Focal variable dummies: 1.0 if matching level, 0.0 otherwise
Continuous covariates: column mean from X_means
Non-focal categorical dummies: column mean (marginalize over observed proportions)
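The conventions above can be sketched in a few lines. This is an illustrative stand-in rather than the library code: `reference_row` is a hypothetical helper, it assumes `name[level]` column naming, and it ignores interaction columns and `set_categoricals`.

```python
import re


def reference_row(x_names, focal_var, focal_level, x_means):
    """Hypothetical helper: one reference row for main-effect columns only."""
    row = []
    for name, mean in zip(x_names, x_means):
        match = re.fullmatch(r"(.+?)\[(.*)\]", name)
        if name == "Intercept":
            row.append(1.0)
        elif match and match.group(1) == focal_var:
            # Focal dummy: indicator for the requested level.
            row.append(1.0 if match.group(2) == focal_level else 0.0)
        else:
            # Continuous covariate or non-focal dummy: use the column mean.
            row.append(float(mean))
    return row


x_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
x_means = [1.0, 2.5, 0.33, 0.33]
grid = [reference_row(x_names, "treatment", lvl, x_means) for lvl in ["ref", "A", "B"]]
```

Each entry of `grid` is one row of the reference design matrix, in the order of the requested levels.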
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X_names | tuple[str, ...] | list[str] | Column names from the design matrix, in order. | required |
focal_var | str | Name of the categorical variable to vary across levels. | required |
levels | list[str] | List of levels for the focal variable, defining row order. | required |
X_means | ndarray | Column means of the original design matrix, shape (p,). Used for continuous covariate reference values. | required |
set_categoricals | dict[str, str] | None | Optional dict mapping non-focal categorical variable names to specific levels to pin them at (instead of marginalizing at X_means). E.g. {"Ethnicity": "Asian"} sets the Ethnicity dummies to indicator values for “Asian”. | None |
Returns:
| Type | Description |
|---|---|
ndarray | Reference design matrix X_ref, shape (n_levels, p). |
Examples:
Compute X_ref for treatment EMMs::
X_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
X_means = np.array([1.0, 2.5, 0.33, 0.33]) # means from data
levels = ["ref", "A", "B"]
X_ref = build_reference_design_matrix(X_names, "treatment", levels, X_means)
# X_ref[0] = [1.0, 2.5, 0.0, 0.0] # reference level
# X_ref[1] = [1.0, 2.5, 1.0, 0.0] # level A
# X_ref[2] = [1.0, 2.5, 0.0, 1.0] # level B
Note: The first level is typically the reference level (omitted from dummy coding), so its row has all 0s for the focal variable dummies.
build_reference_row¶
build_reference_row(X_names: tuple[str, ...] | list[str], focal_var: str, focal_level: str, X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray
Build a single row of the reference design matrix.
Creates one reference point where the focal variable is set to the specified level and other covariates are at reference values.
For interaction columns involving the focal variable (e.g.,
Income:Student[Yes] when focal_var="Student"), the value is
computed as the product of component values rather than using the
empirical mean of the interaction column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X_names | tuple[str, ...] | list[str] | Column names from the design matrix. | required |
focal_var | str | Name of the focal categorical variable. | required |
focal_level | str | Level value to set for the focal variable. | required |
X_means | ndarray | Column means for continuous covariate reference values. | required |
set_categoricals | dict[str, str] | None | Optional dict mapping non-focal categorical variable names to specific levels for indicator encoding. When a non-focal categorical’s base_term matches a key, the dummy is set to 1.0 if the level matches, 0.0 otherwise (instead of using the column mean for marginalization). | None |
Returns:
| Type | Description |
|---|---|
ndarray | Reference row, shape (p,). |
build_slope_reference_matrix¶
build_slope_reference_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, X_means: np.ndarray, *, delta: float = 1.0) -> tuple[np.ndarray, np.ndarray]
Build reference matrices for computing marginal slopes.
Creates two reference rows: one at the mean and one at mean + delta for the focal continuous variable. The slope is then (y1 - y0) / delta.
For interaction columns involving the focal variable (e.g.,
x:z when focal_var="x"), the interaction value is properly
computed as a product of component values, so the perturbed row
reflects the interaction contribution to the slope.
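A small worked example may make the interaction handling concrete. The coefficients, means, and the `row_at` helper below are all hypothetical; the sketch only shows why the interaction column must be recomputed as a product when the focal variable is perturbed.

```python
import numpy as np

x_names = ["Intercept", "x", "z", "x:z"]
x_means = {"x": 2.0, "z": 3.0}          # hypothetical covariate means
beta = np.array([1.0, 0.5, -0.2, 0.1])  # hypothetical fitted coefficients
delta = 1.0


def row_at(x_value):
    """Build one reference row with the focal variable x set to x_value."""
    values = {"Intercept": 1.0, "x": x_value, "z": x_means["z"]}
    # The interaction is the product of its components, not the
    # empirical mean of the x:z column.
    values["x:z"] = values["x"] * values["z"]
    return np.array([values[name] for name in x_names])


x_ref_0 = row_at(x_means["x"])
x_ref_1 = row_at(x_means["x"] + delta)
slope = (x_ref_1 @ beta - x_ref_0 @ beta) / delta
```

Here the slope equals the main effect of `x` plus the interaction coefficient times the mean of `z`, that is 0.5 + 0.1 * 3.0.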
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X_names | tuple[str, ...] | list[str] | Column names from the design matrix. | required |
focal_var | str | Name of the continuous variable for slope computation. | required |
X_means | ndarray | Column means of the original design matrix. | required |
delta | float | Step size for numerical differentiation (default 1.0). | 1.0 |
Returns:
| Type | Description |
|---|---|
tuple[ndarray, ndarray] | Tuple of (X_ref_0, X_ref_1), where X_ref_0 is the reference point at the mean of focal_var and X_ref_1 is the reference point at mean + delta of focal_var. |
build_z_crossed¶
build_z_crossed(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None, layouts: list[str] | None = None) -> sp.csc_matrix
Build Z matrix for crossed random effects.
Crossed effects like (1|subject) + (1|item) create independent random effects for each factor. The Z matrix is a horizontal concatenation of Z matrices for each factor.
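The concatenation can be sketched with dense NumPy arrays for the intercept-only case. This is for illustration only (the function itself returns a sparse matrix), and `indicator` is a hypothetical helper.

```python
import numpy as np


def indicator(group_ids, n_groups):
    """Hypothetical helper: dense indicator matrix for one grouping factor."""
    z = np.zeros((len(group_ids), n_groups))
    z[np.arange(len(group_ids)), group_ids] = 1.0
    return z


subj_ids = np.array([0, 1, 2, 0, 1, 2])
item_ids = np.array([0, 0, 0, 1, 1, 1])
# One block of columns per factor, placed side by side.
Z = np.hstack([indicator(subj_ids, 3), indicator(item_ids, 4)])
```

Every row of `Z` has exactly two nonzero entries: one in the subject block and one in the item block.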
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
group_ids_list | list[NDArray[intp]] | List of group ID arrays, one per factor. | required |
n_groups_list | list[int] | Number of groups per factor. | required |
X_re_list | list[NDArray[float64] | None] | None | Random effects design per factor, or None for intercepts. | None |
layouts | list[str] | None | Layout per factor. Default: interleaved for all. | None |
Returns:
| Type | Description |
|---|---|
csc_matrix | Sparse Z matrix, shape (n, sum(n_groups_i * n_re_i)). |
Examples:
>>> # (1|subject) + (1|item) with 3 subjects, 4 items
>>> subj_ids = np.array([0, 1, 2, 0, 1, 2])
>>> item_ids = np.array([0, 0, 0, 1, 1, 1])
>>> Z = build_z_crossed(
... group_ids_list=[subj_ids, item_ids],
... n_groups_list=[3, 4]
... )
>>> Z.shape
(6, 7) # 3 subject + 4 item columns
build_z_nested¶
build_z_nested(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None) -> sp.csc_matrix
Build Z matrix for nested random effects.
Nested effects like (1|school/class) create separate random intercepts for each level of the hierarchy. The Z matrix is a horizontal concatenation of Z matrices for each level.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
group_ids_list | list[NDArray[intp]] | List of group ID arrays, ordered [inner, ..., outer]. For (1\|school/class): [class_ids, school_ids]. | required |
n_groups_list | list[int] | Number of groups at each level. | required |
X_re_list | list[NDArray[float64] | None] | None | Random effects design per level, or None for intercepts. | None |
Returns:
| Type | Description |
|---|---|
csc_matrix | Sparse Z matrix, shape (n, sum(n_groups_i * n_re_i)). |
Examples:
>>> # (1|school/class) with 2 schools, 4 classes total
>>> class_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
>>> school_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
>>> Z = build_z_nested(
... group_ids_list=[class_ids, school_ids],
... n_groups_list=[4, 2]
... )
>>> Z.shape
(8, 6) # 4 class columns + 2 school columns
build_z_simple¶
build_z_simple(group_ids: NDArray[np.intp], n_groups: int, X_re: NDArray[np.float64] | None = None, layout: Literal['interleaved', 'blocked'] = 'interleaved') -> sp.csc_matrix
Build Z matrix for single grouping factor.
Constructs Z directly in sparse COO format without dense intermediates. For large-scale data (e.g., InstEval with 73k obs x 4k groups), this uses O(n x n_re) memory instead of O(n x n_groups x n_re).
Handles intercept-only, correlated slopes, and uncorrelated slopes by varying the X_re input and layout parameter.
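The column arithmetic behind the two layouts can be sketched as follows. `z_triplets` is a hypothetical helper, not the library function: it computes COO-style (row, column, value) triplets, which are scattered into a small dense array here purely for inspection. In a sparse setting the same triplets could be passed to `scipy.sparse.coo_matrix((data, (rows, cols)), shape=...)`.

```python
import numpy as np


def z_triplets(group_ids, n_groups, X_re, layout="interleaved"):
    """Hypothetical helper: COO triplets for a single grouping factor."""
    n, n_re = X_re.shape
    rows = np.repeat(np.arange(n), n_re)   # each observation fills n_re entries
    k = np.tile(np.arange(n_re), n)        # random-effect index of each entry
    g = np.repeat(group_ids, n_re)         # group index of each entry
    if layout == "interleaved":
        cols = g * n_re + k                # [g1_int, g1_slope, g2_int, ...]
    else:
        cols = k * n_groups + g            # [g1_int, g2_int, ..., g1_slope, ...]
    return rows, cols, X_re.ravel()


group_ids = np.array([0, 0, 1, 1])
X_re = np.column_stack([np.ones(4), [1.0, 2.0, 1.0, 2.0]])

Z_interleaved = np.zeros((4, 4))
rows, cols, data = z_triplets(group_ids, 2, X_re, "interleaved")
Z_interleaved[rows, cols] = data

Z_blocked = np.zeros((4, 4))
rows, cols, data = z_triplets(group_ids, 2, X_re, "blocked")
Z_blocked[rows, cols] = data
```

Only `n * n_re` values are ever materialized, which is the memory saving the description refers to.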
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
group_ids | NDArray[intp] | Array of group assignments, shape (n,), values 0..n_groups-1. | required |
n_groups | int | Total number of groups. | required |
X_re | NDArray[float64] | None | Random effects design matrix, shape (n, n_re). - None or column of 1s: intercept only - Multiple columns: intercept + slopes | None |
layout | Literal['interleaved', 'blocked'] | Column ordering. - “interleaved”: [g1_int, g1_slope, g2_int, g2_slope, ...] - “blocked”: [g1_int, g2_int, ..., g1_slope, g2_slope, ...] | 'interleaved' |
Returns:
| Type | Description |
|---|---|
csc_matrix | Sparse Z matrix in CSC format, shape (n, n_groups * n_re). |
Examples:
>>> # Random intercept only
>>> group_ids = np.array([0, 0, 1, 1])
>>> Z = build_z_simple(group_ids, n_groups=2)
>>> Z.shape
(4, 2)
>>> # Random intercept + slope (correlated)
>>> X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="interleaved")
>>> Z.shape
(4, 4)
>>> # Uncorrelated random effects
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="blocked")
>>> Z.shape
(4, 4)
convert_coding_to_hypothesis¶
convert_coding_to_hypothesis(coding_matrix: NDArray[np.float64]) -> NDArray[np.float64]
Convert a coding matrix back to interpretable hypothesis contrasts.
This is the inverse of array_to_coding_matrix. Given a coding matrix (n_levels, n_levels-1), returns the hypothesis matrix where each row represents the linear combination of factor levels being compared.
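One standard way to recover hypotheses from a coding matrix is to prepend an intercept column, invert, and drop the intercept row. The sketch below assumes that relationship; it is not confirmed to be how this function is implemented.

```python
import numpy as np

# Treatment coding for three levels (reference level first).
coding = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0]])

# Prepend the intercept column so the matrix is square and invertible.
full = np.column_stack([np.ones(coding.shape[0]), coding])
# Row 0 of the inverse is the intercept hypothesis; the rest are contrasts.
hypothesis = np.linalg.inv(full)[1:]
```

For treatment coding this yields the rows `[-1, 1, 0]` and `[-1, 0, 1]`, that is, each non-reference level minus the reference level.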
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
coding_matrix | NDArray[float64] | Coding matrix of shape (n_levels, n_levels - 1). | required |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Hypothesis matrix of shape (n_levels - 1, n_levels). Each row represents a contrast hypothesis (coefficients for factor levels). |
Examples:
>>> cm = treatment_coding(['A', 'B', 'C'])
>>> convert_coding_to_hypothesis(cm)
array([[-1., 1., 0.],
[-1., 0., 1.]])
extract_base_term¶
extract_base_term(name: str) -> str
Extract base term name from column name.
For categorical variables, strips the level suffix. For interactions, extracts base terms without levels.
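A regular expression is enough to illustrate the stripping rule. `base_term` is a hypothetical stand-in that assumes levels are always written in square brackets and contain no `]` character.

```python
import re


def base_term(name):
    """Hypothetical stand-in: drop every [level] suffix from a column name."""
    return re.sub(r"\[[^\]]*\]", "", name)


examples = {name: base_term(name) for name in ["treatment[A]", "x", "treatment[A]:x"]}
```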
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Column name from design matrix. | required |
Returns:
| Type | Description |
|---|---|
str | Base term name. |
Examples:
>>> extract_base_term("treatment[A]")
'treatment'
>>> extract_base_term("x")
'x'
>>> extract_base_term("treatment[A]:x")
'treatment:x'
extract_categorical_variables¶
extract_categorical_variables(X_names: tuple[str, ...] | list[str]) -> set[str]
Find all categorical base variable names from design matrix columns.
Scans column names for bracket patterns and extracts unique base terms.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X_names | tuple[str, ...] | list[str] | Column names from design matrix. | required |
Returns:
| Type | Description |
|---|---|
set[str] | Set of categorical variable names (base terms, no levels). |
Examples:
>>> extract_categorical_variables(["Intercept", "x", "treatment[A]", "treatment[B]"])
{'treatment'}
extract_level_from_column¶
extract_level_from_column(name: str, focal_var: str) -> str | None
Extract level value for a specific focal variable from column name.
Used when building reference grids to identify which column corresponds to which level of the focal variable.
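The lookup can be sketched by splitting a column name into its interaction components and matching each against the focal variable. `level_for` is a hypothetical stand-in, and it assumes neither variable names nor levels contain a `:` character.

```python
import re


def level_for(name, focal_var):
    """Hypothetical stand-in: level of focal_var encoded in a column name."""
    for part in name.split(":"):
        match = re.fullmatch(r"(.+?)\[(.*)\]", part)
        if match and match.group(1) == focal_var:
            return match.group(2)
    return None


results = [
    level_for("treatment[A]", "treatment"),
    level_for("treatment[A]:x", "treatment"),
    level_for("x", "treatment"),
    level_for("group[1]", "treatment"),
]
```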
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Column name (e.g., “treatment[A]” or “treatment[A]:x”). | required |
focal_var | str | The focal variable name (e.g., “treatment”). | required |
Returns:
| Type | Description |
|---|---|
str | None | Level value if column is for focal_var, else None. |
Examples:
>>> extract_level_from_column("treatment[A]", "treatment")
'A'
>>> extract_level_from_column("treatment[A]:x", "treatment")
'A'
>>> extract_level_from_column("x", "treatment")
None
>>> extract_level_from_column("group[1]", "treatment")
None
helmert_coding¶
helmert_coding(levels: list[str]) -> NDArray[np.float64]
Build Helmert contrast matrix.
Helmert coding compares each level to the mean of all previous levels. Column j contrasts level j+1 against the average of levels 0..j.
This is equivalent to R’s contr.helmert() (scaled to unit contrasts).
Matrix structure for 4 levels::
Contrast | A | B | C | D
-----------|---------|---------|---------|--------
B vs A | -1/2 | 1/2 | 0 | 0
C vs A,B | -1/3 | -1/3 | 2/3 | 0
D vs A,B,C | -1/4 | -1/4 | -1/4 | 3/4
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. Must have >= 2 elements. | required |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order. |
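The pattern in the matrix structure shown for this function can be generated directly. This is a minimal sketch with a hypothetical `helmert` helper that takes the number of levels rather than level names.

```python
import numpy as np


def helmert(n_levels):
    """Hypothetical helper: scaled Helmert columns for n_levels levels."""
    m = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        # Levels 0..j share the weight -1/(j+2) ...
        m[: j + 1, j] = -1.0 / (j + 2)
        # ... and level j+1 gets (j+1)/(j+2), so the column sums to zero.
        m[j + 1, j] = (j + 1.0) / (j + 2)
    return m


H3 = helmert(3)
H4 = helmert(4)
```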
Examples:
>>> helmert_coding(['A', 'B', 'C'])
array([[-0.5 , -0.33333333],
[ 0.5 , -0.33333333],
[ 0. , 0.66666667]])
helmert_coding_labels¶
helmert_coding_labels(levels: list[str]) -> list[str]
Get column labels for Helmert contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
Returns:
| Type | Description |
|---|---|
list[str] | List of labels like ['B vs prev', 'C vs prev', ...]. |
Examples:
>>> helmert_coding_labels(['A', 'B', 'C'])
['B vs prev', 'C vs prev']
identify_column_type¶
identify_column_type(name: str) -> Literal['intercept', 'continuous', 'categorical']
Identify column type from name (simplified version).
This is a lightweight alternative to parse_design_column_name() when only the type is needed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Column name from design matrix. | required |
Returns:
| Type | Description |
|---|---|
Literal['intercept', 'continuous', 'categorical'] | Column type as string literal. |
parse_design_column_name¶
parse_design_column_name(name: str) -> DesignColumnInfo
Parse design matrix column name into components.
Handles standard R/formula naming conventions:
“Intercept” → intercept type
“x” → continuous variable
“treatment[A]” → categorical dummy for level A
“x:z” → continuous interaction
“treatment[A]:x” → categorical × continuous interaction
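The naming conventions above can be sketched with a small dataclass and two regular expressions. `ColumnInfo` and `parse_column` are hypothetical stand-ins for `DesignColumnInfo` and `parse_design_column_name`; the real parser may treat edge cases differently, for example interactions between two categorical variables.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnInfo:  # hypothetical stand-in for DesignColumnInfo
    raw_name: str
    base_term: str
    level: Optional[str]
    column_type: str
    is_interaction: bool = False


def parse_column(name):
    """Hypothetical stand-in: classify a column from its name alone."""
    if name == "Intercept":
        return ColumnInfo(name, name, None, "intercept")
    levels = re.findall(r"\[([^\]]*)\]", name)  # bracketed level values
    return ColumnInfo(
        raw_name=name,
        base_term=re.sub(r"\[[^\]]*\]", "", name),
        level=levels[0] if levels else None,
        column_type="categorical" if levels else "continuous",
        is_interaction=":" in name,
    )


parsed = {n: parse_column(n) for n in ["Intercept", "x", "treatment[A]", "x:z", "treatment[A]:x"]}
```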
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Column name from design matrix. | required |
Returns:
| Type | Description |
|---|---|
DesignColumnInfo | DesignColumnInfo with parsed components. |
Examples:
>>> parse_design_column_name("Intercept")
DesignColumnInfo(raw_name='Intercept', base_term='Intercept',
level=None, column_type='intercept', is_interaction=False)
>>> parse_design_column_name("x")
DesignColumnInfo(raw_name='x', base_term='x',
level=None, column_type='continuous', is_interaction=False)
>>> parse_design_column_name("treatment[A]")
DesignColumnInfo(raw_name='treatment[A]', base_term='treatment',
level='A', column_type='categorical', is_interaction=False)
>>> parse_design_column_name("treatment[A]:x")
DesignColumnInfo(raw_name='treatment[A]:x', base_term='treatment:x',
level='A', column_type='categorical', is_interaction=True)
poly_coding¶
poly_coding(levels: list[str]) -> NDArray[np.float64]
Build orthogonal polynomial contrast matrix.
Polynomial coding creates orthogonal contrasts representing linear, quadratic, cubic, etc. trends across ordered factor levels. This is equivalent to R’s contr.poly() function.
The contrasts are orthonormal (orthogonal and unit length), making them suitable for testing polynomial trends in ordered categorical variables.
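Orthonormal polynomial contrasts can be obtained by orthogonalizing the powers of equally spaced points. The sketch below uses a QR factorization of a Vandermonde matrix; `poly_contrasts` is a hypothetical helper and the sign convention (positive diagonal of R) is an assumption chosen to match the examples for this function.

```python
import numpy as np


def poly_contrasts(n_levels):
    """Hypothetical helper: orthonormal polynomial contrasts via QR."""
    x = np.arange(1, n_levels + 1, dtype=float)
    x -= x.mean()                                     # center the evaluation points
    vander = np.vander(x, n_levels, increasing=True)  # columns 1, x, x^2, ...
    q, r = np.linalg.qr(vander)
    q = q * np.sign(np.diag(r))                       # fix signs so R has a positive diagonal
    return q[:, 1:]                                   # drop the constant column


P3 = poly_contrasts(3)
P4 = poly_contrasts(4)
```

Dropping the first column removes the constant term, leaving the linear, quadratic, and higher-order trends.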
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. The order determines the polynomial evaluation points. | required |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Column 0 is linear (.L), column 1 is quadratic (.Q), etc. |
Examples:
>>> poly_coding(['low', 'medium', 'high'])
array([[-0.70710678, 0.40824829],
[ 0. , -0.81649658],
[ 0.70710678, 0.40824829]])
>>> poly_coding(['A', 'B', 'C', 'D'])
array([[-0.67082039, 0.5 , -0.2236068 ],
[-0.2236068 , -0.5 , 0.67082039],
[ 0.2236068 , -0.5 , -0.67082039],
[ 0.67082039, 0.5 , 0.2236068 ]])
Note: Level names are not used in computation - only the count and order matter. The polynomial is evaluated at equally-spaced points 1, 2, ..., n.
poly_coding_labels¶
poly_coding_labels(levels: list[str]) -> list[str]
Get column labels for polynomial contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
Returns:
| Type | Description |
|---|---|
list[str] | List of polynomial degree labels: ['.L', '.Q', '.C', '^4', '^5', ...]. |
sequential_coding¶
sequential_coding(levels: list[str]) -> NDArray[np.float64]
Build sequential (successive differences) contrast matrix.
Sequential coding compares each level to the previous level, producing contrasts that capture successive differences. This is equivalent to R’s MASS::contr.sdif() function.
The matrix is constructed so that each column j represents the difference between level j+1 and level j. The resulting coefficients in a regression model estimate these successive differences directly.
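The successive-difference matrix has a simple closed form, sketched below with a hypothetical `sdif` helper that takes the number of levels. The check at the end shows the defining property: adjacent rows differ by exactly one unit in one column, so coefficient j carries the step from level j to level j+1.

```python
import numpy as np


def sdif(n_levels):
    """Hypothetical helper: successive-difference coding for n_levels levels."""
    k = n_levels
    m = np.empty((k, k - 1))
    for j in range(k - 1):
        m[: j + 1, j] = -(k - j - 1) / k  # levels up to and including j
        m[j + 1 :, j] = (j + 1) / k       # levels after j
    return m


S3 = sdif(3)
S4 = sdif(4)
# Differences between adjacent rows form an identity matrix.
row_steps = np.diff(S4, axis=0)
```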
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Column j represents the contrast: level[j+1] - level[j]. |
Examples:
>>> sequential_coding(['A', 'B', 'C'])
array([[-0.66666667, -0.33333333],
[ 0.33333333, -0.33333333],
[ 0.33333333, 0.66666667]])
>>> sequential_coding(['low', 'medium', 'high', 'very_high'])
array([[-0.75, -0.5 , -0.25],
[ 0.25, -0.5 , -0.25],
[ 0.25, 0.5 , -0.25],
[ 0.25, 0.5 , 0.75]])
Note: This coding is most meaningful for ordered factors where you want to estimate the “step” from one level to the next. Unlike polynomial contrasts, it does not assume equally-spaced levels.
The matrix structure ensures that multiplying by the coefficient vector gives interpretable successive differences.
sequential_coding_labels¶
sequential_coding_labels(levels: list[str]) -> list[str]
Get column labels for sequential contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
Returns:
| Type | Description |
|---|---|
list[str] | List of successive difference labels like ['B-A', 'C-B', ...]. |
Examples:
>>> sequential_coding_labels(['A', 'B', 'C'])
['B-A', 'C-B']
>>> sequential_coding_labels(['low', 'medium', 'high'])
['medium-low', 'high-medium']
sum_coding¶
sum_coding(levels: list[str], omit: str | None = None) -> NDArray[np.float64]
Build sum (effects) contrast matrix.
Sum coding sets the omitted level to all -1s, and each other level gets a one-hot encoded row. This centers the effects around zero, making coefficients interpretable as deviations from the grand mean.
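The construction can be sketched as an identity matrix with the omitted level's column removed and its row overwritten with -1. `sum_contrasts` is a hypothetical helper mirroring the description, not the library function.

```python
import numpy as np


def sum_contrasts(levels, omit=None):
    """Hypothetical helper: sum (effects) coding for a list of level names."""
    omit = levels[-1] if omit is None else omit
    omit_idx = levels.index(omit)
    # One-hot rows for the kept levels ...
    m = np.delete(np.eye(len(levels)), omit_idx, axis=1)
    # ... and a row of -1s for the omitted level, so columns sum to zero.
    m[omit_idx, :] = -1.0
    return m


default_omit = sum_contrasts(["A", "B", "C"])
omit_first = sum_contrasts(["A", "B", "C"], omit="A")
```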
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
omit | str | None | Level to omit (gets -1s). Defaults to last level. | None |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order. |
Examples:
>>> sum_coding(['A', 'B', 'C'])
array([[ 1., 0.],
[ 0., 1.],
[-1., -1.]])
>>> sum_coding(['A', 'B', 'C'], omit='A')
array([[-1., -1.],
[ 1., 0.],
[ 0., 1.]])
sum_coding_labels¶
sum_coding_labels(levels: list[str], omit: str | None = None) -> list[str]
Get column labels for sum contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
omit | str | None | Level to omit. Defaults to last level. | None |
Returns:
| Type | Description |
|---|---|
list[str] | List of non-omitted level names (column labels). |
treatment_coding¶
treatment_coding(levels: list[str], reference: str | None = None) -> NDArray[np.float64]
Build treatment (dummy) contrast matrix.
Treatment coding sets the reference level to all zeros, and each other level gets a one-hot encoded row. This is the most common coding for regression models with an intercept.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
reference | str | None | Reference level name. Defaults to first level. | None |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order. |
Examples:
>>> treatment_coding(['A', 'B', 'C'])
array([[0., 0.],
[1., 0.],
[0., 1.]])
>>> treatment_coding(['A', 'B', 'C'], reference='B')
array([[1., 0.],
[0., 0.],
[0., 1.]])
treatment_coding_labels¶
treatment_coding_labels(levels: list[str], reference: str | None = None) -> list[str]
Get column labels for treatment contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
reference | str | None | Reference level name. Defaults to first level. | None |
Returns:
| Type | Description |
|---|---|
list[str] | List of non-reference level names (column labels). |
Modules¶
coding¶
Contrast matrix builders for categorical variable encoding.
This module provides functions to create contrast matrices for encoding categorical variables in design matrices. These are distinct from the EMM contrast matrices in
Key concept: A contrast matrix maps k categorical levels to k-1 columns in the design matrix (assuming an intercept absorbs one degree of freedom).
Treatment (dummy coding): Reference level = 0, others = one-hot
Sum (effects coding): Omitted level = -1s, others = one-hot
Poly (orthogonal polynomial): Linear, quadratic, cubic, etc. trends
Custom: User-specified contrast vectors converted via array_to_coding_matrix
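To see how a k x (k-1) contrast matrix becomes design matrix columns, each observation's level can be used to look up its row of the coding matrix. The data below are made up, and the sketch uses a hand-written treatment coding rather than calling the library.

```python
import numpy as np

levels = ["A", "B", "C"]
# Treatment coding with A as reference: one row per level, k - 1 columns.
coding = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0]])

observed = ["B", "A", "C", "B"]  # hypothetical factor values
codes = np.array([levels.index(v) for v in observed])

# Row lookup expands the factor; the intercept supplies the k-th degree of freedom.
X = np.column_stack([np.ones(len(observed)), coding[codes]])
```

Swapping `coding` for any other k x (k-1) contrast matrix changes the interpretation of the coefficients without changing the column count.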
Examples:
>>> from coding import treatment_coding, sum_coding, poly_coding, array_to_coding_matrix
>>> treatment_coding(['A', 'B', 'C'])
array([[0., 0.],
[1., 0.],
[0., 1.]])
>>> sum_coding(['A', 'B', 'C'])
array([[ 1., 0.],
[ 0., 1.],
[-1., -1.]])
>>> poly_coding(['A', 'B', 'C']) # Linear and quadratic trends
array([[-0.707..., 0.408...],
[ 0. ..., -0.816...],
[ 0.707..., 0.408...]])
>>> # Custom contrast: A vs average(B, C)
>>> array_to_coding_matrix([[-1, 0.5, 0.5]], n_levels=3)
array([[-0.816..., ...],
       [ 0.408..., ...],
       [ 0.408..., ...]])
Functions:
| Name | Description |
|---|---|
array_to_coding_matrix | Convert user-specified contrasts to a coding matrix for design matrices. |
convert_coding_to_hypothesis | Convert a coding matrix back to interpretable hypothesis contrasts. |
helmert_coding | Build Helmert contrast matrix. |
helmert_coding_labels | Get column labels for Helmert contrast. |
poly_coding | Build orthogonal polynomial contrast matrix. |
poly_coding_labels | Get column labels for polynomial contrast. |
sequential_coding | Build sequential (successive differences) contrast matrix. |
sequential_coding_labels | Get column labels for sequential contrast. |
sum_coding | Build sum (effects) contrast matrix. |
sum_coding_labels | Get column labels for sum contrast. |
treatment_coding | Build treatment (dummy) contrast matrix. |
treatment_coding_labels | Get column labels for treatment contrast. |
Functions¶
array_to_coding_matrix¶
array_to_coding_matrix(contrasts: NDArray[np.floating] | list[float] | list[list[float]], n_levels: int, *, normalize: bool = True) -> NDArray[np.float64]
Convert user-specified contrasts to a coding matrix for design matrices.
This function converts “human-readable” contrast specifications (where each row represents a hypothesis like “A vs average(B, C)”) into a coding matrix suitable for use in regression design matrices.
The algorithm uses QR decomposition to auto-complete under-specified contrasts with orthogonal contrasts, following the approach from R’s gmodels::make.contrasts() and pymer4’s con2R().
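The auto-completion step can be illustrated with a small NumPy sketch. This is not the library's implementation: `complete_contrasts_sketch` is a hypothetical helper that covers only the QR completion, assumes the supplied contrasts are linearly independent and sum to zero, and leaves out the normalization and the final conversion to a coding matrix.

```python
import numpy as np

def complete_contrasts_sketch(contrasts, n_levels):
    """Pad user contrasts with orthogonal columns (hypothetical helper)."""
    C = np.atleast_2d(np.asarray(contrasts, dtype=float))  # (n_contrasts, n_levels)
    # Columns: the constant (intercept) vector followed by the user contrasts.
    A = np.column_stack([np.ones(n_levels), C.T])
    # mode="complete" returns a full n_levels x n_levels orthonormal basis;
    # its trailing columns are orthogonal to every column of A.
    Q, _ = np.linalg.qr(A, mode="complete")
    extra = Q[:, A.shape[1]:]
    return np.column_stack([C.T, extra])                   # (n_levels, n_levels - 1)

# One contrast for three levels: "A vs average(B, C)".
full = complete_contrasts_sketch([-1, 0.5, 0.5], n_levels=3)
```

The completed column is determined only up to sign, which is why a real implementation has to fix a sign convention somewhere.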
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
contrasts | NDArray[floating] | list[float] | list[list[float]] | User-specified contrasts, given either as a 1D array/list (a single contrast vector of length n_levels) or as a 2D array/list (multiple contrasts, shape (n_contrasts, n_levels)). Each row must sum to zero to be a valid contrast. | required |
n_levels | int | Number of factor levels. Must match contrast dimensions. | required |
normalize | bool | If True, normalize each contrast vector by its L2 norm before conversion. This puts the contrasts on a unit-length scale, similar to orthogonal polynomial contrasts. | True |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Coding matrix of shape (n_levels, n_levels - 1). Each row corresponds to a factor level, each column to a design matrix column. |
Examples:
>>> # Single contrast: A vs average(B, C)
>>> array_to_coding_matrix([-1, 0.5, 0.5], n_levels=3)
array([[-0.81649658,  0.        ],
       [ 0.40824829, -0.70710678],
       [ 0.40824829,  0.70710678]])
>>> # Multiple contrasts: A vs B, and (A,B) vs C
>>> array_to_coding_matrix([[-1, 1, 0], [-0.5, -0.5, 1]], n_levels=3)
array([[-0.5       , -0.28867513],
       [ 0.5       , -0.28867513],
       [ 0.        ,  0.57735027]])
Note: The returned matrix has n_levels-1 columns because one degree of freedom is absorbed by the intercept. If you specify fewer than n_levels-1 contrasts, the remaining columns are auto-completed with orthogonal contrasts via QR decomposition.
convert_coding_to_hypothesis¶
convert_coding_to_hypothesis(coding_matrix: NDArray[np.float64]) -> NDArray[np.float64]
Convert a coding matrix back to interpretable hypothesis contrasts.
This is the inverse of array_to_coding_matrix. Given a coding matrix (n_levels, n_levels-1), returns the hypothesis matrix where each row represents the linear combination of factor levels being compared.
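One standard way to recover hypothesis contrasts is to prepend an intercept column to the coding matrix, invert the resulting square matrix, and drop the intercept row. The sketch below illustrates that idea under this assumption; `coding_to_hypothesis_sketch` is a hypothetical name and the library's own function may be implemented differently.

```python
import numpy as np

def coding_to_hypothesis_sketch(coding):
    """Invert [1 | coding] and drop the intercept row (illustrative only)."""
    coding = np.asarray(coding, dtype=float)
    k = coding.shape[0]
    full = np.column_stack([np.ones(k), coding])  # (k, k) design for the k cell means
    return np.linalg.inv(full)[1:, :]             # (k - 1, k) hypothesis rows

# Treatment coding for levels A, B, C with A as the reference.
treatment = np.array([[0.0, 0.0],
                      [1.0, 0.0],
                      [0.0, 1.0]])
H = coding_to_hypothesis_sketch(treatment)
```

Each row of `H` gives the weights on the level means that the matching coefficient estimates; for treatment coding these are the differences "B minus A" and "C minus A".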
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
coding_matrix | NDArray[float64] | Coding matrix of shape (n_levels, n_levels - 1). | required |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Hypothesis matrix of shape (n_levels - 1, n_levels). Each row represents a contrast hypothesis (coefficients for factor levels). |
Examples:
>>> cm = treatment_coding(['A', 'B', 'C'])
>>> convert_coding_to_hypothesis(cm)
array([[-1.,  1.,  0.],
       [-1.,  0.,  1.]])
helmert_coding¶
helmert_coding(levels: list[str]) -> NDArray[np.float64]
Build Helmert contrast matrix.
Helmert coding compares each level to the mean of all previous levels. Column j contrasts level j+1 against the average of levels 0..j.
This is equivalent to R’s contr.helmert() (scaled to unit contrasts).
Matrix structure for 4 levels::
Contrast | A | B | C | D
-----------|---------|---------|---------|--------
B vs A | -1/2 | 1/2 | 0 | 0
C vs A,B | -1/3 | -1/3 | 2/3 | 0
D vs A,B,C | -1/4 | -1/4 | -1/4 | 3/4
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. Must have >= 2 elements. | required |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). |
NDArray[float64] | Row order matches input levels order. |
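The matrix structure shown above can be written down directly: in column j, the first j+1 levels share a weight of -1/(j+2) and level j+1 receives (j+1)/(j+2). The following is a minimal sketch of that pattern, with `helmert_sketch` as a hypothetical name that takes a level count rather than a list of names.

```python
import numpy as np

def helmert_sketch(n_levels):
    """Scaled Helmert contrasts following the documented table (illustrative)."""
    M = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        M[: j + 1, j] = -1.0 / (j + 2)       # earlier levels share the negative weight
        M[j + 1, j] = (j + 1.0) / (j + 2)    # the level being compared
    return M

H3 = helmert_sketch(3)
H4 = helmert_sketch(4)
```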
Examples:
>>> helmert_coding(['A', 'B', 'C'])
array([[-0.5       , -0.33333333],
       [ 0.5       , -0.33333333],
       [ 0.        ,  0.66666667]])
helmert_coding_labels¶
helmert_coding_labels(levels: list[str]) -> list[str]
Get column labels for Helmert contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
Returns:
| Type | Description |
|---|---|
list[str] | List of labels like [‘B vs prev’, ‘C vs prev’, ...]. |
Examples:
>>> helmert_coding_labels(['A', 'B', 'C'])
['B vs prev', 'C vs prev']
poly_coding¶
poly_coding(levels: list[str]) -> NDArray[np.float64]
Build orthogonal polynomial contrast matrix.
Polynomial coding creates orthogonal contrasts representing linear, quadratic, cubic, etc. trends across ordered factor levels. This is equivalent to R’s contr.poly() function.
The contrasts are orthonormal (orthogonal and unit length), making them suitable for testing polynomial trends in ordered categorical variables.
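Orthonormal polynomial contrasts can be obtained by orthogonalizing the powers of centered, equally spaced scores. The sketch below uses a QR decomposition for this, which is the general approach behind R's contr.poly(); `poly_sketch` is a hypothetical name and the library's actual code path is an assumption here.

```python
import numpy as np

def poly_sketch(n_levels):
    """Orthonormal polynomial contrasts via QR of a Vandermonde matrix."""
    x = np.arange(1, n_levels + 1, dtype=float)
    x -= x.mean()                                  # centered scores 1..n
    V = np.vander(x, n_levels, increasing=True)    # columns 1, x, x^2, ...
    Q, R = np.linalg.qr(V)
    Z = Q * np.diag(R)                             # removes QR's arbitrary column signs
    Z /= np.linalg.norm(Z, axis=0)                 # rescale each column to unit length
    return Z[:, 1:]                                # drop the constant column

P3 = poly_sketch(3)
P4 = poly_sketch(4)
```

Multiplying each column of `Q` by the matching diagonal entry of `R` yields the Gram-Schmidt residual of that power, so the signs do not depend on the QR routine's conventions.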
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. The order determines the polynomial evaluation points. | required |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). |
NDArray[float64] | Column 0 is linear (.L), column 1 is quadratic (.Q), etc. |
Examples:
>>> poly_coding(['low', 'medium', 'high'])
array([[-0.70710678,  0.40824829],
       [ 0.        , -0.81649658],
       [ 0.70710678,  0.40824829]])
>>> poly_coding(['A', 'B', 'C', 'D'])
array([[-0.67082039,  0.5       , -0.2236068 ],
       [-0.2236068 , -0.5       ,  0.67082039],
       [ 0.2236068 , -0.5       , -0.67082039],
       [ 0.67082039,  0.5       ,  0.2236068 ]])
Note: Level names are not used in computation - only the count and order matter. The polynomial is evaluated at equally-spaced points 1, 2, ..., n.
poly_coding_labels¶
poly_coding_labels(levels: list[str]) -> list[str]
Get column labels for polynomial contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
Returns:
| Type | Description |
|---|---|
list[str] | List of polynomial degree labels: ['.L', '.Q', '.C', '^4', '^5', ...]. |
sequential_coding¶
sequential_coding(levels: list[str]) -> NDArray[np.float64]
Build sequential (successive differences) contrast matrix.
Sequential coding compares each level to the previous level, producing contrasts that capture successive differences. This is equivalent to R’s MASS::contr.sdif() function.
The matrix is constructed so that each column j represents the difference between level j+1 and level j. The resulting coefficients in a regression model estimate these successive differences directly.
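A small numerical check makes the "successive differences" claim concrete. The sketch below builds the contr.sdif-style matrix by hand (`sdif_sketch` is a hypothetical name taking a level count) and solves for the coefficients that reproduce three made-up cell means.

```python
import numpy as np

def sdif_sketch(n_levels):
    """Successive-difference coding in the style of MASS::contr.sdif (illustrative)."""
    k = n_levels
    M = np.empty((k, k - 1))
    for j in range(k - 1):
        M[: j + 1, j] = -(k - 1 - j) / k   # levels up to and including j
        M[j + 1 :, j] = (j + 1) / k        # levels after j
    return M

C = sdif_sketch(3)
cell_means = np.array([10.0, 12.0, 17.0])   # hypothetical means for levels A, B, C
X = np.column_stack([np.ones(3), C])        # intercept + coded columns
beta = np.linalg.solve(X, cell_means)       # grand mean 13, then B-A = 2 and C-B = 5
```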
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). |
NDArray[float64] | Column j represents the contrast: level[j+1] - level[j]. |
Examples:
>>> sequential_coding(['A', 'B', 'C'])
array([[-0.66666667, -0.33333333],
       [ 0.33333333, -0.33333333],
       [ 0.33333333,  0.66666667]])
>>> sequential_coding(['low', 'medium', 'high', 'very_high'])
array([[-0.75, -0.5 , -0.25],
       [ 0.25, -0.5 , -0.25],
       [ 0.25,  0.5 , -0.25],
       [ 0.25,  0.5 ,  0.75]])
Note: This coding is most meaningful for ordered factors where you want to estimate the “step” from one level to the next. Unlike polynomial contrasts, it does not assume equally-spaced levels.
The matrix structure ensures that multiplying by the coefficient vector gives interpretable successive differences.
sequential_coding_labels¶
sequential_coding_labels(levels: list[str]) -> list[str]
Get column labels for sequential contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
Returns:
| Type | Description |
|---|---|
list[str] | List of successive difference labels like [‘B-A’, ‘C-B’, ...]. |
Examples:
>>> sequential_coding_labels(['A', 'B', 'C'])
['B-A', 'C-B']
>>> sequential_coding_labels(['low', 'medium', 'high'])
['medium-low', 'high-medium']
sum_coding¶
sum_coding(levels: list[str], omit: str | None = None) -> NDArray[np.float64]
Build sum (effects) contrast matrix.
Sum coding sets the omitted level to all -1s, and each other level gets a one-hot encoded row. This centers the effects around zero, making coefficients interpretable as deviations from the grand mean.
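The "deviation from the grand mean" interpretation can be verified numerically. In this sketch `sum_sketch` is a hypothetical stand-in that identifies the omitted level by integer position rather than by name, and the cell means are invented for illustration.

```python
import numpy as np

def sum_sketch(n_levels, omit=None):
    """Sum (effects) coding; `omit` is a row index, defaulting to the last level."""
    omit = n_levels - 1 if omit is None else omit
    keep = [i for i in range(n_levels) if i != omit]
    M = np.zeros((n_levels, n_levels - 1))
    M[keep, np.arange(n_levels - 1)] = 1.0   # one-hot rows for the kept levels
    M[omit, :] = -1.0                        # omitted level gets all -1s
    return M

S = sum_sketch(3)
cell_means = np.array([1.0, 2.0, 6.0])       # hypothetical means for A, B, C
beta = np.linalg.solve(np.column_stack([np.ones(3), S]), cell_means)
# beta[0] is the unweighted grand mean (3.0); beta[1:] are the deviations of
# A and B from it (-2.0 and -1.0). The deviation of C is minus their sum.
```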
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
omit | str | None | Level to omit (gets -1s). Defaults to last level. | None |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). |
NDArray[float64] | Row order matches input levels order. |
Examples:
>>> sum_coding(['A', 'B', 'C'])
array([[ 1.,  0.],
       [ 0.,  1.],
       [-1., -1.]])
>>> sum_coding(['A', 'B', 'C'], omit='A')
array([[-1., -1.],
       [ 1.,  0.],
       [ 0.,  1.]])
sum_coding_labels¶
sum_coding_labels(levels: list[str], omit: str | None = None) -> list[str]
Get column labels for sum contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
omit | str | None | Level to omit. Defaults to last level. | None |
Returns:
| Type | Description |
|---|---|
list[str] | List of non-omitted level names (column labels). |
treatment_coding¶
treatment_coding(levels: list[str], reference: str | None = None) -> NDArray[np.float64]
Build treatment (dummy) contrast matrix.
Treatment coding sets the reference level to all zeros, and each other level gets a one-hot encoded row. This is the most common coding for regression models with an intercept.
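As a rough illustration of what this coding implies for the coefficients, the sketch below builds the matrix and its column labels together and fits three invented cell means. `treatment_sketch` is a hypothetical helper, not the library function.

```python
import numpy as np

def treatment_sketch(levels, reference=None):
    """Treatment (dummy) coding plus the labels of its columns (illustrative)."""
    ref = levels[0] if reference is None else reference
    labels = [lv for lv in levels if lv != ref]      # non-reference levels
    M = np.zeros((len(levels), len(labels)))
    for j, lv in enumerate(labels):
        M[levels.index(lv), j] = 1.0                 # one-hot row for each other level
    return M, labels

T, labels = treatment_sketch(["A", "B", "C"], reference="B")
T_default, _ = treatment_sketch(["A", "B", "C"])
cell_means = np.array([1.0, 2.0, 6.0])               # hypothetical means for A, B, C
beta = np.linalg.solve(np.column_stack([np.ones(3), T_default]), cell_means)
# With A as reference: intercept = mean of A, then B - A and C - A.
```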
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
reference | str | None | Reference level name. Defaults to first level. | None |
Returns:
| Type | Description |
|---|---|
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). |
NDArray[float64] | Row order matches input levels order. |
Examples:
>>> treatment_coding(['A', 'B', 'C'])
array([[0., 0.],
       [1., 0.],
       [0., 1.]])
>>> treatment_coding(['A', 'B', 'C'], reference='B')
array([[1., 0.],
       [0., 0.],
       [0., 1.]])
treatment_coding_labels¶
treatment_coding_labels(levels: list[str], reference: str | None = None) -> list[str]
Get column labels for treatment contrast.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
levels | list[str] | Ordered list of categorical level names. | required |
reference | str | None | Reference level name. Defaults to first level. | None |
Returns:
| Type | Description |
|---|---|
list[str] | List of non-reference level names (column labels). |
names¶
Design matrix column name parsing and variable type detection.
Classes:
| Name | Description |
|---|---|
DesignColumnInfo | Parsed design matrix column metadata. |
Functions:
| Name | Description |
|---|---|
extract_base_term | Extract base term name from column name. |
extract_categorical_variables | Find all categorical base variable names from design matrix columns. |
identify_column_type | Identify column type from name (simplified version). |
parse_design_column_name | Parse design matrix column name into components. |
Classes¶
DesignColumnInfo¶
DesignColumnInfo(raw_name: str, base_term: str, level: str | None, column_type: Literal['intercept', 'continuous', 'categorical'], is_interaction: bool = False) -> None
Parsed design matrix column metadata.
Attributes:
| Name | Type | Description |
|---|---|---|
raw_name | str | Original column name (e.g., “treatment[A]”). |
base_term | str | Base variable name without level (e.g., “treatment”). |
level | str | None | Level value for categorical, None for continuous (e.g., “A”). |
column_type | Literal[‘intercept’, ‘continuous’, ‘categorical’] | Type classification. |
is_interaction | bool | Whether this is an interaction term. |
Attributes¶
base_term¶
base_term: str
column_type¶
column_type: Literal['intercept', 'continuous', 'categorical']
is_interaction¶
is_interaction: bool = False
level¶
level: str | None
raw_name¶
raw_name: str
Functions¶
extract_base_term¶
extract_base_term(name: str) -> str
Extract base term name from column name.
For categorical variables, strips the level suffix. For interactions, extracts base terms without levels.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Column name from design matrix. | required |
Returns:
| Type | Description |
|---|---|
str | Base term name. |
Examples:
>>> extract_base_term("treatment[A]")
'treatment'
>>> extract_base_term("x")
'x'
>>> extract_base_term("treatment[A]:x")
'treatment:x'
extract_categorical_variables¶
extract_categorical_variables(X_names: tuple[str, ...] | list[str]) -> set[str]
Find all categorical base variable names from design matrix columns.
Scans column names for bracket patterns and extracts unique base terms.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X_names | tuple[str, ...] | list[str] | Column names from design matrix. | required |
Returns:
| Type | Description |
|---|---|
set[str] | Set of categorical variable names (base terms, no levels). |
Examples:
>>> extract_categorical_variables(["Intercept", "x", "treatment[A]", "treatment[B]"])
{'treatment'}
identify_column_type¶
identify_column_type(name: str) -> Literal['intercept', 'continuous', 'categorical']
Identify column type from name (simplified version).
This is a lightweight alternative to parse_design_column_name() when only the type is needed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Column name from design matrix. | required |
Returns:
| Type | Description |
|---|---|
Literal[‘intercept’, ‘continuous’, ‘categorical’] | Column type as string literal. |
parse_design_column_name¶
parse_design_column_name(name: str) -> DesignColumnInfo
Parse design matrix column name into components.
Handles standard R/formula naming conventions:
“Intercept” → intercept type
“x” → continuous variable
“treatment[A]” → categorical dummy for level A
“x:z” → continuous interaction
“treatment[A]:x” → categorical × continuous interaction
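The naming conventions listed above can be handled with a single bracket-matching regular expression. The following is a minimal sketch, not the library's parser: `ColumnInfoSketch` and `parse_column_sketch` are hypothetical names, and the sketch reports only the first bracketed level it finds.

```python
import re
from dataclasses import dataclass
from typing import Optional

_BRACKET = re.compile(r"\[([^\]]*)\]")   # captures the level inside [...]

@dataclass
class ColumnInfoSketch:
    raw_name: str
    base_term: str
    level: Optional[str]
    column_type: str
    is_interaction: bool = False

def parse_column_sketch(name: str) -> ColumnInfoSketch:
    """Classify a design-matrix column name (illustrative only)."""
    if name == "Intercept":
        return ColumnInfoSketch(name, name, None, "intercept")
    levels = _BRACKET.findall(name)
    base = _BRACKET.sub("", name)        # strip every [level] suffix
    return ColumnInfoSketch(
        raw_name=name,
        base_term=base,
        level=levels[0] if levels else None,
        column_type="categorical" if levels else "continuous",
        is_interaction=":" in base,      # checked after stripping, so a colon inside a level is ignored
    )

info = parse_column_sketch("treatment[A]:x")
```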
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Column name from design matrix. | required |
Returns:
| Type | Description |
|---|---|
DesignColumnInfo | DesignColumnInfo with parsed components. |
Examples:
>>> parse_design_column_name("Intercept")
DesignColumnInfo(raw_name='Intercept', base_term='Intercept',
                 level=None, column_type='intercept', is_interaction=False)
>>> parse_design_column_name("x")
DesignColumnInfo(raw_name='x', base_term='x',
                 level=None, column_type='continuous', is_interaction=False)
>>> parse_design_column_name("treatment[A]")
DesignColumnInfo(raw_name='treatment[A]', base_term='treatment',
                 level='A', column_type='categorical', is_interaction=False)
>>> parse_design_column_name("treatment[A]:x")
DesignColumnInfo(raw_name='treatment[A]:x', base_term='treatment:x',
                 level='A', column_type='categorical', is_interaction=True)
reference¶
Reference design matrix (X_ref) construction for marginal effects.
Functions:
| Name | Description |
|---|---|
build_continuous_reference_matrix | Build reference matrix for a continuous focal variable at specific values. |
build_counterfactual_design_matrices | Build counterfactual design matrices for g-computation. |
build_reference_design_matrix | Build design matrix for reference grid points. |
build_reference_row | Build a single row of the reference design matrix. |
Functions¶
build_continuous_reference_matrix¶
build_continuous_reference_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, at_values: tuple[float, ...], X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray
Build reference matrix for a continuous focal variable at specific values.
Creates one row per value in at_values. Each row has covariates at
their means except the focal variable, which is set to the specified value.
For interaction columns involving the focal variable, the interaction
value is properly computed as a product of component values at each
at_value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X_names | tuple[str, ...] | list[str] | Column names from the design matrix. | required |
focal_var | str | Name of the continuous focal variable. | required |
at_values | tuple[float, ...] | Specific values to evaluate the focal variable at. | required |
X_means | ndarray | Column means of the original design matrix, shape (p,). | required |
set_categoricals | dict[str, str] | None | Optional dict mapping non-focal categorical variable names to specific levels for indicator encoding instead of marginalizing at column means. | None |
Returns:
| Type | Description |
|---|---|
ndarray | Reference design matrix X_ref, shape (n_values, p). |
build_counterfactual_design_matrices¶
build_counterfactual_design_matrices(X: np.ndarray, X_names: tuple[str, ...] | list[str], focal_var: str, levels: list[str]) -> list[np.ndarray]
Build counterfactual design matrices for g-computation.
For each focal level, creates a modified copy of the full design matrix
X where the focal variable’s indicator columns are set to match that
level and interaction columns involving the focal variable are recomputed
from component values. Non-focal columns are left unchanged, preserving
the observed covariate distribution.
This is the core building block for weights="observed"
(g-computation / counterfactual prediction). Each returned matrix answers:
“What would the design matrix look like if every observation were assigned
to this focal level?”
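The counterfactual construction described above can be sketched in a few lines of NumPy. This is an illustration rather than the library's code: `counterfactual_sketch` is a hypothetical helper, and it assumes every component of an interaction column also appears as its own main-effect column.

```python
import numpy as np

def counterfactual_sketch(X, X_names, focal_var, level):
    """Copy X with every row assigned to `level` of `focal_var` (illustrative)."""
    col = {name: j for j, name in enumerate(X_names)}
    prefix = focal_var + "["
    X_cf = np.array(X, dtype=float, copy=True)
    # 1) Main-effect dummies of the focal variable become a constant indicator.
    for name, j in col.items():
        if ":" not in name and name.startswith(prefix):
            X_cf[:, j] = 1.0 if name == f"{focal_var}[{level}]" else 0.0
    # 2) Interactions involving the focal variable are rebuilt as products of
    #    their (already updated) component columns.
    for name, j in col.items():
        parts = name.split(":")
        if len(parts) > 1 and any(p.startswith(prefix) for p in parts):
            X_cf[:, j] = np.prod([X_cf[:, col[p]] for p in parts], axis=0)
    return X_cf

X_names = ("Intercept", "x", "treatment[B]", "x:treatment[B]")
X = np.array([[1, 2, 0, 0], [1, 3, 1, 3], [1, 5, 0, 0]], dtype=float)
X_ref = counterfactual_sketch(X, X_names, "treatment", "ref")
X_b = counterfactual_sketch(X, X_names, "treatment", "B")
```

Averaging the model's predictions over the rows of each counterfactual matrix then gives the g-computation estimate for that level.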
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X | ndarray | Original design matrix, shape (N, p). | required |
X_names | tuple[str, ...] | list[str] | Column names from the design matrix, in order. | required |
focal_var | str | Name of the categorical focal variable. | required |
levels | list[str] | List of levels to compute counterfactuals for. | required |
Returns:
| Type | Description |
|---|---|
list[ndarray] | List of counterfactual design matrices (one per level), each shape (N, p). Order matches levels. |
Examples:
For a model y ~ treatment + x + treatment:x with
X_names = ("Intercept", "x", "treatment[B]", "x:treatment[B]")::
mats = build_counterfactual_design_matrices(X, X_names, "treatment", ["ref", "B"])
# mats[0]: treatment set to ref for all rows (treatment[B]=0, x:treatment[B]=0)
# mats[1]: treatment set to B for all rows (treatment[B]=1, x:treatment[B]=x_i)
build_reference_design_matrix¶
build_reference_design_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, levels: list[str], X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray
Build design matrix for reference grid points.
Creates an X_ref matrix with one row per focal variable level. Each row represents a reference point where the focal variable is set to that level and all other covariates are set to their reference values.
Reference value conventions:
Intercept: 1.0
Focal variable dummies: 1.0 if matching level, 0.0 otherwise
Continuous covariates: column mean from X_means
Non-focal categorical dummies: column mean (marginalize over observed proportions)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X_names | tuple[str, ...] | list[str] | Column names from the design matrix, in order. | required |
focal_var | str | Name of the categorical variable to vary across levels. | required |
levels | list[str] | List of levels for the focal variable, defining row order. | required |
X_means | ndarray | Column means of the original design matrix, shape (p,). Used for continuous covariate reference values. | required |
set_categoricals | dict[str, str] | None | Optional dict mapping non-focal categorical variable names to specific levels to pin them at (instead of marginalizing at X_means). E.g. {"Ethnicity": "Asian"} sets the Ethnicity dummies to indicator values for “Asian”. | None |
Returns:
| Type | Description |
|---|---|
ndarray | Reference design matrix X_ref, shape (n_levels, p). |
Examples:
Compute X_ref for treatment EMMs::
X_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
X_means = np.array([1.0, 2.5, 0.33, 0.33]) # means from data
levels = ["ref", "A", "B"]
X_ref = build_reference_design_matrix(X_names, "treatment", levels, X_means)
# X_ref[0] = [1.0, 2.5, 0.0, 0.0] # reference level
# X_ref[1] = [1.0, 2.5, 1.0, 0.0] # level A
# X_ref[2] = [1.0, 2.5, 0.0, 1.0] # level B
Note: The first level is typically the reference level (omitted from dummy coding), so its row has all 0s for the focal variable dummies.
build_reference_row¶
build_reference_row(X_names: tuple[str, ...] | list[str], focal_var: str, focal_level: str, X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray
Build a single row of the reference design matrix.
Creates one reference point where the focal variable is set to the specified level and other covariates are at reference values.
For interaction columns involving the focal variable (e.g.,
Income:Student[Yes] when focal_var="Student"), the value is
computed as the product of component values rather than using the
empirical mean of the interaction column.
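To make the product rule for interaction columns concrete, here is a minimal sketch of a reference-row builder. `reference_row_sketch` is a hypothetical name; the sketch ignores `set_categoricals`, assumes each interaction component has its own main-effect column, and uses invented column means.

```python
import re
import numpy as np

_DUMMY = re.compile(r"^([^\[\]]+)\[([^\]]*)\]$")   # matches "var[level]"

def reference_row_sketch(X_names, focal_var, focal_level, X_means):
    """One reference row: focal dummies as indicators, the rest at column means."""
    means = dict(zip(X_names, X_means))

    def component(part):
        m = _DUMMY.match(part)
        if m and m.group(1) == focal_var:
            return 1.0 if m.group(2) == focal_level else 0.0
        return means[part]            # assumes the component has its own column

    row = np.array(X_means, dtype=float)   # start from the column means
    for j, name in enumerate(X_names):
        parts = name.split(":")
        if any(p.startswith(focal_var + "[") for p in parts):
            # Focal dummy or interaction with it: product of component values.
            row[j] = np.prod([component(p) for p in parts])
    return row

X_names = ("Intercept", "Income", "Student[Yes]", "Income:Student[Yes]")
X_means = np.array([1.0, 45.0, 0.1, 4.8])   # hypothetical column means
row_yes = reference_row_sketch(X_names, "Student", "Yes", X_means)
row_no = reference_row_sketch(X_names, "Student", "No", X_means)
```

Note that the interaction entry for "Yes" is 45.0 (mean Income times 1), not the empirical mean 4.8 of the interaction column.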
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X_names | tuple[str, ...] | list[str] | Column names from the design matrix. | required |
focal_var | str | Name of the focal categorical variable. | required |
focal_level | str | Level value to set for the focal variable. | required |
X_means | ndarray | Column means for continuous covariate reference values. | required |
set_categoricals | dict[str, str] | None | Optional dict mapping non-focal categorical variable names to specific levels for indicator encoding. When a non-focal categorical’s base_term matches a key, the dummy is set to 1.0 if the level matches, 0.0 otherwise (instead of using the column mean for marginalization). | None |
Returns:
| Type | Description |
|---|---|
ndarray | Reference row, shape (p,). |
z_matrix¶
Sparse Z matrix (random effects design matrix) construction.
Classes:
| Name | Description |
|---|---|
RandomEffectsInfo | Complete random effects specification for lmer/glmer. |
Functions:
| Name | Description |
|---|---|
build_random_effects | Build complete random effects specification. |
build_z_crossed | Build Z matrix for crossed random effects. |
build_z_nested | Build Z matrix for nested random effects. |
build_z_simple | Build Z matrix for single grouping factor. |
Classes¶
RandomEffectsInfo¶
RandomEffectsInfo(Z: sp.csc_matrix, group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, re_structures_list: list[str] | None = None, re_dims_list: list[int] | None = None, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, column_labels: list[str] = list(), term_permutation: NDArray[np.intp] | None = None) -> None
Complete random effects specification for lmer/glmer.
This container holds the Z matrix and all metadata needed for downstream operations (Lambda building, initialization, results).
Attributes:
| Name | Type | Description |
|---|---|---|
Z | csc_matrix | Sparse random effects design matrix, shape (n, q). |
group_ids_list | list[NDArray[intp]] | Group ID arrays for each factor. |
n_groups_list | list[int] | Number of groups per factor. |
group_names | list[str] | Names of grouping factors. |
random_names | list[str] | Names of random effect terms. |
re_structure | str | Overall structure type (intercept/slope/diagonal/nested/crossed). |
re_structures_list | list[str] | None | Per-factor structure types (for mixed). |
X_re | NDArray[float64] | list[NDArray[float64]] | None | Random effects covariates (for slopes). |
column_labels | list[str] | Z column names for output. |
term_permutation | NDArray[intp] | None | Block ordering permutation indices. |
Attributes¶
X_re¶
X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None
Z¶
Z: sp.csc_matrix
column_labels¶
column_labels: list[str] = field(default_factory=list)
group_ids_list¶
group_ids_list: list[NDArray[np.intp]]
group_names¶
group_names: list[str]
n_groups_list¶
n_groups_list: list[int]
random_names¶
random_names: list[str]
re_dims_list¶
re_dims_list: list[int] | None = None
re_structure¶
re_structure: str
re_structures_list¶
re_structures_list: list[str] | None = None
term_permutation¶
term_permutation: NDArray[np.intp] | None = None
Functions¶
build_random_effects¶
build_random_effects(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, re_structures_list: list[str] | None = None, group_levels_list: list[list[str]] | None = None, term_permutation: NDArray[np.intp] | None = None) -> RandomEffectsInfo
Build complete random effects specification.
High-level function that constructs the Z matrix and packages all metadata into a RandomEffectsInfo container ready for lmer/glmer consumption.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
group_ids_list | list[NDArray[intp]] | Group ID arrays for each factor. | required |
n_groups_list | list[int] | Number of groups per factor. | required |
group_names | list[str] | Names of grouping factors. | required |
random_names | list[str] | Names of random effect terms. | required |
re_structure | str | Overall structure type: “intercept” (random intercept only), “slope” (correlated intercept + slopes), “diagonal” (uncorrelated intercept + slopes), “nested” (nested hierarchy), or “crossed” (crossed factors). | required |
X_re | NDArray[float64] | list[NDArray[float64]] | None | Random effects covariates (for slopes). | None |
re_structures_list | list[str] | None | Per-factor structure (for mixed). | None |
group_levels_list | list[list[str]] | None | Level names per factor (for labels). | None |
term_permutation | NDArray[intp] | None | Block ordering permutation. | None |
Returns:
| Type | Description |
|---|---|
RandomEffectsInfo | RandomEffectsInfo with Z matrix and all metadata. |
Examples:
>>> # (Days|Subject) with 18 subjects
>>> group_ids = np.arange(180) // 10
>>> n_groups = 18
>>> X_re = np.column_stack([np.ones(180), np.tile(np.arange(10), 18)])
>>> info = build_random_effects(
... group_ids_list=[group_ids],
... n_groups_list=[n_groups],
... group_names=["Subject"],
... random_names=["Intercept", "Days"],
... re_structure="slope",
... X_re=X_re,
... )
>>> info.Z.shape
(180, 36) # 18 subjects * 2 RE
build_z_crossed¶
build_z_crossed(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None, layouts: list[str] | None = None) -> sp.csc_matrix
Build Z matrix for crossed random effects.
Crossed effects like (1|subject) + (1|item) create independent random effects for each factor. The Z matrix is a horizontal concatenation of Z matrices for each factor.
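For the intercept-only case, the horizontal concatenation amounts to stacking one sparse indicator matrix per factor side by side. The sketch below shows this with SciPy; `indicator_sketch` is a hypothetical helper and random slopes are not covered.

```python
import numpy as np
import scipy.sparse as sp

def indicator_sketch(group_ids, n_groups):
    """Sparse (n, n_groups) indicator matrix with a single 1 per row."""
    n = len(group_ids)
    return sp.csc_matrix(
        (np.ones(n), (np.arange(n), group_ids)), shape=(n, n_groups)
    )

subj_ids = np.array([0, 1, 2, 0, 1, 2])
item_ids = np.array([0, 0, 0, 1, 1, 1])
# Subject columns first, then item columns; 4 item columns are declared even
# though only items 0 and 1 occur in these six observations.
Z = sp.hstack(
    [indicator_sketch(subj_ids, 3), indicator_sketch(item_ids, 4)], format="csc"
)
```

Every row of `Z` then holds exactly two nonzero entries, one in the subject block and one in the item block.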
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
group_ids_list | list[NDArray[intp]] | List of group ID arrays, one per factor. | required |
n_groups_list | list[int] | Number of groups per factor. | required |
X_re_list | list[NDArray[float64] | None] | None | Random effects design per factor, or None for intercepts. | None |
layouts | list[str] | None | Layout per factor. Default: interleaved for all. | None |
Returns:
| Type | Description |
|---|---|
csc_matrix | Sparse Z matrix, shape (n, sum(n_groups_i * n_re_i)). |
Examples:
>>> # (1|subject) + (1|item) with 3 subjects, 4 items
>>> subj_ids = np.array([0, 1, 2, 0, 1, 2])
>>> item_ids = np.array([0, 0, 0, 1, 1, 1])
>>> Z = build_z_crossed(
... group_ids_list=[subj_ids, item_ids],
... n_groups_list=[3, 4]
... )
>>> Z.shape
(6, 7) # 3 subject + 4 item columns
build_z_nested¶
build_z_nested(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None) -> sp.csc_matrix
Build Z matrix for nested random effects.
Nested effects like (1|school/class) create separate random intercepts for each level of the hierarchy. The Z matrix is a horizontal concatenation of Z matrices for each level.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
group_ids_list | list[NDArray[intp]] | List of group ID arrays, ordered [inner, ..., outer]. For (1|school/class): [class_ids, school_ids]. | required |
n_groups_list | list[int] | Number of groups at each level. | required |
X_re_list | list[NDArray[float64] | None] | None | Random effects design per level, or None for intercepts. | None |
Returns:
| Type | Description |
|---|---|
csc_matrix | Sparse Z matrix, shape (n, sum(n_groups_i * n_re_i)). |
Examples:
>>> # (1|school/class) with 2 schools, 4 classes total
>>> class_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
>>> school_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
>>> Z = build_z_nested(
... group_ids_list=[class_ids, school_ids],
... n_groups_list=[4, 2]
... )
>>> Z.shape
(8, 6) # 4 class columns + 2 school columns
build_z_simple¶
build_z_simple(group_ids: NDArray[np.intp], n_groups: int, X_re: NDArray[np.float64] | None = None, layout: Literal['interleaved', 'blocked'] = 'interleaved') -> sp.csc_matrix
Build Z matrix for single grouping factor.
Constructs Z directly in sparse COO format without dense intermediates. For large-scale data (e.g., InstEval with 73k obs x 4k groups), this uses O(n x n_re) memory instead of O(n x n_groups x n_re).
Handles intercept-only, correlated slopes, and uncorrelated slopes by varying the X_re input and layout parameter.
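The sparse construction and the two column layouts can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: `z_simple_sketch` is a hypothetical name, and the column-index formulas are inferred from the layout descriptions in the parameter table.

```python
import numpy as np
import scipy.sparse as sp

def z_simple_sketch(group_ids, n_groups, X_re=None, layout="interleaved"):
    """Build Z from COO triplets, one entry per (observation, random effect)."""
    group_ids = np.asarray(group_ids)
    n = group_ids.shape[0]
    X_re = np.ones((n, 1)) if X_re is None else np.asarray(X_re, dtype=float)
    n_re = X_re.shape[1]
    rows = np.repeat(np.arange(n), n_re)     # observation index of each entry
    re_idx = np.tile(np.arange(n_re), n)     # which random effect each entry is
    grp = np.repeat(group_ids, n_re)         # group of each entry
    if layout == "interleaved":
        cols = grp * n_re + re_idx           # [g1_int, g1_slope, g2_int, ...]
    elif layout == "blocked":
        cols = re_idx * n_groups + grp       # [g1_int, g2_int, ..., g1_slope, ...]
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    data = X_re.ravel()                      # row-major order matches rows/re_idx
    Z = sp.coo_matrix((data, (rows, cols)), shape=(n, n_groups * n_re))
    return Z.tocsc()

group_ids = np.array([0, 0, 1, 1])
X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
Z_int = z_simple_sketch(group_ids, 2)
Z_inter = z_simple_sketch(group_ids, 2, X_re, layout="interleaved")
Z_block = z_simple_sketch(group_ids, 2, X_re, layout="blocked")
```

Only n * n_re triplets are ever stored, which is where the O(n x n_re) memory figure comes from.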
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
group_ids | NDArray[intp] | Array of group assignments, shape (n,), values 0..n_groups-1. | required |
n_groups | int | Total number of groups. | required |
X_re | NDArray[float64] | None | Random effects design matrix, shape (n, n_re). None or a single column of 1s gives an intercept only; multiple columns give an intercept plus slopes. | None |
layout | Literal[‘interleaved’, ‘blocked’] | Column ordering. “interleaved”: [g1_int, g1_slope, g2_int, g2_slope, ...]; “blocked”: [g1_int, g2_int, ..., g1_slope, g2_slope, ...]. | ‘interleaved’ |
Returns:
| Type | Description |
|---|---|
csc_matrix | Sparse Z matrix in CSC format, shape (n, n_groups * n_re). |
Examples:
>>> # Random intercept only
>>> group_ids = np.array([0, 0, 1, 1])
>>> Z = build_z_simple(group_ids, n_groups=2)
>>> Z.shape
(4, 2)
>>> # Random intercept + slope (correlated)
>>> X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="interleaved")
>>> Z.shape
(4, 4)
>>> # Uncorrelated random effects
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="blocked")
>>> Z.shape
(4, 4)