Design matrix construction — coding, naming, reference grids, random effects.

Call chain:

formula.build_design_matrices() -> treatment_coding() / sum_coding() / ... (categorical columns)
marginal.build_reference_grid() -> build_reference_row() (EMM reference grids)
formula.build_random_effects_from_spec() -> build_z_simple() / build_z_nested() / build_z_crossed()

Classes:

Name | Description
-----|------------
DesignColumnInfo | Parsed design matrix column metadata.
RandomEffectsInfo | Complete random effects specification for lmer/glmer.

Functions:

Name | Description
-----|------------
array_to_coding_matrix | Convert user-specified contrasts to a coding matrix for design matrices.
build_random_effects | Build complete random effects specification.
build_reference_design_matrix | Build design matrix for reference grid points.
build_reference_row | Build a single row of the reference design matrix.
build_slope_reference_matrix | Build reference matrices for computing marginal slopes.
build_z_crossed | Build Z matrix for crossed random effects.
build_z_nested | Build Z matrix for nested random effects.
build_z_simple | Build Z matrix for single grouping factor.
convert_coding_to_hypothesis | Convert a coding matrix back to interpretable hypothesis contrasts.
extract_base_term | Extract base term name from column name.
extract_categorical_variables | Find all categorical base variable names from design matrix columns.
extract_level_from_column | Extract level value for a specific focal variable from column name.
helmert_coding | Build Helmert contrast matrix.
helmert_coding_labels | Get column labels for Helmert contrast.
identify_column_type | Identify column type from name (simplified version).
parse_design_column_name | Parse design matrix column name into components.
poly_coding | Build orthogonal polynomial contrast matrix.
poly_coding_labels | Get column labels for polynomial contrast.
sequential_coding | Build sequential (successive differences) contrast matrix.
sequential_coding_labels | Get column labels for sequential contrast.
sum_coding | Build sum (effects) contrast matrix.
sum_coding_labels | Get column labels for sum contrast.
treatment_coding | Build treatment (dummy) contrast matrix.
treatment_coding_labels | Get column labels for treatment contrast.

Modules:

Name | Description
-----|------------
coding | Contrast matrix builders for categorical variable encoding.
names | Design matrix column name parsing and variable type detection.
reference | Reference design matrix (X_ref) construction for marginal effects.
z_matrix | Sparse Z matrix (random effects design matrix) construction.

Classes

DesignColumnInfo

DesignColumnInfo(raw_name: str, base_term: str, level: str | None, column_type: Literal['intercept', 'continuous', 'categorical'], is_interaction: bool = False) -> None

Parsed design matrix column metadata.

Attributes:

Name | Type | Description
-----|------|------------
raw_name | str | Original column name (e.g., “treatment[A]”).
base_term | str | Base variable name without level (e.g., “treatment”).
level | str \| None | Level value for categorical, None for continuous (e.g., “A”).
column_type | Literal['intercept', 'continuous', 'categorical'] | Type classification.
is_interaction | bool | Whether this is an interaction term.

Attributes

base_term
base_term: str
column_type
column_type: Literal['intercept', 'continuous', 'categorical']
is_interaction
is_interaction: bool = False
level
level: str | None
raw_name
raw_name: str

RandomEffectsInfo

RandomEffectsInfo(Z: sp.csc_matrix, group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, re_structures_list: list[str] | None = None, re_dims_list: list[int] | None = None, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, column_labels: list[str] = list(), term_permutation: NDArray[np.intp] | None = None) -> None

Complete random effects specification for lmer/glmer.

This container holds the Z matrix and all metadata needed for downstream operations (Lambda building, initialization, results).

Attributes:

Name | Type | Description
-----|------|------------
Z | csc_matrix | Sparse random effects design matrix, shape (n, q).
group_ids_list | list[NDArray[intp]] | Group ID arrays for each factor.
n_groups_list | list[int] | Number of groups per factor.
group_names | list[str] | Names of grouping factors.
random_names | list[str] | Names of random effect terms.
re_structure | str | Overall structure type (intercept/slope/diagonal/nested/crossed).
re_structures_list | list[str] \| None | Per-factor structure types (for mixed).
X_re | NDArray[float64] \| list[NDArray[float64]] \| None | Random effects covariates (for slopes).
column_labels | list[str] | Z column names for output.
term_permutation | NDArray[intp] \| None | Block ordering permutation indices.

Attributes

X_re
X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None
Z
Z: sp.csc_matrix
column_labels
column_labels: list[str] = field(default_factory=list)
group_ids_list
group_ids_list: list[NDArray[np.intp]]
group_names
group_names: list[str]
n_groups_list
n_groups_list: list[int]
random_names
random_names: list[str]
re_dims_list
re_dims_list: list[int] | None = None
re_structure
re_structure: str
re_structures_list
re_structures_list: list[str] | None = None
term_permutation
term_permutation: NDArray[np.intp] | None = None

Functions

array_to_coding_matrix

array_to_coding_matrix(contrasts: NDArray[np.floating] | list[float] | list[list[float]], n_levels: int, *, normalize: bool = True) -> NDArray[np.float64]

Convert user-specified contrasts to a coding matrix for design matrices.

This function converts “human-readable” contrast specifications (where each row represents a hypothesis like “A vs average(B, C)”) into a coding matrix suitable for use in regression design matrices.

The algorithm uses QR decomposition to auto-complete under-specified contrasts with orthogonal contrasts, following the approach from R’s gmodels::make.contrasts() and pymer4’s con2R().

Parameters:

Name | Type | Description | Default
-----|------|-------------|--------
contrasts | NDArray[floating] \| list[float] \| list[list[float]] | User-specified contrasts. A 1D array/list is a single contrast vector of length n_levels; a 2D array/list holds multiple contrasts with shape (n_contrasts, n_levels). Each row sums to zero for valid contrasts. | required
n_levels | int | Number of factor levels. Must match contrast dimensions. | required
normalize | bool | If True, normalize each contrast vector by its L2 norm before conversion. This puts contrasts in standard-deviation units similar to orthogonal polynomial contrasts. | True

Returns:

Type | Description
-----|------------
NDArray[float64] | Coding matrix of shape (n_levels, n_levels - 1). Each row corresponds to a factor level, each column to a design matrix column.

Examples:

>>> # Single contrast: A vs average(B, C)
>>> array_to_coding_matrix([-1, 0.5, 0.5], n_levels=3)
array([[-0.81649658,  0.        ],
       [ 0.40824829, -0.70710678],
       [ 0.40824829,  0.70710678]])
>>> # Multiple contrasts: A vs B, and (A,B) vs C
>>> array_to_coding_matrix([[-1, 1, 0], [-0.5, -0.5, 1]], n_levels=3)
array([[-0.5       , -0.28867513],
       [ 0.5       , -0.28867513],
       [ 0.        ,  0.57735027]])

Note: The returned matrix has n_levels-1 columns because one degree of freedom is absorbed by the intercept. If you specify fewer than n_levels-1 contrasts, the remaining columns are auto-completed with orthogonal contrasts via QR decomposition.
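
The auto-completion step described in the note can be illustrated with plain NumPy. This is a minimal sketch under stated assumptions, not the library's implementation: the helper name `contrasts_to_coding` is hypothetical, and it assumes the supplied contrasts are linearly independent, sum to zero, and number at most n_levels - 1.

```python
import numpy as np

def contrasts_to_coding(contrasts, n_levels, normalize=True):
    """Illustrative only: hypothesis rows -> (n_levels, n_levels - 1) coding matrix."""
    H = np.atleast_2d(np.asarray(contrasts, dtype=float))
    if H.shape[1] != n_levels:
        raise ValueError("each contrast needs one weight per level")
    if normalize:
        H = H / np.linalg.norm(H, axis=1, keepdims=True)

    # Complete QR of [1, H^T]: the trailing columns of Q are orthogonal to the
    # constant vector and to every user contrast, so they can fill the gap.
    Q, _ = np.linalg.qr(np.column_stack([np.ones(n_levels), H.T]), mode="complete")
    H_full = np.vstack([H, Q[:, 1 + H.shape[0]:].T])

    # Prepend the grand-mean row, invert, and drop the intercept column.
    H_aug = np.vstack([np.full(n_levels, 1.0 / n_levels), H_full])
    return np.linalg.inv(H_aug)[:, 1:]

C = contrasts_to_coding([-1, 0.5, 0.5], n_levels=3)
```

The sign of an auto-completed column depends on the QR routine, so only its magnitude should be compared across implementations.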

build_random_effects

build_random_effects(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, re_structures_list: list[str] | None = None, group_levels_list: list[list[str]] | None = None, term_permutation: NDArray[np.intp] | None = None) -> RandomEffectsInfo

Build complete random effects specification.

High-level function that constructs the Z matrix and packages all metadata into a RandomEffectsInfo container ready for lmer/glmer consumption.

Parameters:

Name | Type | Description | Default
-----|------|-------------|--------
group_ids_list | list[NDArray[intp]] | Group ID arrays for each factor. | required
n_groups_list | list[int] | Number of groups per factor. | required
group_names | list[str] | Names of grouping factors. | required
random_names | list[str] | Names of random effect terms. | required
re_structure | str | Overall structure type, one of: “intercept” (random intercept only), “slope” (correlated intercept + slopes), “diagonal” (uncorrelated intercept + slopes), “nested” (nested hierarchy), “crossed” (crossed factors). | required
X_re | NDArray[float64] \| list[NDArray[float64]] \| None | Random effects covariates (for slopes). | None
re_structures_list | list[str] \| None | Per-factor structure (for mixed). | None
group_levels_list | list[list[str]] \| None | Level names per factor (for labels). | None
term_permutation | NDArray[intp] \| None | Block ordering permutation. | None

Returns:

TypeDescription
RandomEffectsInfoRandomEffectsInfo with Z matrix and all metadata.

Examples:

>>> # (Days|Subject) with 18 subjects
>>> group_ids = np.arange(180) // 10
>>> n_groups = 18
>>> X_re = np.column_stack([np.ones(180), np.tile(np.arange(10), 18)])
>>> info = build_random_effects(
...     group_ids_list=[group_ids],
...     n_groups_list=[n_groups],
...     group_names=["Subject"],
...     random_names=["Intercept", "Days"],
...     re_structure="slope",
...     X_re=X_re,
... )
>>> info.Z.shape
(180, 36)  # 18 subjects * 2 RE

build_reference_design_matrix

build_reference_design_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, levels: list[str], X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build design matrix for reference grid points.

Creates an X_ref matrix with one row per focal variable level. Each row represents a reference point where the focal variable is set to that level and all other covariates are set to their reference values.

Reference value conventions:

Parameters:

NameTypeDescriptionDefault
X_namestuple[str, ...] | list[str]Column names from the design matrix, in order.required
focal_varstrName of the categorical variable to vary across levels.required
levelslist[str]List of levels for the focal variable, defining row order.required
X_meansndarrayColumn means of the original design matrix, shape (p,). Used for continuous covariate reference values.required
set_categoricalsdict[str, str] | NoneOptional dict mapping non-focal categorical variable names to specific levels to pin them at (instead of marginalizing at X_means). E.g. {"Ethnicity": "Asian"} sets the Ethnicity dummies to indicator values for “Asian”.None

Returns:

TypeDescription
ndarrayReference design matrix X_ref, shape (n_levels, p).

Examples:

Compute X_ref for treatment EMMs::

X_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
X_means = np.array([1.0, 2.5, 0.33, 0.33])  # means from data
levels = ["ref", "A", "B"]

X_ref = build_reference_design_matrix(X_names, "treatment", levels, X_means)
# X_ref[0] = [1.0, 2.5, 0.0, 0.0]  # reference level
# X_ref[1] = [1.0, 2.5, 1.0, 0.0]  # level A
# X_ref[2] = [1.0, 2.5, 0.0, 1.0]  # level B

Note: The first level is typically the reference level (omitted from dummy coding), so its row has all 0s for the focal variable dummies.
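
The row layout in the example above can be reproduced with a few lines of NumPy. This is a simplified sketch, not the library's code: the helper `reference_grid` is a hypothetical name, and it only handles plain dummy columns such as "treatment[A]", ignoring interaction columns and the set_categoricals option.

```python
import numpy as np

def reference_grid(X_names, focal_var, levels, X_means):
    """Sketch: one row per focal level; non-focal columns stay at their means."""
    X_ref = np.tile(np.asarray(X_means, dtype=float), (len(levels), 1))
    prefix = focal_var + "["
    for j, name in enumerate(X_names):
        # Only plain dummy columns for the focal variable are overwritten.
        if name.startswith(prefix) and name.endswith("]"):
            column_level = name[len(prefix):-1]
            X_ref[:, j] = [1.0 if lvl == column_level else 0.0 for lvl in levels]
    return X_ref

X_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
X_means = np.array([1.0, 2.5, 0.33, 0.33])
X_ref = reference_grid(X_names, "treatment", ["ref", "A", "B"], X_means)
```

Multiplying such a grid by the fitted coefficient vector gives one predicted value per focal level.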

build_reference_row

build_reference_row(X_names: tuple[str, ...] | list[str], focal_var: str, focal_level: str, X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build a single row of the reference design matrix.

Creates one reference point where the focal variable is set to the specified level and other covariates are at reference values.

For interaction columns involving the focal variable (e.g., Income:Student[Yes] when focal_var="Student"), the value is computed as the product of component values rather than using the empirical mean of the interaction column.

Parameters:

NameTypeDescriptionDefault
X_namestuple[str, ...] | list[str]Column names from the design matrix.required
focal_varstrName of the focal categorical variable.required
focal_levelstrLevel value to set for the focal variable.required
X_meansndarrayColumn means for continuous covariate reference values.required
set_categoricalsdict[str, str] | NoneOptional dict mapping non-focal categorical variable names to specific levels for indicator encoding. When a non-focal categorical’s base_term matches a key, the dummy is set to 1.0 if the level matches, 0.0 otherwise (instead of using the column mean for marginalization).None

Returns:

TypeDescription
ndarrayReference row, shape (p,).

build_slope_reference_matrix

build_slope_reference_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, X_means: np.ndarray, *, delta: float = 1.0) -> tuple[np.ndarray, np.ndarray]

Build reference matrices for computing marginal slopes.

Creates two reference rows: one at the mean and one at mean + delta for the focal continuous variable. The slope is then (y1 - y0) / delta.

For interaction columns involving the focal variable (e.g., x:z when focal_var="x"), the interaction value is properly computed as a product of component values, so the perturbed row reflects the interaction contribution to the slope.

Parameters:

NameTypeDescriptionDefault
X_namestuple[str, ...] | list[str]Column names from the design matrix.required
focal_varstrName of the continuous variable for slope computation.required
X_meansndarrayColumn means of the original design matrix.required
deltafloatStep size for numerical differentiation (default 1.0).1.0

Returns:

Type | Description
-----|------------
tuple[ndarray, ndarray] | Tuple of (X_ref_0, X_ref_1), where X_ref_0 is the reference point at the mean of focal_var and X_ref_1 is the reference point at mean + delta of focal_var.
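
To make the (y1 - y0) / delta computation concrete, the sketch below builds the two rows for a toy model with an x:z interaction. The helper `slope_rows` and the coefficient values are hypothetical; the sketch assumes every component of an interaction also appears as its own column, and it is not the library's implementation.

```python
import numpy as np

def slope_rows(X_names, focal_var, X_means, delta=1.0):
    """Sketch: reference rows at mean(focal) and mean(focal) + delta."""
    means = dict(zip(X_names, np.asarray(X_means, dtype=float)))

    def row(focal_value):
        values = []
        for name in X_names:
            parts = name.split(":")
            if focal_var in parts:
                # Columns involving the focal variable are rebuilt as a
                # product of component values, not taken from the column mean.
                v = 1.0
                for p in parts:
                    v *= focal_value if p == focal_var else means[p]
                values.append(v)
            else:
                values.append(means[name])
        return np.array(values)

    x0 = means[focal_var]
    return row(x0), row(x0 + delta)

X_names = ["Intercept", "x", "z", "x:z"]
X_means = [1.0, 2.0, 3.0, 7.0]           # empirical mean of x:z need not equal 2 * 3
beta = np.array([0.5, 2.0, -1.0, 0.25])  # made-up coefficients

X0, X1 = slope_rows(X_names, "x", X_means, delta=1.0)
slope = (X1 @ beta - X0 @ beta) / 1.0    # 2.0 + 0.25 * mean(z)
```

Because the interaction entry is recomputed as a product, the slope picks up the 0.25 * mean(z) contribution that a row built from column means alone would miss.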

build_z_crossed

build_z_crossed(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None, layouts: list[str] | None = None) -> sp.csc_matrix

Build Z matrix for crossed random effects.

Crossed effects like (1|subject) + (1|item) create independent random effects for each factor. The Z matrix is a horizontal concatenation of Z matrices for each factor.

Parameters:

NameTypeDescriptionDefault
group_ids_listlist[NDArray[intp]]List of group ID arrays, one per factor.required
n_groups_listlist[int]Number of groups per factor.required
X_re_listlist[NDArray[float64] | None] | NoneRandom effects design per factor, or None for intercepts.None
layoutslist[str] | NoneLayout per factor. Default: interleaved for all.None

Returns:

TypeDescription
csc_matrixSparse Z matrix, shape (n, sum(n_groups_i * n_re_i)).

Examples:

>>> # (1|subject) + (1|item) with 3 subjects, 4 items
>>> subj_ids = np.array([0, 1, 2, 0, 1, 2])
>>> item_ids = np.array([0, 0, 0, 1, 1, 1])
>>> Z = build_z_crossed(
...     group_ids_list=[subj_ids, item_ids],
...     n_groups_list=[3, 4]
... )
>>> Z.shape
(6, 7)  # 3 subject + 4 item columns

build_z_nested

build_z_nested(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None) -> sp.csc_matrix

Build Z matrix for nested random effects.

Nested effects like (1|school/class) create separate random intercepts for each level of the hierarchy. The Z matrix is a horizontal concatenation of Z matrices for each level.

Parameters:

Name | Type | Description | Default
-----|------|-------------|--------
group_ids_list | list[NDArray[intp]] | List of group ID arrays, ordered [inner, ..., outer]. For (1\|school/class): [class_ids, school_ids]. | required
n_groups_list | list[int] | Number of groups at each level. | required
X_re_list | list[NDArray[float64] \| None] \| None | Random effects design per level, or None for intercepts. | None

Returns:

TypeDescription
csc_matrixSparse Z matrix, shape (n, sum(n_groups_i * n_re_i)).

Examples:

>>> # (1|school/class) with 2 schools, 4 classes total
>>> class_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
>>> school_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
>>> Z = build_z_nested(
...     group_ids_list=[class_ids, school_ids],
...     n_groups_list=[4, 2]
... )
>>> Z.shape
(8, 6)  # 4 class columns + 2 school columns
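
The horizontal concatenation can be checked with dense indicator matrices. This is a small sketch for intuition only: the helper `indicator_block` is a hypothetical name, and dense arrays are used for readability where the library returns a sparse CSC matrix.

```python
import numpy as np

def indicator_block(ids, n_groups):
    """Dense one-hot block: row i has a 1 in the column of its group."""
    return np.eye(n_groups)[np.asarray(ids)]

class_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
school_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Inner level first, then outer, matching the [inner, ..., outer] ordering.
Z_dense = np.hstack([indicator_block(class_ids, 4), indicator_block(school_ids, 2)])
```

Each row contains exactly two ones: one for the observation's class and one for its school.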

build_z_simple

build_z_simple(group_ids: NDArray[np.intp], n_groups: int, X_re: NDArray[np.float64] | None = None, layout: Literal['interleaved', 'blocked'] = 'interleaved') -> sp.csc_matrix

Build Z matrix for single grouping factor.

Constructs Z directly in sparse COO format without dense intermediates. For large-scale data (e.g., InstEval with 73k obs x 4k groups), this uses O(n x n_re) memory instead of O(n x n_groups x n_re).

Handles intercept-only, correlated slopes, and uncorrelated slopes by varying the X_re input and layout parameter.

Parameters:

Name | Type | Description | Default
-----|------|-------------|--------
group_ids | NDArray[intp] | Array of group assignments, shape (n,), values 0..n_groups-1. | required
n_groups | int | Total number of groups. | required
X_re | NDArray[float64] \| None | Random effects design matrix, shape (n, n_re). None or a column of 1s: intercept only. Multiple columns: intercept + slopes. | None
layout | Literal['interleaved', 'blocked'] | Column ordering. “interleaved”: [g1_int, g1_slope, g2_int, g2_slope, ...]. “blocked”: [g1_int, g2_int, ..., g1_slope, g2_slope, ...]. | 'interleaved'

Returns:

TypeDescription
csc_matrixSparse Z matrix in CSC format, shape (n, n_groups * n_re).

Examples:

>>> # Random intercept only
>>> group_ids = np.array([0, 0, 1, 1])
>>> Z = build_z_simple(group_ids, n_groups=2)
>>> Z.shape
(4, 2)
>>> # Random intercept + slope (correlated)
>>> X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="interleaved")
>>> Z.shape
(4, 4)
>>> # Uncorrelated random effects
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="blocked")
>>> Z.shape
(4, 4)
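
Both layouts give the same shape, so the difference is easiest to see in the column indices. The following dense sketch is an illustration under assumptions, not the library's sparse implementation; the function name `z_dense` is hypothetical.

```python
import numpy as np

def z_dense(group_ids, n_groups, X_re=None, layout="interleaved"):
    """Dense illustration of the two column orderings."""
    group_ids = np.asarray(group_ids)
    n = group_ids.shape[0]
    if X_re is None:
        X_re = np.ones((n, 1))  # intercept only
    n_re = X_re.shape[1]
    Z = np.zeros((n, n_groups * n_re))
    rows = np.arange(n)
    for k in range(n_re):
        if layout == "interleaved":
            cols = group_ids * n_re + k      # [g1_int, g1_slope, g2_int, ...]
        elif layout == "blocked":
            cols = k * n_groups + group_ids  # [g1_int, g2_int, ..., g1_slope, ...]
        else:
            raise ValueError(f"unknown layout: {layout}")
        Z[rows, cols] = X_re[:, k]
    return Z

group_ids = np.array([0, 0, 1, 1])
X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
Z_inter = z_dense(group_ids, 2, X_re, layout="interleaved")
Z_block = z_dense(group_ids, 2, X_re, layout="blocked")
```

The two results hold the same values and differ only by a permutation of columns.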

convert_coding_to_hypothesis

convert_coding_to_hypothesis(coding_matrix: NDArray[np.float64]) -> NDArray[np.float64]

Convert a coding matrix back to interpretable hypothesis contrasts.

This is the inverse of array_to_coding_matrix. Given a coding matrix (n_levels, n_levels-1), returns the hypothesis matrix where each row represents the linear combination of factor levels being compared.

Parameters:

NameTypeDescriptionDefault
coding_matrixNDArray[float64]Coding matrix of shape (n_levels, n_levels - 1).required

Returns:

Type | Description
-----|------------
NDArray[float64] | Hypothesis matrix of shape (n_levels - 1, n_levels). Each row represents a contrast hypothesis (coefficients for factor levels).

Examples:

>>> cm = treatment_coding(['A', 'B', 'C'])
>>> convert_coding_to_hypothesis(cm)
array([[-1.,  1.,  0.],
       [-1.,  0.,  1.]])
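
One common way to obtain a hypothesis matrix is to prepend an intercept column to the coding matrix, invert, and drop the intercept row. The sketch below assumes that approach, which reproduces the output shown above but is not confirmed to be the library's method; the helper name `coding_to_hypothesis` is hypothetical.

```python
import numpy as np

def coding_to_hypothesis(coding_matrix):
    """Sketch: invert [1 | C] and drop the intercept row."""
    C = np.asarray(coding_matrix, dtype=float)
    full = np.column_stack([np.ones(C.shape[0]), C])
    return np.linalg.inv(full)[1:, :]

treatment = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
H_treatment = coding_to_hypothesis(treatment)   # rows: B - A, C - A

sum_to_zero = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
H_sum = coding_to_hypothesis(sum_to_zero)       # rows: A - grand mean, B - grand mean
```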

extract_base_term

extract_base_term(name: str) -> str

Extract base term name from column name.

For categorical variables, strips the level suffix. For interactions, extracts base terms without levels.

Parameters:

NameTypeDescriptionDefault
namestrColumn name from design matrix.required

Returns:

TypeDescription
strBase term name.

Examples:

>>> extract_base_term("treatment[A]")
'treatment'
>>> extract_base_term("x")
'x'
>>> extract_base_term("treatment[A]:x")
'treatment:x'

extract_categorical_variables

extract_categorical_variables(X_names: tuple[str, ...] | list[str]) -> set[str]

Find all categorical base variable names from design matrix columns.

Scans column names for bracket patterns and extracts unique base terms.

Parameters:

NameTypeDescriptionDefault
X_namestuple[str, ...] | list[str]Column names from design matrix.required

Returns:

TypeDescription
set[str]Set of categorical variable names (base terms, no levels).

Examples:

>>> extract_categorical_variables(["Intercept", "x", "treatment[A]", "treatment[B]"])
{'treatment'}

extract_level_from_column

extract_level_from_column(name: str, focal_var: str) -> str | None

Extract level value for a specific focal variable from column name.

Used when building reference grids to identify which column corresponds to which level of the focal variable.

Parameters:

NameTypeDescriptionDefault
namestrColumn name (e.g., “treatment[A]” or “treatment[A]:x”).required
focal_varstrThe focal variable name (e.g., “treatment”).required

Returns:

TypeDescription
str | NoneLevel value if column is for focal_var, else None.

Examples:

>>> extract_level_from_column("treatment[A]", "treatment")
'A'
>>> extract_level_from_column("treatment[A]:x", "treatment")
'A'
>>> extract_level_from_column("x", "treatment")
None
>>> extract_level_from_column("group[1]", "treatment")
None
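
A minimal way to mirror these results is to split the column name on ":" and match each part against a "variable[level]" pattern. The helper `level_for` is a hypothetical name, and the sketch assumes level names contain no ":" characters.

```python
import re

def level_for(name, focal_var):
    """Sketch: return the focal variable's level in a column name, else None."""
    for part in name.split(":"):
        match = re.fullmatch(r"(.+?)\[(.*)\]", part)
        if match and match.group(1) == focal_var:
            return match.group(2)
    return None

results = [
    level_for("treatment[A]", "treatment"),
    level_for("treatment[A]:x", "treatment"),
    level_for("x", "treatment"),
    level_for("group[1]", "treatment"),
]
```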

helmert_coding

helmert_coding(levels: list[str]) -> NDArray[np.float64]

Build Helmert contrast matrix.

Helmert coding compares each level to the mean of all previous levels. Column j contrasts level j+1 against the average of levels 0..j.

This is equivalent to R’s contr.helmert() (scaled to unit contrasts).

Matrix structure for 4 levels::

Contrast   | A       | B       | C       | D
-----------|---------|---------|---------|--------
B vs A     | -1/2    |  1/2    |  0      |  0
C vs A,B   | -1/3    | -1/3    |  2/3    |  0
D vs A,B,C | -1/4    | -1/4    | -1/4    |  3/4

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names. Must have >= 2 elements.required

Returns:

Type | Description
-----|------------
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> helmert_coding(['A', 'B', 'C'])
array([[-0.5       , -0.33333333],
       [ 0.5       , -0.33333333],
       [ 0.        ,  0.66666667]])
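
The scaled Helmert pattern in the table above can be generated directly. This is an illustrative sketch (the function name `helmert` is hypothetical) that takes the number of levels instead of the level names.

```python
import numpy as np

def helmert(n_levels):
    """Sketch: column j contrasts level j+1 with the mean of levels 0..j."""
    M = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        M[: j + 1, j] = -1.0 / (j + 2)      # previous levels share the negative weight
        M[j + 1, j] = (j + 1.0) / (j + 2)   # the level being compared
    return M

H3 = helmert(3)
H4 = helmert(4)
```

Every column sums to zero, and the gap between the positive entry and the negative entries is exactly 1, which is the unit scaling mentioned above.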

helmert_coding_labels

helmert_coding_labels(levels: list[str]) -> list[str]

Get column labels for Helmert contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required

Returns:

TypeDescription
list[str]List of labels like [‘B vs prev’, ‘C vs prev’, ...].

Examples:

>>> helmert_coding_labels(['A', 'B', 'C'])
['B vs prev', 'C vs prev']

identify_column_type

identify_column_type(name: str) -> Literal['intercept', 'continuous', 'categorical']

Identify column type from name (simplified version).

This is a lightweight alternative to parse_design_column_name() when only the type is needed.

Parameters:

NameTypeDescriptionDefault
namestrColumn name from design matrix.required

Returns:

TypeDescription
Literal[‘intercept’, ‘continuous’, ‘categorical’]Column type as string literal.

parse_design_column_name

parse_design_column_name(name: str) -> DesignColumnInfo

Parse design matrix column name into components.

Handles standard R/formula naming conventions:

Parameters:

NameTypeDescriptionDefault
namestrColumn name from design matrix.required

Returns:

TypeDescription
DesignColumnInfoDesignColumnInfo with parsed components.

Examples:

>>> parse_design_column_name("Intercept")
DesignColumnInfo(raw_name='Intercept', base_term='Intercept',
                 level=None, column_type='intercept', is_interaction=False)
>>> parse_design_column_name("x")
DesignColumnInfo(raw_name='x', base_term='x',
                 level=None, column_type='continuous', is_interaction=False)
>>> parse_design_column_name("treatment[A]")
DesignColumnInfo(raw_name='treatment[A]', base_term='treatment',
                 level='A', column_type='categorical', is_interaction=False)
>>> parse_design_column_name("treatment[A]:x")
DesignColumnInfo(raw_name='treatment[A]:x', base_term='treatment:x',
                 level='A', column_type='categorical', is_interaction=True)

poly_coding

poly_coding(levels: list[str]) -> NDArray[np.float64]

Build orthogonal polynomial contrast matrix.

Polynomial coding creates orthogonal contrasts representing linear, quadratic, cubic, etc. trends across ordered factor levels. This is equivalent to R’s contr.poly() function.

The contrasts are orthonormal (orthogonal and unit length), making them suitable for testing polynomial trends in ordered categorical variables.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names. The order determines the polynomial evaluation points.required

Returns:

Type | Description
-----|------------
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Column 0 is linear (.L), column 1 is quadratic (.Q), etc.

Examples:

>>> poly_coding(['low', 'medium', 'high'])
array([[-0.70710678,  0.40824829],
       [ 0.        , -0.81649658],
       [ 0.70710678,  0.40824829]])
>>> poly_coding(['A', 'B', 'C', 'D'])
array([[-0.67082039,  0.5       , -0.2236068 ],
       [-0.2236068 , -0.5       ,  0.67082039],
       [ 0.2236068 , -0.5       , -0.67082039],
       [ 0.67082039,  0.5       ,  0.2236068 ]])

Note: Level names are not used in computation - only the count and order matter. The polynomial is evaluated at equally-spaced points 1, 2, ..., n.
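
Orthonormal polynomial contrasts of this kind can be obtained from a QR decomposition of a Vandermonde matrix on the equally spaced points. The sketch below follows that standard construction and reproduces the values shown above; it is not claimed to be the library's code, and the name `poly_contrasts` is hypothetical.

```python
import numpy as np

def poly_contrasts(n_levels):
    """Sketch: orthonormal polynomial contrasts on points 1..n_levels."""
    x = np.arange(1, n_levels + 1, dtype=float)
    x -= x.mean()
    V = np.vander(x, n_levels, increasing=True)  # columns x^0, x^1, ..., x^(n-1)
    Q, R = np.linalg.qr(V)
    Q = Q * np.sign(np.diag(R))  # fix signs so each leading coefficient is positive
    return Q[:, 1:]              # drop the constant column

P3 = poly_contrasts(3)
P4 = poly_contrasts(4)
```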

poly_coding_labels

poly_coding_labels(levels: list[str]) -> list[str]

Get column labels for polynomial contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required

Returns:

TypeDescription
list[str]List of polynomial degree labels: ['.L', '.Q', '.C', '^4', '^5', ...].

sequential_coding

sequential_coding(levels: list[str]) -> NDArray[np.float64]

Build sequential (successive differences) contrast matrix.

Sequential coding compares each level to the previous level, producing contrasts that capture successive differences. This is equivalent to R’s MASS::contr.sdif() function.

The matrix is constructed so that each column j represents the difference between level j+1 and level j. The resulting coefficients in a regression model estimate these successive differences directly.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required

Returns:

Type | Description
-----|------------
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Column j represents the contrast: level[j+1] - level[j].

Examples:

>>> sequential_coding(['A', 'B', 'C'])
array([[-0.66666667, -0.33333333],
       [ 0.33333333, -0.33333333],
       [ 0.33333333,  0.66666667]])
>>> sequential_coding(['low', 'medium', 'high', 'very_high'])
array([[-0.75, -0.5 , -0.25],
       [ 0.25, -0.5 , -0.25],
       [ 0.25,  0.5 , -0.25],
       [ 0.25,  0.5 ,  0.75]])

Note: This coding is most meaningful for ordered factors where you want to estimate the “step” from one level to the next. Unlike polynomial contrasts, it does not assume equally-spaced levels.

The matrix structure ensures that multiplying by the coefficient vector gives interpretable successive differences.
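
The successive-difference property can be verified numerically. The sketch below builds the matrix from its closed form and checks that consecutive rows differ by exactly one unit in one column; the function name `sdif` is hypothetical and the code is an illustration, not the library's implementation.

```python
import numpy as np

def sdif(n_levels):
    """Sketch: successive-difference contrasts from their closed form."""
    n = n_levels
    M = np.empty((n, n - 1))
    for j in range(1, n):
        M[:j, j - 1] = -(n - j) / n   # levels up to the step
        M[j:, j - 1] = j / n          # levels after the step
    return M

S3 = sdif(3)
S4 = sdif(4)
row_steps = np.diff(S4, axis=0)   # row[i+1] - row[i]
```

Because the row differences form an identity matrix, the fitted mean of level j+1 minus that of level j equals coefficient j.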

sequential_coding_labels

sequential_coding_labels(levels: list[str]) -> list[str]

Get column labels for sequential contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required

Returns:

TypeDescription
list[str]List of successive difference labels like [‘B-A’, ‘C-B’, ...].

Examples:

>>> sequential_coding_labels(['A', 'B', 'C'])
['B-A', 'C-B']
>>> sequential_coding_labels(['low', 'medium', 'high'])
['medium-low', 'high-medium']

sum_coding

sum_coding(levels: list[str], omit: str | None = None) -> NDArray[np.float64]

Build sum (effects) contrast matrix.

Sum coding sets the omitted level to all -1s, and each other level gets a one-hot encoded row. This centers the effects around zero, making coefficients interpretable as deviations from the grand mean.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required
omitstr | NoneLevel to omit (gets -1s). Defaults to last level.None

Returns:

Type | Description
-----|------------
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> sum_coding(['A', 'B', 'C'])
array([[ 1.,  0.],
       [ 0.,  1.],
       [-1., -1.]])
>>> sum_coding(['A', 'B', 'C'], omit='A')
array([[-1., -1.],
       [ 1.,  0.],
       [ 0.,  1.]])
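
The "deviation from the grand mean" interpretation can be seen by solving for the coefficients that reproduce a set of level means. This is a sketch with made-up means and a hypothetical helper (`sum_contrasts`), not the library's implementation.

```python
import numpy as np

def sum_contrasts(levels, omit=None):
    """Sketch: identity columns for kept levels, -1 row for the omitted level."""
    omit = levels[-1] if omit is None else omit
    k = levels.index(omit)
    C = np.delete(np.eye(len(levels)), k, axis=1)
    C[k, :] = -1.0
    return C

levels = ["A", "B", "C"]
level_means = np.array([1.0, 2.0, 6.0])   # hypothetical cell means, grand mean 3.0

X = np.column_stack([np.ones(3), sum_contrasts(levels)])
beta = np.linalg.solve(X, level_means)    # [grand mean, A - grand mean, B - grand mean]
```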

sum_coding_labels

sum_coding_labels(levels: list[str], omit: str | None = None) -> list[str]

Get column labels for sum contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required
omitstr | NoneLevel to omit. Defaults to last level.None

Returns:

TypeDescription
list[str]List of non-omitted level names (column labels).

treatment_coding

treatment_coding(levels: list[str], reference: str | None = None) -> NDArray[np.float64]

Build treatment (dummy) contrast matrix.

Treatment coding sets the reference level to all zeros, and each other level gets a one-hot encoded row. This is the most common coding for regression models with an intercept.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required
referencestr | NoneReference level name. Defaults to first level.None

Returns:

Type | Description
-----|------------
NDArray[float64] | Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> treatment_coding(['A', 'B', 'C'])
array([[0., 0.],
       [1., 0.],
       [0., 1.]])
>>> treatment_coding(['A', 'B', 'C'], reference='B')
array([[1., 0.],
       [0., 0.],
       [0., 1.]])
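
Treatment coding amounts to an identity matrix with the reference level's column removed. The sketch below (hypothetical helper `treatment_contrasts`, made-up level means) also shows that the resulting coefficients are differences from the reference level.

```python
import numpy as np

def treatment_contrasts(levels, reference=None):
    """Sketch: identity matrix with the reference level's column dropped."""
    reference = levels[0] if reference is None else reference
    return np.delete(np.eye(len(levels)), levels.index(reference), axis=1)

levels = ["A", "B", "C"]
level_means = np.array([1.0, 2.0, 6.0])   # hypothetical cell means

X = np.column_stack([np.ones(3), treatment_contrasts(levels)])
beta = np.linalg.solve(X, level_means)    # [mean of A, B - A, C - A]
```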

treatment_coding_labels

treatment_coding_labels(levels: list[str], reference: str | None = None) -> list[str]

Get column labels for treatment contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required
referencestr | NoneReference level name. Defaults to first level.None

Returns:

TypeDescription
list[str]List of non-reference level names (column labels).

Modules

coding

Contrast matrix builders for categorical variable encoding.

This module provides functions to create contrast matrices for encoding categorical variables in design matrices. These are distinct from the EMM contrast matrices in

Key concept: A contrast matrix maps k categorical levels to k-1 columns in the design matrix (assuming an intercept absorbs one degree of freedom).

Examples:

>>> from coding import array_to_coding_matrix, poly_coding, sum_coding, treatment_coding
>>> treatment_coding(['A', 'B', 'C'])
array([[0., 0.],
       [1., 0.],
       [0., 1.]])
>>> sum_coding(['A', 'B', 'C'])
array([[ 1.,  0.],
       [ 0.,  1.],
       [-1., -1.]])
>>> poly_coding(['A', 'B', 'C'])  # Linear and quadratic trends
array([[-0.707...,  0.408...],
       [ 0.   ..., -0.816...],
       [ 0.707...,  0.408...]])
>>> # Custom contrast: A vs average(B, C)
>>> array_to_coding_matrix([[-1, 0.5, 0.5]], n_levels=3)
array([[-0.816..., ...],
       [ 0.408..., ...],
       [ 0.408..., ...]])

Functions:

Name | Description
-----|------------
array_to_coding_matrix | Convert user-specified contrasts to a coding matrix for design matrices.
convert_coding_to_hypothesis | Convert a coding matrix back to interpretable hypothesis contrasts.
helmert_coding | Build Helmert contrast matrix.
helmert_coding_labels | Get column labels for Helmert contrast.
poly_coding | Build orthogonal polynomial contrast matrix.
poly_coding_labels | Get column labels for polynomial contrast.
sequential_coding | Build sequential (successive differences) contrast matrix.
sequential_coding_labels | Get column labels for sequential contrast.
sum_coding | Build sum (effects) contrast matrix.
sum_coding_labels | Get column labels for sum contrast.
treatment_coding | Build treatment (dummy) contrast matrix.
treatment_coding_labels | Get column labels for treatment contrast.

Functions

array_to_coding_matrix
array_to_coding_matrix(contrasts: NDArray[np.floating] | list[float] | list[list[float]], n_levels: int, *, normalize: bool = True) -> NDArray[np.float64]

Convert user-specified contrasts to a coding matrix for design matrices.

This function converts “human-readable” contrast specifications (where each row represents a hypothesis like “A vs average(B, C)”) into a coding matrix suitable for use in regression design matrices.

The algorithm uses QR decomposition to auto-complete under-specified contrasts with orthogonal contrasts, following the approach from R’s gmodels::make.contrasts() and pymer4’s con2R().

Parameters:

Name | Type | Description | Default
-----|------|-------------|--------
contrasts | NDArray[floating] \| list[float] \| list[list[float]] | User-specified contrasts. A 1D array/list is a single contrast vector of length n_levels; a 2D array/list holds multiple contrasts with shape (n_contrasts, n_levels). Each row sums to zero for valid contrasts. | required
n_levels | int | Number of factor levels. Must match contrast dimensions. | required
normalize | bool | If True, normalize each contrast vector by its L2 norm before conversion. This puts contrasts in standard-deviation units similar to orthogonal polynomial contrasts. | True

Returns:

TypeDescription
NDArray[float64]Coding matrix of shape (n_levels, n_levels - 1). Each row corresponds to a factor level, each column to a design matrix column.

Examples:

>>> # Single contrast: A vs average(B, C)
>>> array_to_coding_matrix([-1, 0.5, 0.5], n_levels=3)
array([[-0.81649658,  0.        ],
       [ 0.40824829, -0.70710678],
       [ 0.40824829,  0.70710678]])
>>> # Multiple contrasts: A vs B, and (A,B) vs C
>>> array_to_coding_matrix([[-1, 1, 0], [-0.5, -0.5, 1]], n_levels=3)
array([[-0.5       , -0.28867513],
       [ 0.5       , -0.28867513],
       [ 0.        ,  0.57735027]])

Note: The returned matrix has n_levels-1 columns because one degree of freedom is absorbed by the intercept. If you specify fewer than n_levels-1 contrasts, the remaining columns are auto-completed with orthogonal contrasts via QR decomposition.
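
The auto-completion step described in the note can be sketched with NumPy. This is a minimal illustration, not the library's implementation: `complete_contrasts` is a hypothetical helper, it assumes the supplied contrasts sum to zero and are linearly independent, and it skips any further transformation the library may apply to non-orthogonal contrasts.

```python
import numpy as np

def complete_contrasts(contrasts, n_levels):
    """Hypothetical helper: pad user contrasts to n_levels - 1 columns via QR."""
    C = np.atleast_2d(np.asarray(contrasts, dtype=float))
    if C.shape[1] != n_levels:
        raise ValueError("each contrast needs one weight per level")
    # Normalize each contrast (row) to unit L2 norm.
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    # Columns: the constant vector (intercept) followed by the user contrasts.
    M = np.column_stack([np.ones(n_levels), C.T])
    # A complete QR yields an orthonormal basis of R^n; the trailing columns
    # are orthogonal to every column of M, so they also sum to zero.
    Q, _ = np.linalg.qr(M, mode="complete")
    extra = Q[:, M.shape[1]:]
    return np.column_stack([C.T, extra])

coding = complete_contrasts([-1, 0.5, 0.5], n_levels=3)
print(coding.shape)  # one row per level, n_levels - 1 columns
```

The sign of an auto-completed column is arbitrary in this sketch, so it may differ from the library's output by a factor of -1.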

convert_coding_to_hypothesis
convert_coding_to_hypothesis(coding_matrix: NDArray[np.float64]) -> NDArray[np.float64]

Convert a coding matrix back to interpretable hypothesis contrasts.

This is the inverse of array_to_coding_matrix. Given a coding matrix (n_levels, n_levels-1), returns the hypothesis matrix where each row represents the linear combination of factor levels being compared.

Parameters:

NameTypeDescriptionDefault
coding_matrixNDArray[float64]Coding matrix of shape (n_levels, n_levels - 1).required

Returns:

TypeDescription
NDArray[float64]Hypothesis matrix of shape (n_levels - 1, n_levels). Each row represents a contrast hypothesis (coefficients for factor levels).

Examples:

>>> cm = treatment_coding(['A', 'B', 'C'])
>>> convert_coding_to_hypothesis(cm)
array([[-1.,  1.,  0.],
       [-1.,  0.,  1.]])
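
One common way to recover hypothesis weights from a coding matrix is to prepend an intercept column, invert the resulting square matrix, and drop the intercept row. The sketch below reproduces the treatment-coding result shown above; `coding_to_hypothesis` is a hypothetical name, and whether the library uses exactly this route is an assumption.

```python
import numpy as np

def coding_to_hypothesis(coding):
    """Hypothetical helper: invert [1 | coding] and drop the intercept row."""
    coding = np.asarray(coding, dtype=float)
    n_levels = coding.shape[0]
    full = np.column_stack([np.ones(n_levels), coding])  # (n_levels, n_levels)
    # Row 0 of the inverse defines the intercept; the remaining rows are the
    # level weights behind each regression coefficient.
    return np.linalg.inv(full)[1:, :]

treatment = np.array([[0.0, 0.0],
                      [1.0, 0.0],
                      [0.0, 1.0]])
hyp = coding_to_hypothesis(treatment)
print(hyp)
```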
helmert_coding
helmert_coding(levels: list[str]) -> NDArray[np.float64]

Build Helmert contrast matrix.

Helmert coding compares each level to the mean of all previous levels. Column j contrasts level j+1 against the average of levels 0..j.

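This is equivalent to R’s contr.helmert() with column j (counting from 0) divided by j + 2, so that each coefficient directly estimates the difference between a level's mean and the mean of the previous levels.

Matrix structure for 4 levels, written with one contrast per table row (each table row is one column of the returned matrix):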

Contrast   | A       | B       | C       | D
-----------|---------|---------|---------|--------
B vs A     | -1/2    |  1/2    |  0      |  0
C vs A,B   | -1/3    | -1/3    |  2/3    |  0
D vs A,B,C | -1/4    | -1/4    | -1/4    |  3/4

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names. Must have >= 2 elements.required

Returns:

TypeDescription
NDArray[float64]Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> helmert_coding(['A', 'B', 'C'])
array([[-0.5       , -0.33333333],
       [ 0.5       , -0.33333333],
       [ 0.        ,  0.66666667]])
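
The column pattern in the table above can be written out directly. This is a small sketch under the scaling shown in the table, with `helmert_sketch` as a hypothetical stand-in that takes only the number of levels.

```python
import numpy as np

def helmert_sketch(n_levels):
    """Hypothetical helper: scaled Helmert columns as in the table above."""
    C = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        # Levels 0..j share the negative weight; level j+1 gets the positive one.
        C[: j + 1, j] = -1.0 / (j + 2)
        C[j + 1, j] = (j + 1) / (j + 2)
    return C

print(helmert_sketch(3))
```

Each column sums to zero, and the gap between the positive weight and the shared negative weight is always 1.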
helmert_coding_labels
helmert_coding_labels(levels: list[str]) -> list[str]

Get column labels for Helmert contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required

Returns:

TypeDescription
list[str]List of labels like ['B vs prev', 'C vs prev', ...].

Examples:

>>> helmert_coding_labels(['A', 'B', 'C'])
['B vs prev', 'C vs prev']
poly_coding
poly_coding(levels: list[str]) -> NDArray[np.float64]

Build orthogonal polynomial contrast matrix.

Polynomial coding creates orthogonal contrasts representing linear, quadratic, cubic, etc. trends across ordered factor levels. This is equivalent to R’s contr.poly() function.

The contrasts are orthonormal (orthogonal and unit length), making them suitable for testing polynomial trends in ordered categorical variables.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names. The order determines the polynomial evaluation points.required

Returns:

TypeDescription
NDArray[float64]Contrast matrix of shape (n_levels, n_levels - 1). Column 0 is linear (.L), column 1 is quadratic (.Q), etc.

Examples:

>>> poly_coding(['low', 'medium', 'high'])
array([[-0.70710678,  0.40824829],
       [ 0.        , -0.81649658],
       [ 0.70710678,  0.40824829]])
>>> poly_coding(['A', 'B', 'C', 'D'])
array([[-0.67082039,  0.5       , -0.2236068 ],
       [-0.2236068 , -0.5       ,  0.67082039],
       [ 0.2236068 , -0.5       , -0.67082039],
       [ 0.67082039,  0.5       ,  0.2236068 ]])

Note: Level names are not used in computation - only the count and order matter. The polynomial is evaluated at equally-spaced points 1, 2, ..., n.
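
Orthonormal polynomial contrasts of this kind can be obtained by orthogonalizing the powers of the centered points 1, 2, ..., n. The sketch below uses a QR decomposition for that; `poly_sketch` is a hypothetical helper, and the sign convention (positive leading coefficient) is chosen to match the outputs shown above rather than taken from the library's source.

```python
import numpy as np

def poly_sketch(n_levels):
    """Hypothetical helper: orthonormal polynomial contrasts via QR."""
    x = np.arange(1, n_levels + 1, dtype=float)
    x -= x.mean()                                  # center the equally spaced points
    V = np.vander(x, N=n_levels, increasing=True)  # columns: 1, x, x^2, ...
    Q, R = np.linalg.qr(V)
    # Flip columns so every polynomial has a positive leading coefficient.
    Q = Q * np.sign(np.diag(R))
    return Q[:, 1:]                                # drop the constant column

P = poly_sketch(4)
print(np.round(P.T @ P, 10))  # identity matrix: columns are orthonormal
```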

poly_coding_labels
poly_coding_labels(levels: list[str]) -> list[str]

Get column labels for polynomial contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required

Returns:

TypeDescription
list[str]List of polynomial degree labels: ['.L', '.Q', '.C', '^4', '^5', ...].
sequential_coding
sequential_coding(levels: list[str]) -> NDArray[np.float64]

Build sequential (successive differences) contrast matrix.

Sequential coding compares each level to the previous level, producing contrasts that capture successive differences. This is equivalent to R’s MASS::contr.sdif() function.

The matrix is constructed so that each column j represents the difference between level j+1 and level j. The resulting coefficients in a regression model estimate these successive differences directly.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required

Returns:

TypeDescription
NDArray[float64]Contrast matrix of shape (n_levels, n_levels - 1). Column j represents the contrast: level[j+1] - level[j].

Examples:

>>> sequential_coding(['A', 'B', 'C'])
array([[-0.66666667, -0.33333333],
       [ 0.33333333, -0.33333333],
       [ 0.33333333,  0.66666667]])
>>> sequential_coding(['low', 'medium', 'high', 'very_high'])
array([[-0.75, -0.5 , -0.25],
       [ 0.25, -0.5 , -0.25],
       [ 0.25,  0.5 , -0.25],
       [ 0.25,  0.5 ,  0.75]])

Note: This coding is most meaningful for ordered factors where you want to estimate the “step” from one level to the next. Unlike polynomial contrasts, it does not assume equally-spaced levels.

The matrix structure ensures that multiplying by the coefficient vector gives interpretable successive differences.
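
To see why the coefficients come out as successive differences, the sketch below rebuilds the matrix from its column pattern and solves for the coefficients that reproduce a set of made-up level means. `sequential_sketch` and the means are illustrative assumptions, not part of the library.

```python
import numpy as np

def sequential_sketch(n_levels):
    """Hypothetical helper: successive-difference coding columns."""
    C = np.empty((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        k = j + 1                                # number of levels below this step
        C[:k, j] = -(n_levels - k) / n_levels    # levels before the step
        C[k:, j] = k / n_levels                  # levels at or after the step
    return C

C = sequential_sketch(3)
mu = np.array([10.0, 12.0, 17.0])                # made-up level means
X = np.column_stack([np.ones(3), C])             # intercept + coded columns
beta = np.linalg.solve(X, mu)
print(beta)  # intercept is the mean of the level means, then B-A and C-B
```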

sequential_coding_labels
sequential_coding_labels(levels: list[str]) -> list[str]

Get column labels for sequential contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required

Returns:

TypeDescription
list[str]List of successive difference labels like ['B-A', 'C-B', ...].

Examples:

>>> sequential_coding_labels(['A', 'B', 'C'])
['B-A', 'C-B']
>>> sequential_coding_labels(['low', 'medium', 'high'])
['medium-low', 'high-medium']
sum_coding
sum_coding(levels: list[str], omit: str | None = None) -> NDArray[np.float64]

Build sum (effects) contrast matrix.

Sum coding sets the omitted level to all -1s, and each other level gets a one-hot encoded row. This centers the effects around zero, making coefficients interpretable as deviations from the grand mean.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required
omitstr | NoneLevel to omit (gets -1s). Defaults to last level.None

Returns:

TypeDescription
NDArray[float64]Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> sum_coding(['A', 'B', 'C'])
array([[ 1.,  0.],
       [ 0.,  1.],
       [-1., -1.]])
>>> sum_coding(['A', 'B', 'C'], omit='A')
array([[-1., -1.],
       [ 1.,  0.],
       [ 0.,  1.]])
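
The rule described above (one-hot rows for kept levels, a row of -1s for the omitted level) is short enough to sketch. `sum_sketch` is a hypothetical helper that also returns the kept level names, which correspond to the column labels.

```python
import numpy as np

def sum_sketch(levels, omit=None):
    """Hypothetical helper: sum (effects) coding with a configurable omitted level."""
    omit = levels[-1] if omit is None else omit
    if omit not in levels:
        raise ValueError(f"unknown level: {omit!r}")
    kept = [lev for lev in levels if lev != omit]
    C = np.zeros((len(levels), len(kept)))
    for i, lev in enumerate(levels):
        if lev == omit:
            C[i, :] = -1.0                  # omitted level: all -1s
        else:
            C[i, kept.index(lev)] = 1.0     # kept level: one-hot row
    return C, kept

C_default, labels_default = sum_sketch(["A", "B", "C"])
C_omit_a, labels_omit_a = sum_sketch(["A", "B", "C"], omit="A")
```

Every column sums to zero, which is what centers the coefficients around the mean of the level means.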
sum_coding_labels
sum_coding_labels(levels: list[str], omit: str | None = None) -> list[str]

Get column labels for sum contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required
omitstr | NoneLevel to omit. Defaults to last level.None

Returns:

TypeDescription
list[str]List of non-omitted level names (column labels).
treatment_coding
treatment_coding(levels: list[str], reference: str | None = None) -> NDArray[np.float64]

Build treatment (dummy) contrast matrix.

Treatment coding sets the reference level to all zeros, and each other level gets a one-hot encoded row. This is the most common coding for regression models with an intercept.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required
referencestr | NoneReference level name. Defaults to first level.None

Returns:

TypeDescription
NDArray[float64]Contrast matrix of shape (n_levels, n_levels - 1). Row order matches input levels order.

Examples:

>>> treatment_coding(['A', 'B', 'C'])
array([[0., 0.],
       [1., 0.],
       [0., 1.]])
>>> treatment_coding(['A', 'B', 'C'], reference='B')
array([[1., 0.],
       [0., 0.],
       [0., 1.]])
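
A coding matrix becomes design-matrix columns by looking up one row per observation. The sketch below shows that step with made-up data; the `treatment[B]` naming follows the column names used elsewhere on this page, and how `build_design_matrices()` performs the expansion internally is an assumption.

```python
import numpy as np

levels = ["A", "B", "C"]
# Treatment coding with reference "A", as in the first example above.
C = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
labels = ["B", "C"]                       # non-reference level names

data = ["B", "A", "C", "B"]               # made-up observations
codes = np.array([levels.index(v) for v in data])

X_cat = C[codes]                          # one coded row per observation
X = np.column_stack([np.ones(len(data)), X_cat])
names = ["Intercept"] + [f"treatment[{lab}]" for lab in labels]
print(names)
print(X)
```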
treatment_coding_labels
treatment_coding_labels(levels: list[str], reference: str | None = None) -> list[str]

Get column labels for treatment contrast.

Parameters:

NameTypeDescriptionDefault
levelslist[str]Ordered list of categorical level names.required
referencestr | NoneReference level name. Defaults to first level.None

Returns:

TypeDescription
list[str]List of non-reference level names (column labels).

names

Design matrix column name parsing and variable type detection.

Classes:

NameDescription
DesignColumnInfoParsed design matrix column metadata.

Functions:

NameDescription
extract_base_termExtract base term name from column name.
extract_categorical_variablesFind all categorical base variable names from design matrix columns.
identify_column_typeIdentify column type from name (simplified version).
parse_design_column_nameParse design matrix column name into components.

Classes

DesignColumnInfo
DesignColumnInfo(raw_name: str, base_term: str, level: str | None, column_type: Literal['intercept', 'continuous', 'categorical'], is_interaction: bool = False) -> None

Parsed design matrix column metadata.

Attributes:

NameTypeDescription
raw_namestrOriginal column name (e.g., “treatment[A]”).
base_termstrBase variable name without level (e.g., “treatment”).
levelstr | NoneLevel value for categorical, None for continuous (e.g., “A”).
column_typeLiteral['intercept', 'continuous', 'categorical']Type classification.
is_interactionboolWhether this is an interaction term.
Attributes
base_term
base_term: str
column_type
column_type: Literal['intercept', 'continuous', 'categorical']
is_interaction
is_interaction: bool = False
level
level: str | None
raw_name
raw_name: str

Functions

extract_base_term
extract_base_term(name: str) -> str

Extract base term name from column name.

For categorical variables, strips the level suffix. For interactions, extracts base terms without levels.

Parameters:

NameTypeDescriptionDefault
namestrColumn name from design matrix.required

Returns:

TypeDescription
strBase term name.

Examples:

>>> extract_base_term("treatment[A]")
'treatment'
>>> extract_base_term("x")
'x'
>>> extract_base_term("treatment[A]:x")
'treatment:x'
extract_categorical_variables
extract_categorical_variables(X_names: tuple[str, ...] | list[str]) -> set[str]

Find all categorical base variable names from design matrix columns.

Scans column names for bracket patterns and extracts unique base terms.

Parameters:

NameTypeDescriptionDefault
X_namestuple[str, ...] | list[str]Column names from design matrix.required

Returns:

TypeDescription
set[str]Set of categorical variable names (base terms, no levels).

Examples:

>>> extract_categorical_variables(["Intercept", "x", "treatment[A]", "treatment[B]"])
{'treatment'}
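
The bracket-based naming makes both helpers easy to approximate with a regular expression. This is a rough sketch with hypothetical function names; it assumes level names contain neither `]` nor `:`, and its handling of interaction columns is a guess rather than documented behavior.

```python
import re

_LEVEL = re.compile(r"\[[^\]]*\]")  # matches a "[level]" suffix

def base_term_sketch(name):
    """Hypothetical helper: strip every [level] suffix from a column name."""
    return _LEVEL.sub("", name)

def categorical_vars_sketch(names):
    """Hypothetical helper: collect base names of bracketed components."""
    found = set()
    for name in names:
        for part in name.split(":"):       # look inside interactions too
            if _LEVEL.search(part):
                found.add(_LEVEL.sub("", part))
    return found

print(base_term_sketch("treatment[A]:x"))
print(categorical_vars_sketch(["Intercept", "x", "treatment[A]", "treatment[B]"]))
```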
identify_column_type
identify_column_type(name: str) -> Literal['intercept', 'continuous', 'categorical']

Identify column type from name (simplified version).

This is a lightweight alternative to parse_design_column_name() when only the type is needed.

Parameters:

NameTypeDescriptionDefault
namestrColumn name from design matrix.required

Returns:

TypeDescription
Literal['intercept', 'continuous', 'categorical']Column type as string literal.
parse_design_column_name
parse_design_column_name(name: str) -> DesignColumnInfo

Parse design matrix column name into components.

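Handles standard R/formula naming conventions: the intercept ("Intercept"), continuous variables ("x"), categorical levels in brackets ("treatment[A]"), and interactions joined by a colon ("treatment[A]:x").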

Parameters:

NameTypeDescriptionDefault
namestrColumn name from design matrix.required

Returns:

TypeDescription
DesignColumnInfoDesignColumnInfo with parsed components.

Examples:

>>> parse_design_column_name("Intercept")
DesignColumnInfo(raw_name='Intercept', base_term='Intercept',
                 level=None, column_type='intercept', is_interaction=False)
>>> parse_design_column_name("x")
DesignColumnInfo(raw_name='x', base_term='x',
                 level=None, column_type='continuous', is_interaction=False)
>>> parse_design_column_name("treatment[A]")
DesignColumnInfo(raw_name='treatment[A]', base_term='treatment',
                 level='A', column_type='categorical', is_interaction=False)
>>> parse_design_column_name("treatment[A]:x")
DesignColumnInfo(raw_name='treatment[A]:x', base_term='treatment:x',
                 level='A', column_type='categorical', is_interaction=True)

reference

Reference design matrix (X_ref) construction for marginal effects.

Functions:

NameDescription
build_continuous_reference_matrixBuild reference matrix for a continuous focal variable at specific values.
build_counterfactual_design_matricesBuild counterfactual design matrices for g-computation.
build_reference_design_matrixBuild design matrix for reference grid points.
build_reference_rowBuild a single row of the reference design matrix.

Functions

build_continuous_reference_matrix
build_continuous_reference_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, at_values: tuple[float, ...], X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build reference matrix for a continuous focal variable at specific values.

Creates one row per value in at_values. Each row has covariates at their means except the focal variable, which is set to the specified value.

For interaction columns involving the focal variable, the interaction value is properly computed as a product of component values at each at_value.

Parameters:

NameTypeDescriptionDefault
X_namestuple[str, ...] | list[str]Column names from the design matrix.required
focal_varstrName of the continuous focal variable.required
at_valuestuple[float, ...]Specific values to evaluate the focal variable at.required
X_meansndarrayColumn means of the original design matrix, shape (p,).required
set_categoricalsdict[str, str] | NoneOptional dict mapping non-focal categorical variable names to specific levels for indicator encoding instead of marginalizing at column means.None

Returns:

TypeDescription
ndarrayReference design matrix X_ref, shape (n_values, p).
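
The interaction handling described above can be illustrated as follows. This is a simplified sketch, not the library's code: `continuous_reference_sketch` is hypothetical, it ignores `set_categoricals`, and it assumes every component of an interaction also appears as its own column so its reference value can be looked up.

```python
import numpy as np

def continuous_reference_sketch(X_names, focal_var, at_values, X_means):
    """Hypothetical helper: one reference row per requested focal value."""
    names = list(X_names)
    X_means = np.asarray(X_means, dtype=float)
    X_ref = np.tile(X_means, (len(at_values), 1))   # start from column means
    for i, value in enumerate(at_values):
        # Reference value of each single column for this row.
        comp = dict(zip(names, X_means))
        comp[focal_var] = value
        for j, name in enumerate(names):
            parts = name.split(":")
            if name == focal_var:
                X_ref[i, j] = value
            elif len(parts) > 1 and focal_var in parts:
                # Product of component values, not the mean of the product.
                X_ref[i, j] = np.prod([comp[p] for p in parts])
    return X_ref

X_names = ("Intercept", "x", "z", "x:z")
X_means = np.array([1.0, 2.0, 3.0, 7.5])            # made-up column means
X_ref = continuous_reference_sketch(X_names, "x", (0.0, 4.0), X_means)
print(X_ref)
```

In this toy case the empirical mean of the `x:z` column (7.5) is never used; each row gets the focal value times the mean of `z` instead.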
build_counterfactual_design_matrices
build_counterfactual_design_matrices(X: np.ndarray, X_names: tuple[str, ...] | list[str], focal_var: str, levels: list[str]) -> list[np.ndarray]

Build counterfactual design matrices for g-computation.

For each focal level, creates a modified copy of the full design matrix X where the focal variable’s indicator columns are set to match that level and interaction columns involving the focal variable are recomputed from component values. Non-focal columns are left unchanged, preserving the observed covariate distribution.

This is the core building block for weights="observed" (g-computation / counterfactual prediction). Each returned matrix answers: “What would the design matrix look like if every observation were assigned to this focal level?”

Parameters:

NameTypeDescriptionDefault
XndarrayOriginal design matrix, shape (N, p).required
X_namestuple[str, ...] | list[str]Column names from the design matrix, in order.required
focal_varstrName of the categorical focal variable.required
levelslist[str]List of levels to compute counterfactuals for.required

Returns:

TypeDescription
list[ndarray]List of counterfactual design matrices (one per level), each shape (N, p). Order matches levels.

Examples:

For a model y ~ treatment + x + treatment:x with X_names = ("Intercept", "x", "treatment[B]", "x:treatment[B]"):

mats = build_counterfactual_design_matrices(X, X_names, "treatment", ["ref", "B"])
# mats[0]: treatment set to ref for all rows (treatment[B]=0, x:treatment[B]=0)
# mats[1]: treatment set to B for all rows (treatment[B]=1, x:treatment[B]=x_i)
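
The same idea can be sketched end to end on a tiny made-up design matrix. `counterfactual_sketch` is a hypothetical simplification: it assumes treatment-style dummy columns named `focal[level]` and that each component of an interaction is also present as its own column.

```python
import numpy as np

def counterfactual_sketch(X, X_names, focal_var, levels):
    """Hypothetical helper: one modified copy of X per focal level."""
    names = list(X_names)
    prefix = focal_var + "["
    out = []
    for level in levels:
        Xc = np.array(X, dtype=float, copy=True)
        # 1) Set the focal dummies to the indicator pattern of this level.
        for j, name in enumerate(names):
            if ":" not in name and name.startswith(prefix):
                Xc[:, j] = 1.0 if name == f"{focal_var}[{level}]" else 0.0
        # 2) Recompute interactions involving the focal variable row by row.
        for j, name in enumerate(names):
            parts = name.split(":")
            if len(parts) > 1 and any(p.startswith(prefix) for p in parts):
                Xc[:, j] = np.prod([Xc[:, names.index(p)] for p in parts], axis=0)
        out.append(Xc)
    return out

x = np.array([1.0, 2.0, 3.0])
t = np.array([0.0, 1.0, 0.0])                       # observed treatment[B] dummy
X = np.column_stack([np.ones(3), x, t, x * t])
X_names = ("Intercept", "x", "treatment[B]", "x:treatment[B]")
mats = counterfactual_sketch(X, X_names, "treatment", ["ref", "B"])
```

Averaging the model's predictions over the rows of each matrix then gives one g-computation estimate per level.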
build_reference_design_matrix
build_reference_design_matrix(X_names: tuple[str, ...] | list[str], focal_var: str, levels: list[str], X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build design matrix for reference grid points.

Creates an X_ref matrix with one row per focal variable level. Each row represents a reference point where the focal variable is set to that level and all other covariates are set to their reference values.

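Reference value conventions: columns for the focal variable are set to indicator values for the row's level; continuous covariates are held at their column means from X_means; non-focal categorical columns are also held at their column means (marginalizing over them) unless pinned to a specific level via set_categoricals.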

Parameters:

NameTypeDescriptionDefault
X_namestuple[str, ...] | list[str]Column names from the design matrix, in order.required
focal_varstrName of the categorical variable to vary across levels.required
levelslist[str]List of levels for the focal variable, defining row order.required
X_meansndarrayColumn means of the original design matrix, shape (p,). Used for continuous covariate reference values.required
set_categoricalsdict[str, str] | NoneOptional dict mapping non-focal categorical variable names to specific levels to pin them at (instead of marginalizing at X_means). E.g. {"Ethnicity": "Asian"} sets the Ethnicity dummies to indicator values for “Asian”.None

Returns:

TypeDescription
ndarrayReference design matrix X_ref, shape (n_levels, p).

Examples:

Compute X_ref for treatment EMMs:

X_names = ("Intercept", "x", "treatment[A]", "treatment[B]")
X_means = np.array([1.0, 2.5, 0.33, 0.33])  # means from data
levels = ["ref", "A", "B"]

X_ref = build_reference_design_matrix(X_names, "treatment", levels, X_means)
# X_ref[0] = [1.0, 2.5, 0.0, 0.0]  # reference level
# X_ref[1] = [1.0, 2.5, 1.0, 0.0]  # level A
# X_ref[2] = [1.0, 2.5, 0.0, 1.0]  # level B

Note: The first level is typically the reference level (omitted from dummy coding), so its row has all 0s for the focal variable dummies.
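
Once X_ref is available, estimated marginal means are the linear predictor evaluated at each reference row. The sketch below reuses the X_ref values from the example above with made-up coefficients; the coefficient vector is purely illustrative.

```python
import numpy as np

# Reference grid from the example above (rows: ref, A, B).
X_ref = np.array([[1.0, 2.5, 0.0, 0.0],
                  [1.0, 2.5, 1.0, 0.0],
                  [1.0, 2.5, 0.0, 1.0]])

# Made-up coefficients for ("Intercept", "x", "treatment[A]", "treatment[B]").
beta = np.array([10.0, 2.0, 1.5, -0.5])

emm = X_ref @ beta               # one estimated marginal mean per level
diff_vs_ref = emm[1:] - emm[0]   # equals the treatment coefficients here
print(emm, diff_vs_ref)
```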

build_reference_row
build_reference_row(X_names: tuple[str, ...] | list[str], focal_var: str, focal_level: str, X_means: np.ndarray, *, set_categoricals: dict[str, str] | None = None) -> np.ndarray

Build a single row of the reference design matrix.

Creates one reference point where the focal variable is set to the specified level and other covariates are at reference values.

For interaction columns involving the focal variable (e.g., Income:Student[Yes] when focal_var="Student"), the value is computed as the product of component values rather than using the empirical mean of the interaction column.

Parameters:

NameTypeDescriptionDefault
X_namestuple[str, ...] | list[str]Column names from the design matrix.required
focal_varstrName of the focal categorical variable.required
focal_levelstrLevel value to set for the focal variable.required
X_meansndarrayColumn means for continuous covariate reference values.required
set_categoricalsdict[str, str] | NoneOptional dict mapping non-focal categorical variable names to specific levels for indicator encoding. When a non-focal categorical’s base_term matches a key, the dummy is set to 1.0 if the level matches, 0.0 otherwise (instead of using the column mean for marginalization).None

Returns:

TypeDescription
ndarrayReference row, shape (p,).

z_matrix

Sparse Z matrix (random effects design matrix) construction.

Classes:

NameDescription
RandomEffectsInfoComplete random effects specification for lmer/glmer.

Functions:

NameDescription
build_random_effectsBuild complete random effects specification.
build_z_crossedBuild Z matrix for crossed random effects.
build_z_nestedBuild Z matrix for nested random effects.
build_z_simpleBuild Z matrix for single grouping factor.

Classes

RandomEffectsInfo
RandomEffectsInfo(Z: sp.csc_matrix, group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, re_structures_list: list[str] | None = None, re_dims_list: list[int] | None = None, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, column_labels: list[str] = list(), term_permutation: NDArray[np.intp] | None = None) -> None

Complete random effects specification for lmer/glmer.

This container holds the Z matrix and all metadata needed for downstream operations (Lambda building, initialization, results).

Attributes:

NameTypeDescription
Zcsc_matrixSparse random effects design matrix, shape (n, q).
group_ids_listlist[NDArray[intp]]Group ID arrays for each factor.
n_groups_listlist[int]Number of groups per factor.
group_nameslist[str]Names of grouping factors.
random_nameslist[str]Names of random effect terms.
re_structurestrOverall structure type (intercept/slope/diagonal/nested/crossed).
re_structures_listlist[str] | NonePer-factor structure types (for mixed).
X_reNDArray[float64] | list[NDArray[float64]] | NoneRandom effects covariates (for slopes).
column_labelslist[str]Z column names for output.
term_permutationNDArray[intp] | NoneBlock ordering permutation indices.
Attributes
X_re
X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None
Z
Z: sp.csc_matrix
column_labels
column_labels: list[str] = field(default_factory=list)
group_ids_list
group_ids_list: list[NDArray[np.intp]]
group_names
group_names: list[str]
n_groups_list
n_groups_list: list[int]
random_names
random_names: list[str]
re_dims_list
re_dims_list: list[int] | None = None
re_structure
re_structure: str
re_structures_list
re_structures_list: list[str] | None = None
term_permutation
term_permutation: NDArray[np.intp] | None = None

Functions

build_random_effects
build_random_effects(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], group_names: list[str], random_names: list[str], re_structure: str, X_re: NDArray[np.float64] | list[NDArray[np.float64]] | None = None, re_structures_list: list[str] | None = None, group_levels_list: list[list[str]] | None = None, term_permutation: NDArray[np.intp] | None = None) -> RandomEffectsInfo

Build complete random effects specification.

High-level function that constructs the Z matrix and packages all metadata into a RandomEffectsInfo container ready for lmer/glmer consumption.

Parameters:

NameTypeDescriptionDefault
group_ids_listlist[NDArray[intp]]Group ID arrays for each factor.required
n_groups_listlist[int]Number of groups per factor.required
group_nameslist[str]Names of grouping factors.required
random_nameslist[str]Names of random effect terms.required
re_structurestrOverall structure type: "intercept" (random intercept only), "slope" (correlated intercept + slopes), "diagonal" (uncorrelated intercept + slopes), "nested" (nested hierarchy), or "crossed" (crossed factors).required
X_reNDArray[float64] | list[NDArray[float64]] | NoneRandom effects covariates (for slopes).None
re_structures_listlist[str] | NonePer-factor structure (for mixed).None
group_levels_listlist[list[str]] | NoneLevel names per factor (for labels).None
term_permutationNDArray[intp] | NoneBlock ordering permutation.None

Returns:

TypeDescription
RandomEffectsInfoRandomEffectsInfo with Z matrix and all metadata.

Examples:

>>> # (Days|Subject) with 18 subjects
>>> group_ids = np.arange(180) // 10
>>> n_groups = 18
>>> X_re = np.column_stack([np.ones(180), np.tile(np.arange(10), 18)])
>>> info = build_random_effects(
...     group_ids_list=[group_ids],
...     n_groups_list=[n_groups],
...     group_names=["Subject"],
...     random_names=["Intercept", "Days"],
...     re_structure="slope",
...     X_re=X_re,
... )
>>> info.Z.shape
(180, 36)  # 18 subjects * 2 RE
build_z_crossed
build_z_crossed(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None, layouts: list[str] | None = None) -> sp.csc_matrix

Build Z matrix for crossed random effects.

Crossed effects like (1|subject) + (1|item) create independent random effects for each factor. The Z matrix is a horizontal concatenation of Z matrices for each factor.

Parameters:

NameTypeDescriptionDefault
group_ids_listlist[NDArray[intp]]List of group ID arrays, one per factor.required
n_groups_listlist[int]Number of groups per factor.required
X_re_listlist[NDArray[float64] | None] | NoneRandom effects design per factor, or None for intercepts.None
layoutslist[str] | NoneLayout per factor. Default: interleaved for all.None

Returns:

TypeDescription
csc_matrixSparse Z matrix, shape (n, sum(n_groups_i * n_re_i)).

Examples:

>>> # (1|subject) + (1|item) with 3 subjects, 4 items (only items 0 and 1 occur in these 6 rows)
>>> subj_ids = np.array([0, 1, 2, 0, 1, 2])
>>> item_ids = np.array([0, 0, 0, 1, 1, 1])
>>> Z = build_z_crossed(
...     group_ids_list=[subj_ids, item_ids],
...     n_groups_list=[3, 4]
... )
>>> Z.shape
(6, 7)  # 3 subject + 4 item columns
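
For intercept-only factors, the horizontal concatenation described above amounts to stacking one sparse indicator matrix per factor. This sketch uses SciPy directly with a hypothetical `indicator` helper; it does not cover slopes or the `layouts` argument.

```python
import numpy as np
import scipy.sparse as sp

def indicator(group_ids, n_groups):
    """Hypothetical helper: sparse (n, n_groups) membership matrix."""
    group_ids = np.asarray(group_ids)
    n = group_ids.shape[0]
    return sp.csc_matrix(
        (np.ones(n), (np.arange(n), group_ids)), shape=(n, n_groups)
    )

subj_ids = np.array([0, 1, 2, 0, 1, 2])
item_ids = np.array([0, 0, 0, 1, 1, 1])

# One block of columns per factor, placed side by side.
Z = sp.hstack([indicator(subj_ids, 3), indicator(item_ids, 4)], format="csc")
row_sums = np.asarray(Z.sum(axis=1)).ravel()
print(Z.shape, row_sums)
```

Each row has exactly one nonzero entry per factor, so every row sum equals the number of factors.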
build_z_nested
build_z_nested(group_ids_list: list[NDArray[np.intp]], n_groups_list: list[int], X_re_list: list[NDArray[np.float64] | None] | None = None) -> sp.csc_matrix

Build Z matrix for nested random effects.

Nested effects like (1|school/class) create separate random intercepts for each level of the hierarchy. The Z matrix is a horizontal concatenation of Z matrices for each level.

Parameters:

NameTypeDescriptionDefault
group_ids_listlist[NDArray[intp]]List of group ID arrays, ordered [inner, ..., outer]. For (1|school/class): [class_ids, school_ids].required
n_groups_listlist[int]Number of groups at each level.required
X_re_listlist[NDArray[float64] | None] | NoneRandom effects design per level, or None for intercepts.None

Returns:

TypeDescription
csc_matrixSparse Z matrix, shape (n, sum(n_groups_i * n_re_i)).

Examples:

>>> # (1|school/class) with 2 schools, 4 classes total
>>> class_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
>>> school_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
>>> Z = build_z_nested(
...     group_ids_list=[class_ids, school_ids],
...     n_groups_list=[4, 2]
... )
>>> Z.shape
(8, 6)  # 4 class columns + 2 school columns
build_z_simple
build_z_simple(group_ids: NDArray[np.intp], n_groups: int, X_re: NDArray[np.float64] | None = None, layout: Literal['interleaved', 'blocked'] = 'interleaved') -> sp.csc_matrix

Build Z matrix for single grouping factor.

Constructs Z directly in sparse COO format without dense intermediates. For large-scale data (e.g., InstEval with 73k obs x 4k groups), this uses O(n x n_re) memory instead of O(n x n_groups x n_re).

Handles intercept-only, correlated slopes, and uncorrelated slopes by varying the X_re input and layout parameter.

Parameters:

NameTypeDescriptionDefault
group_idsNDArray[intp]Array of group assignments, shape (n,), values 0..n_groups-1.required
n_groupsintTotal number of groups.required
X_reNDArray[float64] | NoneRandom effects design matrix, shape (n, n_re). None or a single column of 1s gives an intercept-only model; multiple columns give intercept + slopes.None
layoutLiteral['interleaved', 'blocked']Column ordering: "interleaved" gives [g1_int, g1_slope, g2_int, g2_slope, ...]; "blocked" gives [g1_int, g2_int, ..., g1_slope, g2_slope, ...].'interleaved'

Returns:

TypeDescription
csc_matrixSparse Z matrix in CSC format, shape (n, n_groups * n_re).

Examples:

>>> # Random intercept only
>>> group_ids = np.array([0, 0, 1, 1])
>>> Z = build_z_simple(group_ids, n_groups=2)
>>> Z.shape
(4, 2)
>>> # Random intercept + slope (correlated)
>>> X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="interleaved")
>>> Z.shape
(4, 4)
>>> # Uncorrelated random effects
>>> Z = build_z_simple(group_ids, n_groups=2, X_re=X_re, layout="blocked")
>>> Z.shape
(4, 4)
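
The two layouts differ only in how a (group, effect) pair maps to a column index. The sketch below builds the matrix from COO triplets to make that mapping explicit; `z_simple_sketch` is a hypothetical stand-in, not the library's implementation.

```python
import numpy as np
import scipy.sparse as sp

def z_simple_sketch(group_ids, n_groups, X_re=None, layout="interleaved"):
    """Hypothetical helper: sparse Z from COO triplets, no dense intermediate."""
    group_ids = np.asarray(group_ids)
    n = group_ids.shape[0]
    X_re = np.ones((n, 1)) if X_re is None else np.asarray(X_re, dtype=float)
    n_re = X_re.shape[1]
    rows = np.repeat(np.arange(n), n_re)     # each observation fills n_re entries
    effect = np.tile(np.arange(n_re), n)     # effect index within a row
    group = np.repeat(group_ids, n_re)
    if layout == "interleaved":
        cols = group * n_re + effect         # [g1_int, g1_slope, g2_int, ...]
    elif layout == "blocked":
        cols = effect * n_groups + group     # [g1_int, g2_int, ..., g1_slope, ...]
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    vals = X_re.ravel()                      # row-major order matches rows/effect
    return sp.coo_matrix((vals, (rows, cols)), shape=(n, n_groups * n_re)).tocsc()

group_ids = np.array([0, 0, 1, 1])
X_re = np.column_stack([np.ones(4), [1, 2, 1, 2]])
Z_int = z_simple_sketch(group_ids, n_groups=2)
Z_inter = z_simple_sketch(group_ids, 2, X_re, layout="interleaved")
Z_block = z_simple_sketch(group_ids, 2, X_re, layout="blocked")
print(Z_inter.toarray())
print(Z_block.toarray())
```

Both layouts hold the same values; the blocked matrix is a column permutation of the interleaved one.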