Mixed models handle data with natural grouping — students within classrooms, measurements within patients, items within categories. They estimate both population-level effects (what’s true on average?) and group-level variation (how much do groups differ?).
The math is more complex than OLS because we’re simultaneously estimating coefficients and variance components, but the core ideas build directly on the linear algebra and OLS foundations from earlier pages. If OLS finds the single best line through your data, mixed models find that population-level line while accounting for the fact that different groups may have their own lines.
The mixed model equation¶
The problem¶
In grouped data, observations within the same group are correlated. Students in the same classroom share a teacher, a curriculum, a room temperature. OLS assumes independence, which leads to incorrect standard errors — typically too small, making us overconfident in our estimates.
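To see this concretely, here is a standalone simulation (separate from the running example below; all numbers are illustrative) in which the predictor varies only at the group level. Because students in a group share the same group-level noise, the naive OLS standard error badly understates the true run-to-run variability of the slope estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per, n_sims = 10, 20, 2000
n = n_groups * n_per
group_id = np.repeat(np.arange(n_groups), n_per)

slopes, naive_ses = [], []
for _ in range(n_sims):
    x_g = rng.normal(0, 1, n_groups)        # predictor varies only by group
    x = x_g[group_id]
    X = np.column_stack([np.ones(n), x])
    b_g = rng.normal(0, 3.0, n_groups)      # random intercepts (shared group noise)
    y = 70 + 2 * x + b_g[group_id] + rng.normal(0, 2, n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)        # naive residual variance estimate
    naive_ses.append(np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1]))
    slopes.append(beta[1])

print(f"empirical SD of slope estimates: {np.std(slopes):.3f}")
print(f"average naive OLS SE:            {np.mean(naive_ses):.3f}")
```

The empirical spread of the slope estimates comes out substantially larger than the standard error OLS reports, which is exactly the overconfidence described above.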
The model¶
The linear mixed model is:

$$y = X\beta + Zb + \varepsilon, \qquad b \sim \mathcal{N}(0, \Sigma), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)$$

where:

- $\beta$ are the fixed effects (population-level): the average relationship between predictors and outcome
- $b$ are the random effects (group-level deviations): how much each group deviates from the population average
- $\varepsilon$ is the residual noise: what’s left over after accounting for both
import numpy as np

# note: no RNG seed is set, so the printed outputs below vary from run to run
n_groups = 10
n_per_group = 20
n = n_groups * n_per_group
# Group assignments
group_id = np.repeat(np.arange(n_groups), n_per_group)
# Fixed effects: intercept + one predictor (study hours)
hours = np.random.normal(5, 2, n)
X = np.column_stack([np.ones(n), hours])
# Random intercepts for each classroom
true_group_sd = 3.0
group_effects = np.random.normal(0, true_group_sd, n_groups)
# Generate outcome
beta_true = np.array([70.0, 2.0]) # intercept, slope
y = X @ beta_true + group_effects[group_id] + np.random.normal(0, 2, n)
print(f"True fixed effects: β = {beta_true}")
print(f"True group SD: σ_group = {true_group_sd}")
print(f"Group effects range: [{group_effects.min():.1f}, {group_effects.max():.1f}]")

True fixed effects: β = [70. 2.]
True group SD: σ_group = 3.0
Group effects range: [-4.1, 11.6]
The matrix $Z$ is an indicator matrix that maps each observation to its group. Each row has a single 1 in the column corresponding to that observation’s group.
# Z matrix: indicator for group membership
Z = np.zeros((n, n_groups))
for i in range(n):
    Z[i, group_id[i]] = 1
print(f"Z shape: {Z.shape}")
print(f"Z[0, :] = {Z[0, :].astype(int)} (student 0 is in group {group_id[0]})")
print(f"Students per group: {Z.sum(axis=0).astype(int)}")

Z shape: (200, 10)
Z[0, :] = [1 0 0 0 0 0 0 0 0 0] (student 0 is in group 0)
Students per group: [20 20 20 20 20 20 20 20 20 20]
Lambda and theta: parameterizing random effects¶
The problem¶
The random effects $b$ have unknown variance $\sigma_b^2$. We need to estimate it alongside the fixed effects $\beta$ and the residual variance $\sigma^2$.
Building intuition¶
lme4 (and bossanova) reparameterize the random effects using a relative covariance factor $\Lambda_\theta$ (Lambda):

$$b = \Lambda_\theta u, \qquad u \sim \mathcal{N}(0, \sigma^2 I)$$

The matrix $\Lambda_\theta$ is controlled by a small parameter vector $\theta$ (theta). For a random intercept model, $\theta$ is a single number: the ratio $\sigma_{\text{group}} / \sigma$.

Theorem LMM.4--5: The random effects covariance is $\Sigma = \sigma^2 \Lambda_\theta \Lambda_\theta^\top$, parameterized by $\theta$.
# For random intercepts: Lambda = theta * I
# theta = sigma_group / sigma_residual
theta = true_group_sd / 2.0 # true ratio (group_sd / residual_sd)
Lambda = theta * np.eye(n_groups)
print(f"theta = {theta:.2f} (ratio of SDs)")
print(f"Lambda (diagonal): {np.diag(Lambda)}")
print(f"Var(b) = sigma² * Lambda @ Lambda.T")
print(f" = {2.0**2} * {theta**2:.2f} * I = {2.0**2 * theta**2:.2f} * I")

theta = 1.50 (ratio of SDs)
Lambda (diagonal): [1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5]
Var(b) = sigma² * Lambda @ Lambda.T
= 4.0 * 2.25 * I = 9.00 * I
Why reparameterize with theta?
The $\theta$ parameterization has a key advantage: $\Lambda_\theta$ is always defined (even at $\theta = 0$, meaning no group variation), the optimization over $\theta$ is unconstrained, and the number of parameters is small. For a random intercept model, $\theta$ is a single scalar. Even for complex random-effects structures (random intercepts and slopes), $\theta$ stays compact: it only has as many elements as there are unique entries in the lower-triangular Cholesky factor of the random-effects covariance.
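As an illustration of how compact $\theta$ stays, here is a hypothetical sketch (all values made up) of $\Lambda_\theta$ for a random intercept and slope model: three theta elements fill a 2x2 lower-triangular factor that repeats block-diagonally across groups.

```python
import numpy as np

n_groups = 4
theta = np.array([1.5, 0.3, 0.8])    # illustrative: [L11, L21, L22]
sigma2 = 4.0                         # assumed residual variance

# Per-group lower-triangular Cholesky factor built from theta
L = np.array([[theta[0], 0.0],
              [theta[1], theta[2]]])
# Relative covariance factor: one 2x2 block per group
Lambda = np.kron(np.eye(n_groups), L)

cov_b = sigma2 * L @ L.T             # Cov(intercept, slope) within a group
print(f"Lambda shape: {Lambda.shape} (2 effects x {n_groups} groups)")
print("per-group Cov(b):")
print(cov_b)
```

However many groups there are, the number of free parameters stays at three; only the block-diagonal replication grows.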
Penalized least squares (PLS)¶
The problem¶
We need to find $\beta$, $u$, and $\theta$ simultaneously. For a fixed $\theta$, the PLS approach jointly optimizes over $\beta$ and $u$; an outer optimization then handles $\theta$.
Building intuition¶
The penalized residual sum of squares is:

$$r^2(\theta, \beta, u) = \|y - X\beta - Z\Lambda_\theta u\|^2 + \|u\|^2$$
The first term is the usual RSS — how well do we fit the data? The second term is a penalty that shrinks the random effects toward zero — it’s what makes mixed models “borrow strength” across groups.
Theorem LMM.3: At the PLS solution, the penalized weighted residual sum of squares is minimized.
# Group-specific OLS (no pooling)
group_means_ols = np.array([y[group_id == g].mean() for g in range(n_groups)])
# Shrinkage estimate: weighted average of group mean and grand mean
grand_mean = y.mean()
shrinkage = n_per_group / (n_per_group + (2.0 / true_group_sd)**2) # approximate
group_means_shrunk = shrinkage * group_means_ols + (1 - shrinkage) * grand_mean
print(f"Shrinkage factor: {shrinkage:.3f}")
print(f"\n{'Group':>5} {'OLS':>8} {'Shrunk':>8} {'β0+b_g':>8}")
for g in range(n_groups):
    # β0 + b_g omits the slope contribution (β1 * mean hours ≈ 10),
    # so this column sits about 10 points below the raw group means
    true_mean = beta_true[0] + group_effects[g]
    print(f" {g:>3} {group_means_ols[g]:>8.2f} {group_means_shrunk[g]:>8.2f} {true_mean:>8.2f}")

Shrinkage factor: 0.978

Group OLS Shrunk β0+b_g
0 80.53 80.56 71.07
1 80.62 80.64 71.68
2 82.72 82.69 73.25
3 83.17 83.14 73.16
4 76.07 76.19 65.87
5 78.10 78.18 67.19
6 80.41 80.44 71.55
7 82.02 82.01 71.54
8 83.38 83.34 71.55
9 90.63 90.44 81.56
Why does shrinkage help?
The penalty pulls group effects toward zero (the population mean). Groups with little data get pulled more strongly. This “shrinkage” or “partial pooling” is what makes mixed models better than fitting each group separately — especially for groups with few observations. A classroom with only 3 students gives a noisy estimate of its mean; the penalty pulls that noisy estimate back toward the overall average, which is usually a better guess than the raw 3-student mean.
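A standalone simulation (illustrative numbers, reusing this page's $\sigma = 2$ and $\sigma_{\text{group}} = 3$) makes the 3-student case concrete: the shrunken estimate of a tiny group's mean has lower mean squared error than the raw group mean.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, sigma_b, n_g = 2.0, 3.0, 3          # residual SD, group SD, tiny group
w = n_g / (n_g + (sigma / sigma_b) ** 2)   # shrinkage weight on the group mean

err_raw, err_shrunk = [], []
for _ in range(5000):
    b = rng.normal(0, sigma_b)                       # true group effect (grand mean 0)
    ybar = b + rng.normal(0, sigma / np.sqrt(n_g))   # raw 3-student mean
    err_raw.append((ybar - b) ** 2)
    err_shrunk.append((w * ybar - b) ** 2)

print(f"shrinkage weight: {w:.3f}")
print(f"MSE raw mean:     {np.mean(err_raw):.3f}")
print(f"MSE shrunk mean:  {np.mean(err_shrunk):.3f}")
```

With more students per group the weight approaches 1 and the two estimates converge; it is precisely the small groups where partial pooling pays off.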
The Schur complement trick¶
The problem¶
The full mixed model system is large: we’re solving for both $\beta$ ($p$ parameters) and $u$ ($q$ random effects) simultaneously. For large datasets with many groups, this can be a very big $(p+q) \times (p+q)$ matrix.
Building intuition¶
The Schur complement lets us eliminate $u$ from the system and solve a smaller ($p \times p$) system for $\beta$ alone.
Theorem LMM.1: The Schur complement reduces the $(p+q) \times (p+q)$ system to a $p \times p$ system for $\beta$.
# The augmented system (random-intercept case, written on the b scale,
# so the penalty appears as I/θ²) is:
#   [Z'Z + I/θ²   Z'X] [b]   [Z'y]
#   [X'Z          X'X] [β] = [X'y]
#
# Schur complement eliminates b:
#   (X'X - X'Z(Z'Z + I/θ²)^{-1}Z'X) β = X'y - X'Z(Z'Z + I/θ²)^{-1}Z'y
ZtZ = Z.T @ Z
I_q = np.eye(n_groups)
ZtZ_plus_I_inv = np.linalg.inv(ZtZ + I_q / theta**2)  # (Z'Z + I/θ²)^{-1}
# Reduced system for beta
A = X.T @ X - X.T @ Z @ ZtZ_plus_I_inv @ Z.T @ X
b_rhs = X.T @ y - X.T @ Z @ ZtZ_plus_I_inv @ Z.T @ y
beta_pls = np.linalg.solve(A, b_rhs)
print(f"Fixed effects via PLS: {beta_pls}")
print(f"True fixed effects: {beta_true}")
print(f"\nSystem reduced from {X.shape[1] + n_groups}x{X.shape[1] + n_groups} to {X.shape[1]}x{X.shape[1]}")

Fixed effects via PLS: [71.8449 2.0165]
True fixed effects: [70. 2.]
System reduced from 12x12 to 2x2
Why does the Schur complement matter?
Without the Schur complement, we’d solve a $(p+q) \times (p+q)$ system. With it, we solve a $p \times p$ system (typically much smaller, since $p \ll q$). This is crucial for models with many groups: solving a $p \times p$ system is much faster than a $(p+q) \times (p+q)$ one. In practice, bossanova uses sparse Cholesky factorization (not dense inverse) to make this even more efficient, but the mathematical idea is the same.
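A quick self-contained sanity check (re-simulating data with assumed values like those above, so it runs on its own): solving the full augmented system and the Schur-reduced system should give identical fixed effects.

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups, n_per = 10, 20
n = n_groups * n_per
group_id = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(5, 2, n)
X = np.column_stack([np.ones(n), x])
Z = np.zeros((n, n_groups))
Z[np.arange(n), group_id] = 1
y = X @ np.array([70.0, 2.0]) + rng.normal(0, 3, n_groups)[group_id] + rng.normal(0, 2, n)
theta = 1.5
P = np.eye(n_groups) / theta**2        # penalty block I/θ²

# Full (p+q) x (p+q) augmented system for (b, beta)
full = np.block([[Z.T @ Z + P, Z.T @ X],
                 [X.T @ Z,     X.T @ X]])
rhs = np.concatenate([Z.T @ y, X.T @ y])
beta_full = np.linalg.solve(full, rhs)[n_groups:]

# Schur-complement reduction to a p x p system for beta alone
M = np.linalg.inv(Z.T @ Z + P)
A = X.T @ X - X.T @ Z @ M @ Z.T @ X
r = X.T @ y - X.T @ Z @ M @ Z.T @ y
beta_schur = np.linalg.solve(A, r)

print("beta (full system):", beta_full)
print("beta (Schur):      ", beta_schur)
```

The two solutions agree to machine precision; the reduction only changes the cost, not the answer.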
Optimality conditions¶
Theorem LMM.2: At the PLS solution, the gradient of the PWRSS with respect to $\beta$ and $u$ is zero.
This is the mixed-model analog of OLS.1 (normal equations). At the optimum, the augmented residuals are orthogonal to both $X$ (for $\beta$) and $Z$ (for $u$).
# Recover the random effects from beta.  Because the system above uses
# Z'Z + I/θ², its solution is already b on the original scale.
b_hat = ZtZ_plus_I_inv @ Z.T @ (y - X @ beta_pls)
residuals = y - X @ beta_pls - Z @ b_hat
# Check optimality: X'e ≈ 0 and Z'e - b/θ² ≈ 0 (the penalized stationarity condition)
print(f"X'e (should be ~0): max = {np.max(np.abs(X.T @ residuals)):.6f}")
print(f"Z'e - b/θ² (should be ~0): max = {np.max(np.abs(Z.T @ residuals - b_hat / theta**2)):.6f}")

X'e (should be ~0): max = 0.000000
Z'e - b/θ² (should be ~0): max = 0.000000
Summary¶
| Theorem | Statement | What it means |
|---|---|---|
| LMM.1 | Schur complement reduction | Efficient solving for $\beta$ |
| LMM.2 | Gradient = 0 at solution | Optimality conditions |
| LMM.3 | PWRSS minimized | Penalized least squares objective |
| LMM.4--5 | $\Sigma = \sigma^2 \Lambda_\theta \Lambda_\theta^\top$ via $\theta$ | Random effects parameterization |