Mixed models handle data with natural grouping — students within classrooms, measurements within patients, items within categories. They estimate both population-level effects (what’s true on average?) and group-level variation (how much do groups differ?).
The math is more complex than OLS because we’re simultaneously estimating coefficients and variance components, but the core ideas build directly on the linear algebra and OLS foundations from earlier pages. If OLS finds the single best line through your data, mixed models find that population-level line while accounting for the fact that different groups may have their own lines.
The mixed model equation¶
The problem¶
In grouped data, observations within the same group are correlated. Students in the same classroom share a teacher, a curriculum, a room temperature. OLS assumes independence, which leads to incorrect standard errors — typically too small, making us overconfident in our estimates.
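To see this concretely, here is a standalone simulation (separate from the running example below; all numbers are illustrative) in which the predictor varies only at the group level. Because students in a group share the same group-level noise, the naive OLS standard error badly understates the true run-to-run variability of the slope estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per, n_sims = 10, 20, 2000
n = n_groups * n_per
group_id = np.repeat(np.arange(n_groups), n_per)

slopes, naive_ses = [], []
for _ in range(n_sims):
    x_g = rng.normal(0, 1, n_groups)        # predictor varies only by group
    x = x_g[group_id]
    X = np.column_stack([np.ones(n), x])
    b_g = rng.normal(0, 3.0, n_groups)      # random intercepts (shared group noise)
    y = 70 + 2 * x + b_g[group_id] + rng.normal(0, 2, n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)        # naive residual variance estimate
    naive_ses.append(np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1]))
    slopes.append(beta[1])

print(f"empirical SD of slope estimates: {np.std(slopes):.3f}")
print(f"average naive OLS SE:            {np.mean(naive_ses):.3f}")
```

The empirical spread of the slope estimates comes out substantially larger than the standard error OLS reports, which is exactly the overconfidence described above.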
The model¶
The linear mixed model is:

$$y = X\beta + Zb + \varepsilon, \qquad b \sim \mathcal{N}(0, \Sigma), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)$$

where:

- $\beta$ are the fixed effects (population-level): the average relationship between predictors and outcome
- $b$ are the random effects (group-level deviations): how much each group deviates from the population average
- $\varepsilon$ is the residual noise: what’s left over after accounting for both
import numpy as np

# note: no RNG seed is set, so the printed outputs below vary from run to run
n_groups = 10
n_per_group = 20
n = n_groups * n_per_group
# Group assignments
group_id = np.repeat(np.arange(n_groups), n_per_group)
# Fixed effects: intercept + one predictor (study hours)
hours = np.random.normal(5, 2, n)
X = np.column_stack([np.ones(n), hours])
# Random intercepts for each classroom
true_group_sd = 3.0
group_effects = np.random.normal(0, true_group_sd, n_groups)
# Generate outcome
beta_true = np.array([70.0, 2.0]) # intercept, slope
y = X @ beta_true + group_effects[group_id] + np.random.normal(0, 2, n)
print(f"True fixed effects: β = {beta_true}")
print(f"True group SD: σ_group = {true_group_sd}")
print(f"Group effects range: [{group_effects.min():.1f}, {group_effects.max():.1f}]")

True fixed effects: β = [70. 2.]
True group SD: σ_group = 3.0
Group effects range: [-4.1, 11.6]
The matrix $Z$ is an indicator matrix that maps each observation to its group. Each row has a single 1 in the column corresponding to that observation’s group.
# Z matrix: indicator for group membership
Z = np.zeros((n, n_groups))
for i in range(n):
    Z[i, group_id[i]] = 1
print(f"Z shape: {Z.shape}")
print(f"Z[0, :] = {Z[0, :].astype(int)} (student 0 is in group {group_id[0]})")
print(f"Students per group: {Z.sum(axis=0).astype(int)}")

Z shape: (200, 10)
Z[0, :] = [1 0 0 0 0 0 0 0 0 0] (student 0 is in group 0)
Students per group: [20 20 20 20 20 20 20 20 20 20]
Lambda and theta: parameterizing random effects¶
The problem¶
The random effects $b$ have unknown variance $\sigma_b^2$. We need to estimate it alongside the fixed effects $\beta$ and the residual variance $\sigma^2$.
Building intuition¶
lme4 (and bossanova) reparameterize the random effects using a relative covariance factor $\Lambda_\theta$ (Lambda):

$$b = \Lambda_\theta u, \qquad u \sim \mathcal{N}(0, \sigma^2 I)$$

The matrix $\Lambda_\theta$ is controlled by a small parameter vector $\theta$ (theta). For a random intercept model, $\theta$ is a single number: the ratio $\sigma_{\text{group}} / \sigma$.

Theorem LMM.4--5: The random effects covariance is $\Sigma = \sigma^2 \Lambda_\theta \Lambda_\theta^\top$, parameterized by $\theta$.
# For random intercepts: Lambda = theta * I
# theta = sigma_group / sigma_residual
theta = true_group_sd / 2.0 # true ratio (group_sd / residual_sd)
Lambda = theta * np.eye(n_groups)
print(f"theta = {theta:.2f} (ratio of SDs)")
print(f"Lambda (diagonal): {np.diag(Lambda)}")
print(f"Var(b) = sigma² * Lambda @ Lambda.T")
print(f" = {2.0**2} * {theta**2:.2f} * I = {2.0**2 * theta**2:.2f} * I")

theta = 1.50 (ratio of SDs)
Lambda (diagonal): [1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5]
Var(b) = sigma² * Lambda @ Lambda.T
= 4.0 * 2.25 * I = 9.00 * I
Why reparameterize with theta?
The $\theta$ parameterization has a key advantage: $\Lambda_\theta$ is always defined (even at $\theta = 0$, meaning no group variation), the optimization over $\theta$ is unconstrained, and the number of parameters is small. For a random intercept model, $\theta$ is a single scalar. Even for complex random-effects structures (random intercepts and slopes), $\theta$ stays compact: it only has as many elements as there are unique entries in the lower-triangular Cholesky factor of the random-effects covariance.
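As an illustration of how compact $\theta$ stays, here is a hypothetical sketch (all values made up) of $\Lambda_\theta$ for a random intercept and slope model: three theta elements fill a 2x2 lower-triangular factor that repeats block-diagonally across groups.

```python
import numpy as np

n_groups = 4
theta = np.array([1.5, 0.3, 0.8])    # illustrative: [L11, L21, L22]
sigma2 = 4.0                         # assumed residual variance

# Per-group lower-triangular Cholesky factor built from theta
L = np.array([[theta[0], 0.0],
              [theta[1], theta[2]]])
# Relative covariance factor: one 2x2 block per group
Lambda = np.kron(np.eye(n_groups), L)

cov_b = sigma2 * L @ L.T             # Cov(intercept, slope) within a group
print(f"Lambda shape: {Lambda.shape} (2 effects x {n_groups} groups)")
print("per-group Cov(b):")
print(cov_b)
```

However many groups there are, the number of free parameters stays at three; only the block-diagonal replication grows.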
Penalized least squares (PLS)¶
The problem¶
We need to find $\beta$, $u$, and $\theta$ simultaneously. For a fixed $\theta$, the PLS approach jointly optimizes over $\beta$ and $u$; an outer optimization then handles $\theta$.
Building intuition¶
The penalized residual sum of squares is:

$$r^2(\theta, \beta, u) = \|y - X\beta - Z\Lambda_\theta u\|^2 + \|u\|^2$$
The first term is the usual RSS — how well do we fit the data? The second term is a penalty that shrinks the random effects toward zero — it’s what makes mixed models “borrow strength” across groups.
Theorem LMM.3: At the PLS solution, the penalized weighted residual sum of squares is minimized.
# Group-specific OLS (no pooling)
group_means_ols = np.array([y[group_id == g].mean() for g in range(n_groups)])
# Shrinkage estimate: weighted average of group mean and grand mean
grand_mean = y.mean()
shrinkage = n_per_group / (n_per_group + (2.0 / true_group_sd)**2) # approximate
group_means_shrunk = shrinkage * group_means_ols + (1 - shrinkage) * grand_mean
print(f"Shrinkage factor: {shrinkage:.3f}")
print(f"\n{'Group':>5} {'OLS':>8} {'Shrunk':>8} {'β0+b_g':>8}")
for g in range(n_groups):
    # β0 + b_g omits the slope contribution (β1 * mean hours ≈ 10),
    # so this column sits about 10 points below the raw group means
    true_mean = beta_true[0] + group_effects[g]
    print(f" {g:>3} {group_means_ols[g]:>8.2f} {group_means_shrunk[g]:>8.2f} {true_mean:>8.2f}")

Shrinkage factor: 0.978

Group OLS Shrunk β0+b_g
0 80.53 80.56 71.07
1 80.62 80.64 71.68
2 82.72 82.69 73.25
3 83.17 83.14 73.16
4 76.07 76.19 65.87
5 78.10 78.18 67.19
6 80.41 80.44 71.55
7 82.02 82.01 71.54
8 83.38 83.34 71.55
9 90.63 90.44 81.56
Why does shrinkage help?
The penalty pulls group effects toward zero (the population mean). Groups with little data get pulled more strongly. This “shrinkage” or “partial pooling” is what makes mixed models better than fitting each group separately — especially for groups with few observations. A classroom with only 3 students gives a noisy estimate of its mean; the penalty pulls that noisy estimate back toward the overall average, which is usually a better guess than the raw 3-student mean.
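A standalone simulation (illustrative numbers, reusing this page's $\sigma = 2$ and $\sigma_{\text{group}} = 3$) makes the 3-student case concrete: the shrunken estimate of a tiny group's mean has lower mean squared error than the raw group mean.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, sigma_b, n_g = 2.0, 3.0, 3          # residual SD, group SD, tiny group
w = n_g / (n_g + (sigma / sigma_b) ** 2)   # shrinkage weight on the group mean

err_raw, err_shrunk = [], []
for _ in range(5000):
    b = rng.normal(0, sigma_b)                       # true group effect (grand mean 0)
    ybar = b + rng.normal(0, sigma / np.sqrt(n_g))   # raw 3-student mean
    err_raw.append((ybar - b) ** 2)
    err_shrunk.append((w * ybar - b) ** 2)

print(f"shrinkage weight: {w:.3f}")
print(f"MSE raw mean:     {np.mean(err_raw):.3f}")
print(f"MSE shrunk mean:  {np.mean(err_shrunk):.3f}")
```

With more students per group the weight approaches 1 and the two estimates converge; it is precisely the small groups where partial pooling pays off.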
The Schur complement trick¶
The problem¶
The full mixed model system is large: we’re solving for both $\beta$ ($p$ parameters) and $u$ ($q$ random effects) simultaneously. For large datasets with many groups, this can be a very big $(p+q) \times (p+q)$ matrix.
Building intuition¶
The Schur complement lets us eliminate $u$ from the system and solve a smaller ($p \times p$) system for $\beta$ alone.
Theorem LMM.1: The Schur complement reduces the $(p+q) \times (p+q)$ system to a $p \times p$ system for $\beta$.
# The augmented system (random-intercept case, written on the b scale,
# so the penalty appears as I/θ²) is:
#   [Z'Z + I/θ²   Z'X] [b]   [Z'y]
#   [X'Z          X'X] [β] = [X'y]
#
# Schur complement eliminates b:
#   (X'X - X'Z(Z'Z + I/θ²)^{-1}Z'X) β = X'y - X'Z(Z'Z + I/θ²)^{-1}Z'y
ZtZ = Z.T @ Z
I_q = np.eye(n_groups)
ZtZ_plus_I_inv = np.linalg.inv(ZtZ + I_q / theta**2)  # (Z'Z + I/θ²)^{-1}
# Reduced system for beta
A = X.T @ X - X.T @ Z @ ZtZ_plus_I_inv @ Z.T @ X
b_rhs = X.T @ y - X.T @ Z @ ZtZ_plus_I_inv @ Z.T @ y
beta_pls = np.linalg.solve(A, b_rhs)
print(f"Fixed effects via PLS: {beta_pls}")
print(f"True fixed effects: {beta_true}")
print(f"\nSystem reduced from {X.shape[1] + n_groups}x{X.shape[1] + n_groups} to {X.shape[1]}x{X.shape[1]}")

Fixed effects via PLS: [71.8449 2.0165]
True fixed effects: [70. 2.]
System reduced from 12x12 to 2x2
Why does the Schur complement matter?
Without the Schur complement, we’d solve a $(p+q) \times (p+q)$ system. With it, we solve a $p \times p$ system (typically much smaller, since $p \ll q$). This is crucial for models with many groups: solving a $p \times p$ system is much faster than a $(p+q) \times (p+q)$ one. In practice, bossanova uses sparse Cholesky factorization (not dense inverse) to make this even more efficient, but the mathematical idea is the same.
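A quick self-contained sanity check (re-simulating data with assumed values like those above, so it runs on its own): solving the full augmented system and the Schur-reduced system should give identical fixed effects.

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups, n_per = 10, 20
n = n_groups * n_per
group_id = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(5, 2, n)
X = np.column_stack([np.ones(n), x])
Z = np.zeros((n, n_groups))
Z[np.arange(n), group_id] = 1
y = X @ np.array([70.0, 2.0]) + rng.normal(0, 3, n_groups)[group_id] + rng.normal(0, 2, n)
theta = 1.5
P = np.eye(n_groups) / theta**2        # penalty block I/θ²

# Full (p+q) x (p+q) augmented system for (b, beta)
full = np.block([[Z.T @ Z + P, Z.T @ X],
                 [X.T @ Z,     X.T @ X]])
rhs = np.concatenate([Z.T @ y, X.T @ y])
beta_full = np.linalg.solve(full, rhs)[n_groups:]

# Schur-complement reduction to a p x p system for beta alone
M = np.linalg.inv(Z.T @ Z + P)
A = X.T @ X - X.T @ Z @ M @ Z.T @ X
r = X.T @ y - X.T @ Z @ M @ Z.T @ y
beta_schur = np.linalg.solve(A, r)

print("beta (full system):", beta_full)
print("beta (Schur):      ", beta_schur)
```

The two solutions agree to machine precision; the reduction only changes the cost, not the answer.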
Optimality conditions¶
Theorem LMM.2: At the PLS solution, the gradient of the PWRSS with respect to $\beta$ and $u$ is zero.
This is the mixed-model analog of OLS.1 (normal equations). At the optimum, the augmented residuals are orthogonal to both $X$ (for $\beta$) and $Z$ (for $u$).
# Recover the random effects from beta.  Because the system above uses
# Z'Z + I/θ², its solution is already b on the original scale.
b_hat = ZtZ_plus_I_inv @ Z.T @ (y - X @ beta_pls)
residuals = y - X @ beta_pls - Z @ b_hat
# Check optimality: X'e ≈ 0 and Z'e - b/θ² ≈ 0 (the penalized stationarity condition)
print(f"X'e (should be ~0): max = {np.max(np.abs(X.T @ residuals)):.6f}")
print(f"Z'e - b/θ² (should be ~0): max = {np.max(np.abs(Z.T @ residuals - b_hat / theta**2)):.6f}")

X'e (should be ~0): max = 0.000000
Z'e - b/θ² (should be ~0): max = 0.000000
Summary¶
| Theorem | Statement | What it means |
|---|---|---|
| LMM.1 | Schur complement reduction | Efficient solving for $\beta$ |
| LMM.2 | Gradient = 0 at solution | Optimality conditions |
| LMM.3 | PWRSS minimized | Penalized least squares objective |
| LMM.4--5 | $\Sigma = \sigma^2 \Lambda_\theta \Lambda_\theta^\top$ via $\theta$ | Random effects parameterization |