
Mixed models handle data with natural grouping — students within classrooms, measurements within patients, items within categories. They estimate both population-level effects (what’s true on average?) and group-level variation (how much do groups differ?).

The math is more complex than OLS because we’re simultaneously estimating coefficients and variance components, but the core ideas build directly on the linear algebra and OLS foundations from earlier pages. If OLS finds the single best line through your data, mixed models find the best population-level line while accounting for the fact that each group may have its own line around it.

The mixed model equation

The problem

In grouped data, observations within the same group are correlated. Students in the same classroom share a teacher, a curriculum, a room temperature. OLS assumes independence, which leads to incorrect standard errors — typically too small, making us overconfident in our estimates.
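A quick simulation makes the overconfidence concrete. This is a sketch (not the page’s own code) using the same variance components as the simulation below — group SD 3, residual SD 2 — comparing the naive independence-based standard error of the mean against the actual spread of the estimate across repeated datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_groups, n_per = 500, 10, 20

naive_ses, estimates = [], []
for _ in range(n_sims):
    group_fx = rng.normal(0, 3.0, n_groups)                # shared within each group
    y = 70 + np.repeat(group_fx, n_per) + rng.normal(0, 2.0, n_groups * n_per)
    estimates.append(y.mean())
    naive_ses.append(y.std(ddof=1) / np.sqrt(y.size))      # pretends observations are independent

print(f"Average naive SE:       {np.mean(naive_ses):.2f}")
print(f"Actual SD of estimates: {np.std(estimates):.2f}")  # several times larger
```

The naive SE shrinks like $1/\sqrt{n}$, but the group effects don’t average away that fast: with 10 groups, the effective sample size for the group-level noise is 10, not 200.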

The model

The linear mixed model is:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon}$$

where:

- $\mathbf{y}$ is the $n \times 1$ vector of responses,
- $\mathbf{X}$ is the $n \times p$ fixed-effects design matrix and $\boldsymbol{\beta}$ the fixed-effect coefficients,
- $\mathbf{Z}$ is the $n \times q$ random-effects design matrix and $\mathbf{b}$ the random effects,
- $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$ is the residual error.

Let’s simulate grouped data (students within classrooms) to make these pieces concrete:

import numpy as np

n_groups = 10
n_per_group = 20
n = n_groups * n_per_group

# Group assignments
group_id = np.repeat(np.arange(n_groups), n_per_group)

# Fixed effects: intercept + one predictor (study hours)
hours = np.random.normal(5, 2, n)
X = np.column_stack([np.ones(n), hours])

# Random intercepts for each classroom
true_group_sd = 3.0
group_effects = np.random.normal(0, true_group_sd, n_groups)

# Generate outcome
beta_true = np.array([70.0, 2.0])  # intercept, slope
y = X @ beta_true + group_effects[group_id] + np.random.normal(0, 2, n)

print(f"True fixed effects: β = {beta_true}")
print(f"True group SD: σ_group = {true_group_sd}")
print(f"Group effects range: [{group_effects.min():.1f}, {group_effects.max():.1f}]")
True fixed effects: β = [70.  2.]
True group SD: σ_group = 3.0
Group effects range: [-4.1, 11.6]

The $\mathbf{Z}$ matrix is an $n \times q$ indicator matrix that maps each observation to its group. Each row has a single 1 in the column corresponding to that observation’s group.

# Z matrix: indicator for group membership
Z = np.zeros((n, n_groups))
for i in range(n):
    Z[i, group_id[i]] = 1

print(f"Z shape: {Z.shape}")
print(f"Z[0, :] = {Z[0, :].astype(int)}  (student 0 is in group {group_id[0]})")
print(f"Students per group: {Z.sum(axis=0).astype(int)}")
Z shape: (200, 10)
Z[0, :] = [1 0 0 0 0 0 0 0 0 0]  (student 0 is in group 0)
Students per group: [20 20 20 20 20 20 20 20 20 20]

Lambda and theta: parameterizing random effects

The problem

The random effects $\mathbf{b} \sim N(\mathbf{0}, \sigma^2_b \mathbf{I})$ have unknown variance $\sigma^2_b$. We need to estimate it alongside the fixed effects and residual variance.

Building intuition

lme4 (and bossanova) reparameterize the random effects using a relative covariance factor $\boldsymbol{\Lambda}$ (Lambda):

$$\mathbf{b} = \boldsymbol{\Lambda} \mathbf{u}, \quad \mathbf{u} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$$

The matrix $\boldsymbol{\Lambda}$ is controlled by a small parameter vector $\boldsymbol{\theta}$ (theta). For a random intercept model, $\boldsymbol{\theta}$ is a single number: the ratio $\sigma_{\text{group}} / \sigma_{\text{residual}}$.

Theorem LMM.4–5: The random effects covariance is $\text{Var}(\mathbf{b}) = \sigma^2 \boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top$, parameterized by $\boldsymbol{\theta}$.

# For random intercepts: Lambda = theta * I
# theta = sigma_group / sigma_residual
theta = true_group_sd / 2.0  # true ratio (group_sd / residual_sd)
Lambda = theta * np.eye(n_groups)

print(f"theta = {theta:.2f} (ratio of SDs)")
print(f"Lambda (diagonal): {np.diag(Lambda)}")
print(f"Var(b) = sigma² * Lambda @ Lambda.T")
print(f"       = {2.0**2} * {theta**2:.2f} * I = {2.0**2 * theta**2:.2f} * I")
theta = 1.50 (ratio of SDs)
Lambda (diagonal): [1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5]
Var(b) = sigma² * Lambda @ Lambda.T
       = 4.0 * 2.25 * I = 9.00 * I
Why reparameterize with theta?

The $\boldsymbol{\theta}$ parameterization has a key advantage: $\boldsymbol{\Lambda}$ is always defined (even at $\boldsymbol{\theta} = 0$, meaning no group variation), the optimization needs only the simple bound $\theta \ge 0$ rather than a search over valid covariance matrices, and the number of parameters is small. For a random intercept model, $\boldsymbol{\theta}$ is a single scalar. Even for complex random-effects structures (random intercepts and slopes), $\boldsymbol{\theta}$ stays compact: it only has as many elements as there are unique entries in the lower-triangular Cholesky factor of the random-effects covariance.
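As a concrete sketch (with made-up, unfitted values): a random intercept + slope model has a $2 \times 2$ per-group covariance, whose lower-triangular Cholesky factor has three free entries, so $\boldsymbol{\theta}$ has three elements.

```python
import numpy as np

# Hypothetical theta for a random intercept + slope model:
# [t11, t21, t22], with the diagonal entries t11, t22 >= 0.
theta = np.array([1.5, 0.3, 0.8])           # illustrative values, not fitted

L_block = np.array([[theta[0], 0.0],
                    [theta[1], theta[2]]])  # per-group block of Lambda

# Relative covariance of one group's (intercept, slope) effects:
# Var(b_g) = sigma^2 * L_block @ L_block.T
C = L_block @ L_block.T
print(C)
```

Three numbers determine a full $2 \times 2$ covariance (two variances plus a correlation), and the Cholesky form keeps it positive semidefinite by construction.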


Penalized least squares (PLS)

The problem

We need to find $\boldsymbol{\beta}$, $\mathbf{b}$, and $\sigma^2$ simultaneously. The PLS approach jointly optimizes over all parameters.

Building intuition

The penalized residual sum of squares is:

$$\text{PWRSS}(\boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\boldsymbol{\Lambda}\mathbf{u}\|^2 + \|\mathbf{u}\|^2$$

The first term is the usual RSS — how well do we fit the data? The second term is a penalty that shrinks the random effects toward zero — it’s what makes mixed models “borrow strength” across groups.

Theorem LMM.3: At the PLS solution, the penalized weighted residual sum of squares is minimized.

# Group-specific OLS (no pooling)
group_means_ols = np.array([y[group_id == g].mean() for g in range(n_groups)])

# Shrinkage estimate: weighted average of group mean and grand mean
grand_mean = y.mean()
shrinkage = n_per_group / (n_per_group + (2.0 / true_group_sd)**2)  # approximate
group_means_shrunk = shrinkage * group_means_ols + (1 - shrinkage) * grand_mean

print(f"Shrinkage factor: {shrinkage:.3f}")
print(f"\n{'Group':>5}  {'OLS':>8}  {'Shrunk':>8}  {'True':>8}")
for g in range(n_groups):
    true_mean = beta_true[0] + group_effects[g]
    print(f"  {g:>3}  {group_means_ols[g]:>8.2f}  {group_means_shrunk[g]:>8.2f}  {true_mean:>8.2f}")
Shrinkage factor: 0.978

Group       OLS    Shrunk      True
    0     80.53     80.56     71.07
    1     80.62     80.64     71.68
    2     82.72     82.69     73.25
    3     83.17     83.14     73.16
    4     76.07     76.19     65.87
    5     78.10     78.18     67.19
    6     80.41     80.44     71.55
    7     82.02     82.01     71.54
    8     83.38     83.34     71.55
    9     90.63     90.44     81.56
Why does shrinkage help?

The $\|\mathbf{u}\|^2$ penalty pulls group effects toward zero (the population mean). Groups with little data get pulled more strongly. This “shrinkage” or “partial pooling” is what makes mixed models better than fitting each group separately — especially for groups with few observations. A classroom with only 3 students gives a noisy estimate of its mean; the penalty pulls that noisy estimate back toward the overall average, which is usually a better guess than the raw 3-student mean.
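To make the group-size dependence concrete, here is the shrinkage weight $n_g / (n_g + \sigma^2/\sigma^2_b)$ tabulated with the simulation’s true variances (a sketch: in practice the software estimates these variances rather than knowing them):

```python
import numpy as np

sigma, sigma_b = 2.0, 3.0          # residual SD and group SD from the simulation
ratio = (sigma / sigma_b) ** 2     # variance ratio sigma^2 / sigma_b^2

for n_g in [3, 20, 200]:
    weight = n_g / (n_g + ratio)   # weight on the group's own mean
    print(f"n_g = {n_g:>3}: weight on group mean = {weight:.3f}")
# -> 0.871, 0.978, 0.998
```

A 3-student classroom keeps only about 87% of its own mean, while a 200-student classroom is barely shrunk at all.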


The Schur complement trick

The problem

The full mixed model system is large: we’re solving for both $\boldsymbol{\beta}$ ($p$ parameters) and $\mathbf{u}$ ($q$ random effects) simultaneously. For large datasets with many groups, this can be a very big matrix.

Building intuition

The Schur complement lets us eliminate $\mathbf{u}$ from the system and solve a smaller ($p \times p$) system for $\boldsymbol{\beta}$ alone.

Theorem LMM.1: The Schur complement reduces the $(p+q) \times (p+q)$ system to a $p \times p$ system for $\boldsymbol{\beta}$.

# The augmented PLS system for this random-intercept model (with the
# penalty written on the original scale as ||b||²/θ²) is:
# [Z'Z + I/θ²   Z'X] [b]   [Z'y]
# [X'Z          X'X] [β] = [X'y]
#
# The Schur complement eliminates b:
# (X'X - X'Z(Z'Z + I/θ²)^{-1}Z'X) β = X'y - X'Z(Z'Z + I/θ²)^{-1}Z'y

ZtZ = Z.T @ Z
I_q = np.eye(n_groups)
ZtZ_plus_I_inv = np.linalg.inv(ZtZ + I_q / theta**2)

# Reduced system for beta
A = X.T @ X - X.T @ Z @ ZtZ_plus_I_inv @ Z.T @ X
b_rhs = X.T @ y - X.T @ Z @ ZtZ_plus_I_inv @ Z.T @ y
beta_pls = np.linalg.solve(A, b_rhs)

print(f"Fixed effects via PLS: {beta_pls}")
print(f"True fixed effects:    {beta_true}")
print(f"\nSystem reduced from {X.shape[1] + n_groups}x{X.shape[1] + n_groups} to {X.shape[1]}x{X.shape[1]}")
Fixed effects via PLS: [71.8449  2.0165]
True fixed effects:    [70.  2.]

System reduced from 12x12 to 2x2
Why does the Schur complement matter?

Without the Schur complement, we’d solve a $(p+q) \times (p+q)$ system. With it, we solve a $p \times p$ system (typically much smaller, since $p \ll q$). This is crucial for models with many groups — solving a $200 \times 200$ system is much faster than a $10{,}200 \times 10{,}200$ system. In practice, bossanova uses sparse Cholesky factorization (not a dense inverse) to make this even more efficient, but the mathematical idea is the same.
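One reason the random-intercept case is especially cheap (a sketch, not bossanova’s actual code): $\mathbf{Z}^\top\mathbf{Z}$ is diagonal, with the group sizes on the diagonal, so $\mathbf{Z}^\top\mathbf{Z} + \mathbf{I}/\theta^2$ can be “inverted” elementwise.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, n_per, theta = 1000, 20, 1.5
group_id = np.repeat(np.arange(n_groups), n_per)

counts = np.bincount(group_id, minlength=n_groups)  # diagonal of Z'Z (group sizes)
d = counts + 1.0 / theta**2                         # diagonal of Z'Z + I/theta^2

r = rng.normal(size=n_groups)                       # some right-hand side
x = r / d                                           # solves (Z'Z + I/theta^2) x = r in O(q)

print(np.allclose(d * x, r))                        # True
```

For crossed or nested random effects the matrix is no longer diagonal, which is where a genuine sparse Cholesky factorization earns its keep.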


Optimality conditions

Theorem LMM.2: At the PLS solution, the gradient of the PWRSS with respect to β\boldsymbol{\beta} and u\mathbf{u} is zero.

This is the mixed-model analog of OLS.1 (normal equations). At the optimum, the augmented residuals are orthogonal to both $\mathbf{X}$ (for $\boldsymbol{\beta}$) and $\mathbf{Z}\boldsymbol{\Lambda}$ (for $\mathbf{u}$).

# Recover the random effects from beta.
# Note: (Z'Z + I/θ²)^{-1} Z'(y - Xβ) gives b directly on the original scale.
b_hat = ZtZ_plus_I_inv @ Z.T @ (y - X @ beta_pls)
u_hat = b_hat / theta  # spherical random effects: b = theta * u

residuals = y - X @ beta_pls - Z @ b_hat

# Check optimality: X'e ≈ 0 and theta * Z'e - u ≈ 0 (zero PWRSS gradient)
print(f"max |X'e| (should be ~0): {np.max(np.abs(X.T @ residuals)):.6f}")
print(f"max |theta*Z'e - u| (should be ~0): {np.max(np.abs(theta * Z.T @ residuals - u_hat)):.6f}")
max |X'e| (should be ~0): 0.000000
max |theta*Z'e - u| (should be ~0): 0.000000

Summary

| Theorem | Statement | What it means |
| --- | --- | --- |
| LMM.1 | Schur complement reduction | Efficient solving for $\boldsymbol{\beta}$ |
| LMM.2 | Gradient = 0 at solution | Optimality conditions |
| LMM.3 | PWRSS minimized | Penalized least squares objective |
| LMM.4–5 | $\text{Var}(\mathbf{b}) = \sigma^2\boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top$ via $\boldsymbol{\theta}$ | Random effects parameterization |