
Generalized Linear Models

UCSD Psychology

Not all outcomes are continuous and normally distributed. When your response is binary (yes/no), a count (0, 1, 2, ...), or always positive, ordinary least squares doesn’t work well --- it can predict impossible values like negative counts or probabilities greater than one. Generalized linear models (GLMs) extend regression to these cases through two key ideas: a link function that maps predictions to the right scale, and iteratively reweighted least squares (IRLS) that fits the model by solving a sequence of weighted regression problems.

GLMs unify a wide range of models under one framework. Logistic regression, Poisson regression, and gamma regression are all GLMs --- they differ only in the choice of link function and variance structure. Once you understand the IRLS machinery, every GLM works the same way.
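To make the "only the link and variance differ" point concrete, here is a minimal sketch (purely illustrative, not the bossanova API) pairing three common families with an inverse link and a variance function:

```python
import numpy as np

# Hypothetical illustration: a family is just an (inverse link, variance function) pair.
families = {
    "binomial (logit link)": (lambda eta: 1 / (1 + np.exp(-eta)),  # inverse logit
                              lambda mu: mu * (1 - mu)),
    "poisson (log link)":    (np.exp,                               # inverse log
                              lambda mu: mu),
    "gamma (log link)":      (np.exp,
                              lambda mu: mu ** 2),
}

eta = 0.0  # a linear predictor of zero
for name, (inv_link, variance) in families.items():
    mu = inv_link(eta)
    print(f"{name}: mu = {mu:.2f}, variance at that mu = {variance(mu):.2f}")
```

Swap the pair and the same IRLS machinery fits a different model; nothing else changes.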

Beyond normal outcomes

Binary data

Consider a concrete example: predicting whether a student passes (1) or fails (0) an exam based on hours studied. The probability of passing increases with study time, but it must stay between 0 and 1. A straight line can’t enforce that constraint.

import numpy as np
from scipy.special import expit

n = 200
hours = np.random.uniform(1, 10, n)
prob_pass = expit(-3 + 0.8 * hours)  # true logistic model
passed = np.random.binomial(1, prob_pass)

print(f"Students who passed: {passed.sum()}/{n}")
print(f"Pass rate at 2 hours: {expit(-3 + 0.8 * 2):.1%}")
print(f"Pass rate at 8 hours: {expit(-3 + 0.8 * 8):.1%}")
Students who passed: 142/200
Pass rate at 2 hours: 19.8%
Pass rate at 8 hours: 96.8%

If we tried to fit this with OLS, the model would happily predict probabilities below 0 for students who barely study and above 1 for students who study a lot. We need a link function to keep predictions on the right scale.
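A quick way to see the failure (a self-contained sketch with its own seeded simulation of the same kind of pass/fail data, so the exact numbers differ from those above):

```python
import numpy as np
from scipy.special import expit

# Simulate pass/fail data from the same true logistic model, with a fixed seed.
rng = np.random.default_rng(0)
n = 200
hours = rng.uniform(1, 10, n)
passed = rng.binomial(1, expit(-3 + 0.8 * hours))

# Fit a straight line to the 0/1 outcomes with OLS.
X = np.column_stack([np.ones(n), hours])
beta_ols, *_ = np.linalg.lstsq(X, passed, rcond=None)

# Extrapolating gives "probabilities" outside [0, 1].
for h in (-5, 5, 16):
    pred = beta_ols[0] + beta_ols[1] * h
    print(f"hours = {h:>3}: OLS 'probability' = {pred:.2f}")
```

At very low study times the line dips below 0, and at high study times it exceeds 1.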


The idea

A link function $g$ maps the expected response $\mu$ to the linear predictor $\eta = \mathbf{X}\boldsymbol{\beta}$:

$$
g(\mu) = \eta = \mathbf{X}\boldsymbol{\beta}
$$

The inverse link maps back: given any linear predictor $\eta$, the inverse link produces a valid expected response $\mu$. For the logit link, the inverse is the logistic function $\mu = 1 / (1 + e^{-\eta})$, which always returns a value between 0 and 1.

from bossanova.internal.maths.family import logit_link, logit_link_inverse

probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
etas = logit_link(probs)
back = logit_link_inverse(etas)

for p, e, b in zip(probs, etas, back):
    print(f"  p={p:.2f} → η={e:.4f} → back to p={b:.2f}")
  p=0.10 → η=-2.1972 → back to p=0.10
  p=0.25 → η=-1.0986 → back to p=0.25
  p=0.50 → η=0.0000 → back to p=0.50
  p=0.75 → η=1.0986 → back to p=0.75
  p=0.90 → η=2.1972 → back to p=0.90

Weighted normal equations

The problem

In GLMs, observations have unequal variance. For binary data, the variance of an observation with predicted probability $\mu$ is $\mu(1 - \mu)$ --- observations near 0.5 are more variable than those near 0 or 1. We need weighted least squares to account for this.
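The binomial variance function is easy to inspect directly: it peaks at probability 0.5 and vanishes at the boundaries.

```python
import numpy as np

# Binomial variance mu * (1 - mu) across a range of predicted probabilities.
mu = np.array([0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99])
var = mu * (1 - mu)

for m, v in zip(mu, var):
    print(f"mu = {m:.2f}: variance = {v:.4f}")

# The most variable observations sit at mu = 0.5.
print("variance is largest at mu =", mu[np.argmax(var)])
```

This is why observations with fitted probabilities near 0 or 1 get downweighted: they carry almost no noise, so small residuals there are informative.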

Building intuition

At each iteration of IRLS, we solve a weighted version of the normal equations:

$$
\mathbf{X}^\top\mathbf{W}(\mathbf{y}^* - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}
$$

where $\mathbf{W}$ is a diagonal weight matrix derived from the current fitted values and $\mathbf{y}^*$ is the “working response” --- a linearized version of the actual response. This is the same orthogonality condition as in OLS, but now weighted by $\mathbf{W}$ to account for the heterogeneous variance across observations.

The formal statement

Theorem WLS.1: At the WLS solution, $\mathbf{X}^\top\mathbf{W}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$.

# Manual WLS: beta = (X'WX)^{-1} X'Wy
X = np.column_stack([np.ones(n), hours])
mu_init = np.full(n, passed.mean())  # initial guess
W = np.diag(mu_init * (1 - mu_init))  # binomial variance

XtWX = X.T @ W @ X
XtWy = X.T @ W @ passed
beta_wls = np.linalg.solve(XtWX, XtWy)

residuals = passed - X @ beta_wls
weighted_score = X.T @ W @ residuals
print(f"Weighted score (should be ~0): {weighted_score}")
Weighted score (should be ~0): [-0. -0.]
How does this relate to OLS?

This is the same orthogonality condition as OLS.1, but weighted. The weights account for heterogeneous variance: observations with higher variance get less weight. When all weights are equal ($\mathbf{W} = \mathbf{I}$), this reduces exactly to the ordinary normal equations $\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$.
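The identity-weights reduction is easy to check numerically on simulated data (a self-contained sketch, unrelated to the exam example):

```python
import numpy as np

# Simulated design and response for the check.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.normal(size=50)

# WLS with identity weights: beta = (X'WX)^{-1} X'Wy with W = I.
W = np.eye(50)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Ordinary least squares on the same data.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_wls, beta_ols))  # → True: the two solutions coincide
```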


IRLS: iterating to the solution

The problem

Unlike OLS, GLMs don’t have a closed-form solution. The weights depend on the fitted values, which depend on the coefficients, which depend on the weights. We need to iterate.

Building intuition

IRLS works by repeatedly:

  1. Compute the current fitted values $\mu$ and their derivatives

  2. Construct working weights $\mathbf{W}$ and working response $\mathbf{y}^*$

  3. Solve the weighted least squares problem

  4. Update $\mu$ and check for convergence

Each step is just a weighted regression --- something we already know how to do. The key insight is that by updating the weights and working response at each step, we’re climbing the likelihood surface toward the maximum likelihood estimate.
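In symbols, the working quantities in step 2 come from a first-order Taylor expansion of the link around the current fit. For a general link $g$:

```latex
z_i = \eta_i + (y_i - \mu_i)\, g'(\mu_i),
\qquad
W_{ii} = \frac{1}{\operatorname{Var}(y_i)\, g'(\mu_i)^2}
```

For the logit link, $g'(\mu) = 1/[\mu(1-\mu)]$ and $\operatorname{Var}(y_i) = \mu_i(1-\mu_i)$, so $z_i = \eta_i + (y_i - \mu_i)/[\mu_i(1-\mu_i)]$ and $W_{ii} = \mu_i(1-\mu_i)$ --- exactly the `z` and `w` lines in the code below.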

# Simple IRLS for logistic regression
beta = np.zeros(2)
for iteration in range(8):
    eta = X @ beta
    mu = expit(eta)
    w = mu * (1 - mu)
    W = np.diag(w)
    z = eta + (passed - mu) / w  # working response

    beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)

    # Deviance (negative log-likelihood * 2)
    mu_clipped = np.clip(mu, 1e-10, 1 - 1e-10)
    deviance = -2 * np.sum(passed * np.log(mu_clipped) + (1 - passed) * np.log(1 - mu_clipped))

    change = np.max(np.abs(beta_new - beta))
    print(f"  Iteration {iteration}: deviance={deviance:.2f}, max|Δβ|={change:.6f}")
    beta = beta_new

    if change < 1e-8:
        print(f"  Converged!")
        break
  Iteration 0: deviance=277.26, max|Δβ|=1.395698
  Iteration 1: deviance=167.52, max|Δβ|=0.759401
  Iteration 2: deviance=151.18, max|Δβ|=0.453346
  Iteration 3: deviance=148.25, max|Δβ|=0.133020
  Iteration 4: deviance=148.09, max|Δβ|=0.009093
  Iteration 5: deviance=148.09, max|Δβ|=0.000040
  Iteration 6: deviance=148.09, max|Δβ|=0.000000
  Converged!

Score equations at convergence

The problem

How do we know IRLS found the right answer? At the maximum likelihood estimate, the gradient of the log-likelihood (called the score) must be zero.

Building intuition

For a GLM with canonical link, the score has a remarkably simple form: $\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$. This says the residuals $(y - \mu)$ must be orthogonal to every predictor column --- the same geometric condition as OLS, but now on the probability scale rather than the response scale.

The formal statement

Theorem WLS.3: At convergence for a GLM with canonical link, $\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu}) \approx \mathbf{0}$.

mu_final = expit(X @ beta)
score = X.T @ (passed - mu_final)
print(f"Score at convergence: {score}")
print(f"Max |score|: {np.max(np.abs(score)):.2e}")
Score at convergence: [0. 0.]
Max |score|: 1.96e-13
Why does the weight matrix disappear?

For canonical links (logit for binomial, log for Poisson), the weight matrix in the score equations cancels out, giving the elegant form $\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$. This is a self-consistency condition: the predicted means are consistent with the data in the sense that no predictor can explain any remaining systematic pattern in the residuals.
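A sketch of why the cancellation happens: by the chain rule, the score for coefficient $\beta_j$ is

```latex
\frac{\partial \ell}{\partial \beta_j}
= \sum_i \frac{y_i - \mu_i}{\operatorname{Var}(y_i)}\,
  \frac{\partial \mu_i}{\partial \eta_i}\, x_{ij}
```

For a canonical link, $\partial \mu_i / \partial \eta_i$ equals the variance function (for the logit, $\partial \mu / \partial \eta = \mu(1-\mu)$, which is exactly the binomial variance), so the two factors cancel and the score reduces to $\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu})$.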

Seeing it in bossanova

from bossanova.internal.maths.family import binomial
from bossanova.internal.maths.solvers import fit_glm_irls

fam = binomial()
result = fit_glm_irls(passed, X, weights=np.ones(n), family=fam)

print(f"Coefficients: {result['coef']}")
print(f"Converged: {result['converged']}")
print(f"Iterations: {result['n_iter']}")

# Verify score equations
mu_bn = fam.link_inverse(X @ result["coef"])
print(f"Score at convergence: {X.T @ (passed - mu_bn)}")
Coefficients: [-2.7506  0.8444]
Converged: True
Iterations: 6
Score at convergence: [-0.  0.]

Summary

| Theorem | Statement | What it means |
| --- | --- | --- |
| WLS.1 | $\mathbf{X}^\top\mathbf{W}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$ | Weighted residual orthogonality |
| WLS.3 | $\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu}) \approx \mathbf{0}$ at convergence | Score equations satisfied |