Not all outcomes are continuous and normally distributed. When your response is binary (yes/no), a count (0, 1, 2, ...), or always positive, ordinary least squares doesn't work well --- it can predict impossible values like negative counts or probabilities greater than one. Generalized linear models (GLMs) extend regression to these cases through two key ideas: a link function that maps predictions to the right scale, and an iterative fitting algorithm, iteratively reweighted least squares (IRLS), that fits the model by solving a sequence of weighted regression problems.
GLMs unify a wide range of models under one framework. Logistic regression, Poisson regression, and gamma regression are all GLMs --- they differ only in the choice of link function and variance structure. Once you understand the IRLS machinery, every GLM works the same way.
Beyond normal outcomes¶
Binary data¶
Consider a concrete example: predicting whether a student passes (1) or fails (0) an exam based on hours studied. The probability of passing increases with study time, but it must stay between 0 and 1. A straight line can’t enforce that constraint.
import numpy as np
from scipy.special import expit

n = 200
hours = np.random.uniform(1, 10, n)
prob_pass = expit(-3 + 0.8 * hours)  # true logistic model
passed = np.random.binomial(1, prob_pass)
print(f"Students who passed: {passed.sum()}/{n}")
print(f"Pass rate at 2 hours: {expit(-3 + 0.8 * 2):.1%}")
print(f"Pass rate at 8 hours: {expit(-3 + 0.8 * 8):.1%}")
Students who passed: 142/200
Pass rate at 2 hours: 19.8%
Pass rate at 8 hours: 96.8%
If we tried to fit this with OLS, the model would happily predict probabilities below 0 for students who barely study and above 1 for students who study a lot. We need a link function to keep predictions on the right scale.
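To see this concretely, here is a minimal sketch that re-simulates similar pass/fail data (with its own arbitrary seed, so the exact numbers differ from the cells above) and fits OLS directly to the 0/1 outcomes. Extrapolating the fitted line to a long study time yields a "probability" above 1:

```python
import numpy as np

# Re-simulate pass/fail data similar to the example above (illustrative seed).
rng = np.random.default_rng(0)
n = 200
hours = rng.uniform(1, 10, n)
prob_pass = 1 / (1 + np.exp(-(-3 + 0.8 * hours)))
passed = rng.binomial(1, prob_pass)

# Fit OLS directly to the binary outcomes.
X = np.column_stack([np.ones(n), hours])
beta_ols, *_ = np.linalg.lstsq(X, passed, rcond=None)

# The straight line knows nothing about the [0, 1] constraint.
for h in [2.0, 8.0, 12.0]:
    pred = beta_ols[0] + beta_ols[1] * h
    print(f"OLS 'probability' at {h:4.1f} hours: {pred:.3f}")
```

At 12 hours the fitted line lands above 1 --- a prediction no probability model should make.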
Link functions¶
The idea¶
A link function $g$ maps the expected response $\mu = \mathbb{E}[y]$ to the linear predictor $\eta = X\beta$ via $g(\mu) = \eta$:
Identity: $g(\mu) = \mu$ (standard regression --- no transformation)
Logit: $g(\mu) = \log\frac{\mu}{1 - \mu}$ (binary outcomes --- maps (0, 1) to all reals)
Log: $g(\mu) = \log \mu$ (count data --- maps positive reals to all reals)
The inverse link maps back: given any linear predictor $\eta$, the inverse link produces a valid expected response $\mu = g^{-1}(\eta)$. For the logit link, the inverse is the logistic function $g^{-1}(\eta) = \frac{1}{1 + e^{-\eta}}$, which always returns a value between 0 and 1.
from bossanova.internal.maths.family import logit_link, logit_link_inverse
probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
etas = logit_link(probs)
back = logit_link_inverse(etas)
for p, e, b in zip(probs, etas, back):
    print(f"  p={p:.2f} → η={e:.4f} → back to p={b:.2f}")
p=0.10 → η=-2.1972 → back to p=0.10
p=0.25 → η=-1.0986 → back to p=0.25
p=0.50 → η=0.0000 → back to p=0.50
p=0.75 → η=1.0986 → back to p=0.75
p=0.90 → η=2.1972 → back to p=0.90
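If bossanova isn't at hand, the same round trip works with a few lines of plain numpy. The function names here are illustrative, not the bossanova implementations:

```python
import numpy as np

def logit(p):
    # logit link: maps (0, 1) to the whole real line
    return np.log(p / (1 - p))

def inv_logit(eta):
    # logistic function: maps the real line back to (0, 1)
    return 1 / (1 + np.exp(-eta))

probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print(np.allclose(inv_logit(logit(probs)), probs))  # round trip recovers every p
```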
Weighted normal equations¶
The problem¶
In GLMs, observations have unequal variance. For binary data, the variance of an observation with predicted probability $\mu$ is $\mu(1 - \mu)$ --- observations near 0.5 are more variable than those near 0 or 1. We need weighted least squares to account for this.
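A quick numeric check of how the binomial variance $\mu(1-\mu)$ behaves across the probability range:

```python
import numpy as np

# Binomial variance mu * (1 - mu): peaks at mu = 0.5, vanishes at the ends.
mus = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
variances = mus * (1 - mus)
for mu, v in zip(mus, variances):
    print(f"mu = {mu:.2f} -> variance = {v:.4f}")
```

An observation predicted at 0.5 is five times as variable as one predicted at 0.05, which is exactly why the two should not be weighted equally.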
Building intuition¶
At each iteration of IRLS, we solve a weighted version of the normal equations:
$$X^\top W X \,\beta = X^\top W z$$
where $W$ is a diagonal weight matrix derived from the current fitted values and $z$ is the "working response" --- a linearized version of the actual response. This is the same orthogonality condition as in OLS, but now weighted by $W$ to account for the heterogeneous variance across observations.
The formal statement¶
Theorem WLS.1: At the WLS solution, $X^\top W (y - X\hat\beta) = 0$.
# Manual WLS: beta = (X'WX)^{-1} X'Wy
X = np.column_stack([np.ones(n), hours])
mu_init = np.full(n, passed.mean()) # initial guess
W = np.diag(mu_init * (1 - mu_init)) # binomial variance
XtWX = X.T @ W @ X
XtWy = X.T @ W @ passed
beta_wls = np.linalg.solve(XtWX, XtWy)
residuals = passed - X @ beta_wls
weighted_score = X.T @ W @ residuals
print(f"Weighted score (should be ~0): {weighted_score}")
Weighted score (should be ~0): [-0. -0.]
How does this relate to OLS?
This is the same orthogonality condition as OLS.1, but weighted. The weights account for heterogeneous variance: observations with higher variance get less weight. When all weights are equal ($W = I$), this reduces exactly to the ordinary normal equations $X^\top (y - X\hat\beta) = 0$.
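The $W = I$ reduction is easy to verify numerically. A minimal sketch on arbitrary simulated data:

```python
import numpy as np

# With equal weights (W = I), the WLS solution coincides with OLS.
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.normal(size=50)

W = np.eye(50)  # equal weights
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_wls, beta_ols))  # the two solutions agree
```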
IRLS: iterating to the solution¶
The problem¶
Unlike OLS, GLMs don’t have a closed-form solution. The weights depend on the fitted values, which depend on the coefficients, which depend on the weights. We need to iterate.
Building intuition¶
IRLS works by repeatedly:
Compute the current fitted values and their derivatives
Construct working weights and working response
Solve the weighted least squares problem
Update and check for convergence
Each step is just a weighted regression --- something we already know how to do. The key insight is that by updating the weights and working response at each step, we’re climbing the likelihood surface toward the maximum likelihood estimate.
# Simple IRLS for logistic regression
beta = np.zeros(2)
for iteration in range(8):
    eta = X @ beta
    mu = expit(eta)
    w = mu * (1 - mu)
    W = np.diag(w)
    z = eta + (passed - mu) / w  # working response
    beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
    # Deviance (negative log-likelihood * 2)
    mu_clipped = np.clip(mu, 1e-10, 1 - 1e-10)
    deviance = -2 * np.sum(passed * np.log(mu_clipped) + (1 - passed) * np.log(1 - mu_clipped))
    change = np.max(np.abs(beta_new - beta))
    print(f"  Iteration {iteration}: deviance={deviance:.2f}, max|Δβ|={change:.6f}")
    beta = beta_new
    if change < 1e-8:
        print("  Converged!")
        break
Iteration 0: deviance=277.26, max|Δβ|=1.395698
Iteration 1: deviance=167.52, max|Δβ|=0.759401
Iteration 2: deviance=151.18, max|Δβ|=0.453346
Iteration 3: deviance=148.25, max|Δβ|=0.133020
Iteration 4: deviance=148.09, max|Δβ|=0.009093
Iteration 5: deviance=148.09, max|Δβ|=0.000040
Iteration 6: deviance=148.09, max|Δβ|=0.000000
Converged!
Score equations at convergence¶
The problem¶
How do we know IRLS found the right answer? At the maximum likelihood estimate, the gradient of the log-likelihood (called the score) must be zero.
Building intuition¶
For a GLM with canonical link, the score has a remarkably simple form: $X^\top (y - \hat\mu) = 0$. This says the residuals $y - \hat\mu$ must be orthogonal to every predictor column --- the same geometric condition as OLS, but now on the probability scale rather than the response scale.
The formal statement¶
Theorem WLS.3: At convergence for a GLM with canonical link, $X^\top (y - \hat\mu) = 0$.
mu_final = expit(X @ beta)
score = X.T @ (passed - mu_final)
print(f"Score at convergence: {score}")
print(f"Max |score|: {np.max(np.abs(score)):.2e}")
Score at convergence: [0. 0.]
Max |score|: 1.96e-13
Why does the weight matrix disappear?
For canonical links (logit for binomial, log for Poisson), the weight matrix in the score equations cancels out, giving the elegant form $X^\top (y - \hat\mu) = 0$. This is a self-consistency condition: the predicted means are consistent with the data in the sense that no predictor can explain any remaining systematic pattern in the residuals.
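The cancellation is one line of chain-rule algebra. For a GLM with dispersion fixed at 1 (as for binomial and Poisson), the score (gradient of the log-likelihood) is

```latex
\nabla_{\beta}\,\ell(\beta) \;=\; X^{\top} D\, V^{-1} (y - \mu),
\qquad
D = \operatorname{diag}\!\left(\tfrac{\partial \mu_i}{\partial \eta_i}\right),
\quad
V = \operatorname{diag}\bigl(\operatorname{Var}(y_i)\bigr).
```

For the logit link, $\partial \mu_i / \partial \eta_i = \mu_i(1 - \mu_i)$, which is exactly the binomial variance, so $D = V$, the factor $D V^{-1}$ is the identity, and the score collapses to $X^\top (y - \mu)$. This is precisely what makes a link "canonical."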
Seeing it in bossanova¶
from bossanova.internal.maths.family import binomial
from bossanova.internal.maths.solvers import fit_glm_irls
fam = binomial()
result = fit_glm_irls(passed, X, weights=np.ones(n), family=fam)
print(f"Coefficients: {result['coef']}")
print(f"Converged: {result['converged']}")
print(f"Iterations: {result['n_iter']}")
# Verify score equations
mu_bn = fam.link_inverse(X @ result["coef"])
print(f"Score at convergence: {X.T @ (passed - mu_bn)}")
Coefficients: [-2.7506 0.8444]
Converged: True
Iterations: 6
Score at convergence: [-0. 0.]
Summary¶
| Theorem | Statement | What it means |
|---|---|---|
| WLS.1 | $X^\top W (y - X\hat\beta) = 0$ | Weighted residual orthogonality |
| WLS.3 | $X^\top (y - \hat\mu) = 0$ at convergence | Score equations satisfied |