Not all outcomes are continuous and normally distributed. When your response is binary (yes/no), a count (0, 1, 2, ...), or always positive, ordinary least squares doesn't work well --- it can predict impossible values like negative counts or probabilities greater than one. Generalized linear models (GLMs) extend regression to these cases through two key ideas: a link function that maps predictions to the right scale, and an iterative fitting algorithm, iteratively reweighted least squares (IRLS), that fits the model by solving a sequence of weighted regression problems.
GLMs unify a wide range of models under one framework. Logistic regression, Poisson regression, and gamma regression are all GLMs --- they differ only in the choice of link function and variance structure. Once you understand the IRLS machinery, every GLM works the same way.
Beyond normal outcomes¶
Binary data¶
Consider a concrete example: predicting whether a student passes (1) or fails (0) an exam based on hours studied. The probability of passing increases with study time, but it must stay between 0 and 1. A straight line can’t enforce that constraint.
import numpy as np
from scipy.special import expit

n = 200
hours = np.random.uniform(1, 10, n)
prob_pass = expit(-3 + 0.8 * hours)  # true logistic model
passed = np.random.binomial(1, prob_pass)
print(f"Students who passed: {passed.sum()}/{n}")
print(f"Pass rate at 2 hours: {expit(-3 + 0.8 * 2):.1%}")
print(f"Pass rate at 8 hours: {expit(-3 + 0.8 * 8):.1%}")
Students who passed: 142/200
Pass rate at 2 hours: 19.8%
Pass rate at 8 hours: 96.8%
If we tried to fit this with OLS, the model would happily predict probabilities below 0 for students who barely study and above 1 for students who study a lot. We need a link function to keep predictions on the right scale.
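To see this concretely, here is a minimal sketch that re-simulates similar pass/fail data (with its own arbitrary seed, so the exact numbers differ from the cells above) and fits OLS directly to the 0/1 outcomes. Extrapolating the fitted line to a long study time yields a "probability" above 1:

```python
import numpy as np

# Re-simulate pass/fail data similar to the example above (illustrative seed).
rng = np.random.default_rng(0)
n = 200
hours = rng.uniform(1, 10, n)
prob_pass = 1 / (1 + np.exp(-(-3 + 0.8 * hours)))
passed = rng.binomial(1, prob_pass)

# Fit OLS directly to the binary outcomes.
X = np.column_stack([np.ones(n), hours])
beta_ols, *_ = np.linalg.lstsq(X, passed, rcond=None)

# The straight line knows nothing about the [0, 1] constraint.
for h in [2.0, 8.0, 12.0]:
    pred = beta_ols[0] + beta_ols[1] * h
    print(f"OLS 'probability' at {h:4.1f} hours: {pred:.3f}")
```

At 12 hours the fitted line lands above 1 --- a prediction no probability model should make.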
Link functions¶
The idea¶
A link function $g$ maps the expected response $\mu = \mathbb{E}[y]$ to the linear predictor $\eta = X\beta$ via $g(\mu) = \eta$:
Identity: $g(\mu) = \mu$ (standard regression --- no transformation)
Logit: $g(\mu) = \log\frac{\mu}{1 - \mu}$ (binary outcomes --- maps (0, 1) to all reals)
Log: $g(\mu) = \log \mu$ (count data --- maps positive reals to all reals)
The inverse link maps back: given any linear predictor $\eta$, the inverse link produces a valid expected response $\mu = g^{-1}(\eta)$. For the logit link, the inverse is the logistic function $g^{-1}(\eta) = \frac{1}{1 + e^{-\eta}}$, which always returns a value between 0 and 1.
from bossanova.internal.maths.family import logit_link, logit_link_inverse
probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
etas = logit_link(probs)
back = logit_link_inverse(etas)
for p, e, b in zip(probs, etas, back):
    print(f"  p={p:.2f} → η={e:.4f} → back to p={b:.2f}")
p=0.10 → η=-2.1972 → back to p=0.10
p=0.25 → η=-1.0986 → back to p=0.25
p=0.50 → η=0.0000 → back to p=0.50
p=0.75 → η=1.0986 → back to p=0.75
p=0.90 → η=2.1972 → back to p=0.90
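If bossanova isn't at hand, the same round trip works with a few lines of plain numpy. The function names here are illustrative, not the bossanova implementations:

```python
import numpy as np

def logit(p):
    # logit link: maps (0, 1) to the whole real line
    return np.log(p / (1 - p))

def inv_logit(eta):
    # logistic function: maps the real line back to (0, 1)
    return 1 / (1 + np.exp(-eta))

probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print(np.allclose(inv_logit(logit(probs)), probs))  # round trip recovers every p
```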
Weighted normal equations¶
The problem¶
In GLMs, observations have unequal variance. For binary data, the variance of an observation with predicted probability $\mu$ is $\mu(1 - \mu)$ --- observations near 0.5 are more variable than those near 0 or 1. We need weighted least squares to account for this.
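A quick numeric check of how the binomial variance $\mu(1-\mu)$ behaves across the probability range:

```python
import numpy as np

# Binomial variance mu * (1 - mu): peaks at mu = 0.5, vanishes at the ends.
mus = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
variances = mus * (1 - mus)
for mu, v in zip(mus, variances):
    print(f"mu = {mu:.2f} -> variance = {v:.4f}")
```

An observation predicted at 0.5 is five times as variable as one predicted at 0.05, which is exactly why the two should not be weighted equally.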
Building intuition¶
At each iteration of IRLS, we solve a weighted version of the normal equations:
$$X^\top W X \,\beta = X^\top W z$$
where $W$ is a diagonal weight matrix derived from the current fitted values and $z$ is the "working response" --- a linearized version of the actual response. This is the same orthogonality condition as in OLS, but now weighted by $W$ to account for the heterogeneous variance across observations.
The formal statement¶
Theorem WLS.1: At the WLS solution, $X^\top W (y - X\hat\beta) = 0$.
# Manual WLS: beta = (X'WX)^{-1} X'Wy
X = np.column_stack([np.ones(n), hours])
mu_init = np.full(n, passed.mean()) # initial guess
W = np.diag(mu_init * (1 - mu_init)) # binomial variance
XtWX = X.T @ W @ X
XtWy = X.T @ W @ passed
beta_wls = np.linalg.solve(XtWX, XtWy)
residuals = passed - X @ beta_wls
weighted_score = X.T @ W @ residuals
print(f"Weighted score (should be ~0): {weighted_score}")
Weighted score (should be ~0): [-0. -0.]
How does this relate to OLS?
This is the same orthogonality condition as OLS.1, but weighted. The weights account for heterogeneous variance: observations with higher variance get less weight. When all weights are equal ($W = I$), this reduces exactly to the ordinary normal equations $X^\top (y - X\hat\beta) = 0$.
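The $W = I$ reduction is easy to verify numerically. A minimal sketch on arbitrary simulated data:

```python
import numpy as np

# With equal weights (W = I), the WLS solution coincides with OLS.
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.normal(size=50)

W = np.eye(50)  # equal weights
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_wls, beta_ols))  # the two solutions agree
```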
IRLS: iterating to the solution¶
The problem¶
Unlike OLS, GLMs don’t have a closed-form solution. The weights depend on the fitted values, which depend on the coefficients, which depend on the weights. We need to iterate.
Building intuition¶
IRLS works by repeatedly:
Compute the current fitted values and their derivatives
Construct working weights and working response
Solve the weighted least squares problem
Update and check for convergence
Each step is just a weighted regression --- something we already know how to do. The key insight is that by updating the weights and working response at each step, we’re climbing the likelihood surface toward the maximum likelihood estimate.
# Simple IRLS for logistic regression
beta = np.zeros(2)
for iteration in range(8):
    eta = X @ beta
    mu = expit(eta)
    w = mu * (1 - mu)
    W = np.diag(w)
    z = eta + (passed - mu) / w  # working response
    beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
    # Deviance (negative log-likelihood * 2)
    mu_clipped = np.clip(mu, 1e-10, 1 - 1e-10)
    deviance = -2 * np.sum(passed * np.log(mu_clipped) + (1 - passed) * np.log(1 - mu_clipped))
    change = np.max(np.abs(beta_new - beta))
    print(f"  Iteration {iteration}: deviance={deviance:.2f}, max|Δβ|={change:.6f}")
    beta = beta_new
    if change < 1e-8:
        print("  Converged!")
        break
Iteration 0: deviance=277.26, max|Δβ|=1.395698
Iteration 1: deviance=167.52, max|Δβ|=0.759401
Iteration 2: deviance=151.18, max|Δβ|=0.453346
Iteration 3: deviance=148.25, max|Δβ|=0.133020
Iteration 4: deviance=148.09, max|Δβ|=0.009093
Iteration 5: deviance=148.09, max|Δβ|=0.000040
Iteration 6: deviance=148.09, max|Δβ|=0.000000
Converged!
Score equations at convergence¶
The problem¶
How do we know IRLS found the right answer? At the maximum likelihood estimate, the gradient of the log-likelihood (called the score) must be zero.
Building intuition¶
For a GLM with canonical link, the score has a remarkably simple form: $X^\top (y - \hat\mu) = 0$. This says the residuals $y - \hat\mu$ must be orthogonal to every predictor column --- the same geometric condition as OLS, but now on the probability scale rather than the response scale.
The formal statement¶
Theorem WLS.3: At convergence for a GLM with canonical link, $X^\top (y - \hat\mu) = 0$.
mu_final = expit(X @ beta)
score = X.T @ (passed - mu_final)
print(f"Score at convergence: {score}")
print(f"Max |score|: {np.max(np.abs(score)):.2e}")
Score at convergence: [0. 0.]
Max |score|: 1.96e-13
Why does the weight matrix disappear?
For canonical links (logit for binomial, log for Poisson), the weight matrix in the score equations cancels out, giving the elegant form $X^\top (y - \hat\mu) = 0$. This is a self-consistency condition: the predicted means are consistent with the data in the sense that no predictor can explain any remaining systematic pattern in the residuals.
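The cancellation is one line of chain-rule algebra. For a GLM with dispersion fixed at 1 (as for binomial and Poisson), the score (gradient of the log-likelihood) is

```latex
\nabla_{\beta}\,\ell(\beta) \;=\; X^{\top} D\, V^{-1} (y - \mu),
\qquad
D = \operatorname{diag}\!\left(\tfrac{\partial \mu_i}{\partial \eta_i}\right),
\quad
V = \operatorname{diag}\bigl(\operatorname{Var}(y_i)\bigr).
```

For the logit link, $\partial \mu_i / \partial \eta_i = \mu_i(1 - \mu_i)$, which is exactly the binomial variance, so $D = V$, the factor $D V^{-1}$ is the identity, and the score collapses to $X^\top (y - \mu)$. This is precisely what makes a link "canonical."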
Seeing it in bossanova¶
from bossanova.internal.maths.family import binomial
from bossanova.internal.maths.solvers import fit_glm_irls
fam = binomial()
result = fit_glm_irls(passed, X, weights=np.ones(n), family=fam)
print(f"Coefficients: {result['coef']}")
print(f"Converged: {result['converged']}")
print(f"Iterations: {result['n_iter']}")
# Verify score equations
mu_bn = fam.link_inverse(X @ result["coef"])
print(f"Score at convergence: {X.T @ (passed - mu_bn)}")
Coefficients: [-2.7506 0.8444]
Converged: True
Iterations: 6
Score at convergence: [-0. 0.]
Summary¶
| Theorem | Statement | What it means |
|---|---|---|
| WLS.1 | $X^\top W (y - X\hat\beta) = 0$ | Weighted residual orthogonality |
| WLS.3 | $X^\top (y - \hat\mu) = 0$ at convergence | Score equations satisfied |