A reference for statistical terminology used throughout this documentation. Use term cross-reference syntax (e.g., GLM) to link to any definition.
Model Types
- linear model
- A model expressing an outcome as a linear combination of predictors plus random error: y = β₀ + β₁x₁ + ⋯ + βₚxₚ + ε. The foundation of regression analysis. Also known as lm.
- GLM
- Extension of linear regression to non-Gaussian outcomes (binary, count data) using a link function: g(E[y]) = Xβ. Also known as generalized linear model. See Generalized Models.
- LMM
- A linear model with both population-level and varying effects for clustered or longitudinal data: y = Xβ + Zu + ε. Also known as linear mixed model or lmer. See Mixed Effects.
- GLMM
- Combines GLM link functions with varying effects for clustered non-Gaussian data. Uses PIRLS and Laplace approximation for estimation. Also known as generalized linear mixed model or glmer. See Generalized Mixed.
Model Components
- population parameters
- Parameters shared across all groups, estimated from the full dataset. In LMMs, these are directly interpretable as population-average effects. In GLMMs, they describe conditional (within-group) effects on the link scale. Also called fixed effects in classical mixed-model literature.
- varying effects
- Group-specific deviations from population parameters. Modeled as draws from a normal distribution: uⱼ ~ N(0, σᵤ²). Each group gets its own offset, shrunk toward zero by the data. Also called random effects in classical literature—but “varying” better conveys what they are: parameters that vary by group.
- design matrix
- The numerical encoding of predictors (X) where each row is an observation and each column is a model term. Inspect with m.designmat or m.plot_design().
- coefficients
- The unknown parameters estimated by fitting the model. Each coefficient represents the expected change in outcome for a one-unit change in its corresponding predictor, holding other predictors constant.
- residuals
- The difference between observed values and model predictions: eᵢ = yᵢ − ŷᵢ. Represent the model’s “remaining ignorance”—variation not explained by predictors.
- variance components
- Parameters describing the magnitude of varying effects variation. For a varying slopes model, includes the variance of intercepts (σ₀²), the variance of slopes (σ₁²), and their correlation (ρ₀₁).
- intercept
- The expected value of the outcome when all predictors equal zero. Centering predictors makes the intercept more interpretable (prediction at the mean rather than at zero).
Estimation Methods
- OLS
- Estimation method that finds coefficients minimizing the sum of squared residuals: β̂ = argminᵦ Σᵢ (yᵢ − xᵢᵀβ)². Has a closed-form solution, β̂ = (XᵀX)⁻¹Xᵀy, computed via QR decomposition. Also known as ordinary least squares.
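To make the closed-form solution concrete, here is a minimal NumPy sketch (illustrative only, not this library’s implementation) that fits a simple regression via QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)  # true intercept 2, slope 3

X = np.column_stack([np.ones(n), x])  # design matrix: intercept column + predictor

# QR route: factor X = QR, then solve the triangular system R @ beta = Q.T @ y.
Q, R = np.linalg.qr(X)
beta = np.linalg.solve(R, Q.T @ y)  # estimates close to [2.0, 3.0]
```

Solving the triangular system avoids forming XᵀX explicitly, which would square the condition number of the problem.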
- IRLS
- Algorithm for fitting GLMs by solving a sequence of weighted least squares problems. At each iteration, constructs a “working response” that linearizes the relationship around current estimates. Also known as iteratively reweighted least squares.
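The working-response idea can be sketched for logistic regression in a few lines of NumPy (a didactic sketch under simulated data, not this library’s fitting code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))  # true coefficients: -0.5, 1.2
y = rng.binomial(1, p_true)

beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta                # linear predictor
    mu = 1 / (1 + np.exp(-eta))   # inverse logit: current fitted probabilities
    w = mu * (1 - mu)             # IRLS weights
    z = eta + (y - mu) / w        # working response: linearization around eta
    WX = X * w[:, None]
    beta_next = np.linalg.solve(X.T @ WX, WX.T @ z)  # weighted least squares step
    if np.max(np.abs(beta_next - beta)) < 1e-10:
        beta = beta_next
        break
    beta = beta_next
```

At convergence the score equations Xᵀ(y − μ) = 0 hold, which is the maximum-likelihood condition.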
- PIRLS
- Extension of IRLS for GLMMs that combines IRLS (for the link function) with penalized least squares (for varying effects). The penalty term induces shrinkage. Also known as penalized iteratively reweighted least squares.
- REML
- Default estimation method for mixed models. Reduces bias in variance component estimates by integrating out the population parameters before maximizing the likelihood. Also known as restricted maximum likelihood.
- ML
- Alternative to REML for mixed models. Required when comparing models with different population-level predictors via likelihood ratio test. Use fit(method="ML"). Also known as maximum likelihood.
- Laplace approximation
- Method for approximating the marginal likelihood in GLMMs when exact integration over varying effects is intractable. Expands the integrand around its mode.
- QR decomposition
- Matrix factorization used for numerically stable OLS computation. Avoids explicit matrix inversion.
- SVD
- Matrix decomposition that handles rank deficiency gracefully via the pseudoinverse. More robust than QR decomposition for ill-conditioned problems. Also known as singular value decomposition.
Link Functions & Families
- link function
- Function that maps expected values (which may be bounded) to the linear predictor scale (unbounded). Examples: logit link, log link, identity link.
- family
- The assumed distribution of the response variable. Common families: gaussian (continuous), binomial (binary), poisson (counts). Each has a default link function.
- identity link
- Link function with no transformation. Default for gaussian family. Coefficients directly represent changes in the outcome.
- logit link
- Link function mapping probability to log-odds. Default for binomial family. Coefficients represent changes in log-odds.
- log link
- Link function mapping positive values to the real line. Default for poisson family. Exponentiated coefficients give rate ratios.
- gaussian
- Family for continuous outcomes assuming normal errors. Uses identity link by default. Equivalent to standard linear model.
- binomial
- Family for binary outcomes (0/1, success/failure). Uses logit link by default. Models the probability of success.
- poisson
- Family for count data. Uses log link by default. Assumes variance equals mean (equidispersion).
- odds
- The ratio p/(1 − p)—probability of an event divided by probability of non-event. An odds of 2 means the event is twice as likely as not.
- log-odds
- Logarithm of the odds. The scale on which logistic regression coefficients operate. Also called the logit.
- odds ratio
- Exponentiated coefficient in logistic regression: OR = exp(β). Represents multiplicative change in odds for a one-unit increase in the predictor.
- rate ratio
- Exponentiated coefficient in Poisson regression: RR = exp(β). Represents multiplicative change in expected count for a one-unit increase in the predictor.
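Converting coefficients to these ratio scales is plain exponentiation; the coefficient values below are made up for illustration:

```python
import numpy as np

beta_logit = 0.693   # hypothetical logistic-regression coefficient
odds_ratio = np.exp(beta_logit)   # about 2.0: each unit of x doubles the odds

beta_log = -0.105    # hypothetical Poisson-regression coefficient
rate_ratio = np.exp(beta_log)     # about 0.90: each unit of x lowers the rate ~10%
```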
- dispersion parameter
- Scale parameter in GLM variance. Equals 1 for binomial and poisson; estimated for gaussian. Values > 1 indicate overdispersion.
Inference & Uncertainty
- standard error
- Estimated standard deviation of a parameter estimate. Quantifies uncertainty: smaller SE means more precise estimate. Computed from the variance-covariance matrix. Also abbreviated SE.
- confidence interval
- A range constructed by a procedure that, across repeated sampling, captures the true parameter value at the nominal rate (typically 95%). Any single interval either contains the true value or does not—but in practice, with well-behaved likelihoods, CIs are approximately equivalent to Bayesian credible intervals under flat priors. Also abbreviated CI.
- p-value
- Probability of observing data as or more extreme than what was observed, assuming the null hypothesis is true. Measures compatibility of the data with the null model—smaller values indicate greater incompatibility. Not the probability that the null is true, and not proof that the alternative is true. Best interpreted as continuous evidence rather than against a fixed threshold.
- degrees of freedom
- Number of independent pieces of information remaining after estimating model parameters. In simple regression: df = n − p (observations minus coefficients). Each parameter the model estimates “uses up” one degree of freedom, leaving fewer for estimating uncertainty. Determines the reference distribution (t or F) for inference. In mixed models: approximated via the Satterthwaite approximation because the effective dimensionality depends on the varying effects structure. Also abbreviated df.
- t-statistic
- Test statistic comparing a coefficient to zero. Under the null, follows a t-distribution with appropriate degrees of freedom.
- z-statistic
- Test statistic used when degrees of freedom are effectively infinite (large samples, GLMMs). Standard normal distribution under the null.
- Wald test
- Classical asymptotic test using the ratio of estimate to standard error. The basis for p-values in regression output.
- likelihood ratio test
- Comparison of nested models based on deviance difference. Tests whether additional parameters significantly improve fit. Requires ML estimation for mixed models. Also abbreviated LRT.
- bootstrap inference
- Non-parametric inference via resampling with replacement from observed data. Provides confidence intervals without distributional assumptions.
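A minimal NumPy sketch of a percentile bootstrap for a regression slope (illustrative; production implementations add refinements like bias correction):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

def slope(Xs, ys):
    return np.linalg.lstsq(Xs, ys, rcond=None)[0][1]

# Resample rows with replacement, refit, and collect the slope each time.
boot = np.empty(2000)
for b in range(2000):
    idx = rng.integers(0, n, size=n)
    boot[b] = slope(X[idx], y[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])  # percentile 95% confidence interval
```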
- permutation test
- Exact hypothesis test constructed by shuffling the outcome while keeping predictors fixed. Tests the null of “no relationship” without distributional assumptions.
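The shuffling logic takes only a few lines of NumPy (a sketch using correlation as the test statistic; any statistic works):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)   # a real x-y relationship

def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

observed = abs_corr(x, y)

# Shuffling y severs any link with x, generating the null distribution
# of "no relationship" without distributional assumptions.
perm = np.array([abs_corr(x, rng.permutation(y)) for _ in range(5000)])
p_value = (1 + np.sum(perm >= observed)) / (1 + len(perm))
```

The +1 terms make the p-value valid for finite permutation counts (the observed arrangement counts as one permutation).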
- Satterthwaite approximation
- Method for estimating effective degrees of freedom in mixed models by matching moments of the sampling distribution. Often yields fractional df (e.g., 17.3).
- Fisher information
- Matrix whose inverse gives the variance-covariance of parameter estimates. In GLMs: I(β) = XᵀWX, where W is a diagonal matrix of IRLS weights.
Robust Methods
- sandwich estimator
- Heteroscedasticity-consistent variance estimator: (XᵀX)⁻¹XᵀΩ̂X(XᵀX)⁻¹ with Ω̂ = diag(eᵢ²). Valid standard errors even when the error variance is non-constant.
- HC0
- White’s original sandwich estimator. Appropriate for large samples.
- HC1
- Sandwich estimator with degrees-of-freedom correction. Better for medium samples than HC0.
- HC2
- Sandwich estimator with leverage adjustment. Appropriate when some observations have high influence.
- HC3
- Sandwich estimator with squared leverage adjustment. Most conservative; recommended for small samples. Default when using errors="hetero".
- heteroscedasticity
- Non-constant variance of errors across observations. Violates classical OLS assumptions, leading to incorrect standard errors. Address with sandwich estimators or weighted regression.
- homoscedasticity
- Constant variance of errors: Var(εᵢ) = σ² for all i. A key assumption of classical OLS inference.
- Welch-Satterthwaite
- Degrees-of-freedom adjustment for comparing groups with unequal variances. Generalizes Welch’s t-test to regression. Use errors="unequal_var".
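The HC family differs only in how the squared residuals entering the "meat" are weighted. A NumPy sketch of HC0 versus HC3 on simulated heteroscedastic data (illustrative, not this library’s code):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n) * (0.5 + np.abs(x))  # variance grows with |x|
X = np.column_stack([np.ones(n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
h = np.sum((X @ XtX_inv) * X, axis=1)   # leverage values (hat-matrix diagonal)

def sandwich(w):
    """(X'X)^-1 X' diag(w) X (X'X)^-1 -- bread, meat, bread."""
    return XtX_inv @ (X.T @ (X * w[:, None])) @ XtX_inv

se_hc0 = np.sqrt(np.diag(sandwich(e**2)))                # White's original
se_hc3 = np.sqrt(np.diag(sandwich(e**2 / (1 - h)**2)))   # squared-leverage adjustment
```

Since (1 − hᵢᵢ)² < 1, HC3 inflates each residual’s contribution, which is why it is the more conservative choice.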
Marginal Effects & Contrasts
- marginal effects estimation
- Method for translating coefficients into interpretable effects at meaningful reference points. Essential for GLMs, where coefficients are on the link function scale. Access via .explore(). Also abbreviated MEE.
- estimated marginal means
- Predicted values at each factor level, averaging appropriately over other predictors. Also called least-squares means. Obtained via m.explore("factor"). Also abbreviated EMM.
- average marginal effect
- The effect of a predictor averaged across all observations. For continuous predictors, represents the population-average slope. Obtained via m.explore("continuous_var"). Also abbreviated AME.
- conditional effect
- Effect of a predictor at specific levels of another predictor. Uses | syntax: m.explore("x | group") gives the slope of x separately for each group level.
- contrasts
- Comparisons between factor levels. Pairwise contrasts compare all pairs; custom contrasts test specific hypotheses. Use contrasts="pairwise" in .explore().
- treatment coding
- Default contrast scheme where coefficients represent differences from a reference level (first alphabetically). The intercept is the mean of the reference group.
- sum coding
- Contrast scheme where coefficients represent deviations from the grand mean. Coefficients sum to zero across levels. Set via sum(factor) in the formula.
- population-averaged effects
- In mixed models, effects that average over the varying effects distribution. What .explore() returns by default. Contrasts with subject-specific or conditional effects.
Model Comparison & Fit
- PRE
- Fraction of compact model error eliminated by the augmented model: PRE = (SSE_compact − SSE_augmented) / SSE_compact. Equivalent to R-squared for nested models. Also known as proportional reduction in error.
- R-squared
- Proportion of variance in the outcome explained by the model: R² = 1 − SS_residual / SS_total. For OLS with an intercept, ranges from 0 to 1. Can be negative for intercept-free models or out-of-sample predictions. For GLMs, see pseudo-R-squared. Also known as coefficient of determination.
- AIC
- Model fit measure penalizing complexity: AIC = −2 log L + 2k, where k is the number of parameters. Lower is better. Prefer for prediction-focused comparisons. Also known as Akaike information criterion.
- BIC
- Model fit measure with stronger complexity penalty: BIC = −2 log L + k log n. Lower is better. More conservative than AIC; prefers simpler models. Also known as Bayesian information criterion.
- deviance
- Measure of model fit: twice the difference in log-likelihood between fitted model and saturated model. Lower deviance indicates better fit. Used in likelihood ratio tests.
- compact model
- The simpler model in a comparison—fewer predictors or constraints. The null hypothesis assumes the compact model is adequate.
- augmented model
- The more complex model in a comparison—additional predictors over the compact model. The alternative hypothesis that extra parameters improve fit.
- saturated model
- A model with as many parameters as observations—perfect fit, zero deviance. The reference point for measuring model fit.
- pseudo-R-squared
- Analog of R-squared for GLMs where the classical RSS/TSS formula does not apply. Typically computed as 1 − D_model / D_null, where D_model is the model deviance and D_null is the null deviance. Not directly comparable to OLS R-squared; interpret with caution.
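To make the fit measures concrete, here is a NumPy sketch computing R², AIC, and BIC for a simple Gaussian fit, assuming the usual −2 log L + penalty definitions (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 120
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss

k = X.shape[1] + 1  # two coefficients plus the error variance
loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)  # Gaussian log-likelihood at the MLE
aic = -2 * loglik + 2 * k
bic = -2 * loglik + k * np.log(n)  # log(120) > 2, so BIC penalizes complexity harder
```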
Mixed Model Specific
- BLUPs
- Shrinkage estimates of varying effects for each group. Groups with less data are pulled more strongly toward the population mean. Strictly defined for LMMs; in GLMMs these are conditional modes of the varying effects. Also known as best linear unbiased predictors.
- shrinkage
- Statistical property where extreme or data-sparse estimates are pulled toward a central value. In mixed models, group-specific estimates shrink toward the population mean—a weighted compromise between each group’s data and the overall pattern. Statistically optimal (same principle as James-Stein estimation). Also called partial pooling.
- partial pooling
- Estimation strategy where group-level estimates are a weighted average between each group’s own data and the overall population estimate. Groups with more data lean toward their own estimate; groups with less data lean toward the population mean. Synonym for shrinkage in mixed models.
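The weighting can be made concrete with a toy calculation. For a varying-intercept model the shrinkage weight for group j is wⱼ = nⱼ / (nⱼ + σₑ²/σᵤ²); the variance values below are assumed purely for illustration:

```python
# Partial-pooling weight: w_j = n_j / (n_j + sigma_e^2 / sigma_u^2).
sigma_e2 = 4.0      # residual (within-group) variance -- assumed value
sigma_u2 = 1.0      # between-group variance -- assumed value
grand_mean = 10.0
group_mean = 14.0   # both groups happen to observe the same raw mean

pooled = {}
for n_j in (2, 50):
    w = n_j / (n_j + sigma_e2 / sigma_u2)
    pooled[n_j] = w * group_mean + (1 - w) * grand_mean
# The n=2 group is pulled strongly toward 10; the n=50 group stays near 14.
```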
- ICC
- Proportion of total variance that is between groups: ICC = σ²_between / (σ²_between + σ²_within). High ICC means observations within groups are strongly correlated. Also known as intraclass correlation.
- marginal R-squared
- In mixed models, variance explained by population parameters alone. Contrasts with conditional R-squared.
- conditional R-squared
- In mixed models, variance explained by population-level plus varying effects. A large gap between marginal and conditional indicates varying effects explain substantial variance.
- crossed varying effects
- Varying effects structure where grouping factors are not nested—every combination can occur. Example: subjects responding to items, with (1|subject) + (1|item). Also called crossed random effects.
- nested varying effects
- Varying effects structure where groups are hierarchical—lower-level units belong to exactly one higher-level unit. Example: students within schools. Also called nested random effects.
- varying intercept
- Varying effect allowing each group’s baseline (intercept) to differ from the population average. Specified as (1 | group) in the formula. Also called random intercept.
- varying slope
- Varying effect allowing each group’s effect of a predictor to differ from the population average. Specified as (1 + predictor | group) in the formula. Also called random slope.
- attenuation
- In GLMMs, population-averaged effects are smaller in magnitude than conditional (subject-specific) effects due to Jensen’s inequality applied through the nonlinear link function. Greater varying effects variance produces stronger attenuation. Not present in LMMs (identity link is linear).
Transforms & Parameterization
- center
- Transform subtracting the mean: x − x̄. Makes the intercept interpretable as the prediction at the mean of the predictor. Use center(x) in formulas.
- zscore
- Traditional standardization: (x − x̄)/s. Coefficients represent change per one standard deviation. Use zscore(x) in formulas.
- scale
- Gelman scaling for comparability with binary predictors: (x − x̄)/(2s). A one-unit change spans the typical range, making effect sizes comparable across predictor types. Use scale(x) in formulas.
- norm
- Scaling without centering: x/s. Useful when zero is meaningful but you want standardized units. Use norm(x) in formulas.
- interaction
- Model term allowing the effect of one predictor to depend on another. Specified with * (main effects + interaction) or : (interaction only) in formulas. Creates product columns in the design matrix.
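The four scaling transforms above differ only in whether they subtract the mean and what they divide by. A NumPy sketch mirroring the formulas (not the formula helpers’ actual implementation):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
m, s = x.mean(), x.std(ddof=1)

centered = x - m            # center: intercept becomes the prediction at the mean
z = (x - m) / s             # zscore: one unit = one standard deviation
gelman = (x - m) / (2 * s)  # scale:  one unit spans roughly the typical range
normed = x / s              # norm:   standardized units, zero preserved
```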
Model Assumptions
- iid
- Independent and identically distributed—errors are both uncorrelated (independence) and share the same variance (homoscedasticity). The classical OLS assumption. Use errors="iid" for standard inference.
- independence
- Assumption that errors are uncorrelated across observations. Violated by clustered data, repeated measures, or time series. Mixed models (with varying effects) address clustering; time-series methods address autocorrelation.
- normality assumption
- Assumption that errors follow a normal distribution. Required only for classical inference (p-values, confidence intervals) in small samples. OLS estimates remain unbiased regardless.
- linearity in parameters
- Requirement that coefficients enter the model linearly. The model y = β₀ + β₁x + β₂x² + ε is linear in parameters (despite the x² term); the model y = β₀e^{β₁x} + ε is not.
- multicollinearity
- Correlation among predictors causing numerical instability, inflated standard errors, and unstable coefficient estimates. Detected via high condition number or variance inflation factors.
- condition number
- Measure of numerical stability of the design matrix. High values indicate ill-conditioning—small data changes produce large coefficient changes. Centering predictors often improves conditioning.
- rank deficiency
- When the design matrix has linearly dependent columns (perfect collinearity). Makes XᵀX singular. SVD can handle this via the pseudoinverse.
- leverage
- Measure of how unusual an observation’s predictor values are. High-leverage points have outsized influence on the fitted line. Diagonal elements of the hat matrix: hᵢᵢ.
- hat matrix
- The projection matrix that maps observed responses to fitted values: ŷ = Hy, with H = X(XᵀX)⁻¹Xᵀ. Diagonal elements are leverage values; trace(H) = p (the number of parameters). Named because it “puts a hat on y.”
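Both properties are easy to verify numerically; a NumPy sketch with one deliberately extreme predictor value (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
x = rng.normal(size=n)
x[0] = 6.0                               # one far-out predictor value: high leverage
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
leverage = np.diag(H)
# trace(H) equals the number of parameters (2 here), and the extreme
# observation carries far more leverage than a typical one.
```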
Foundational Concepts
- model-based thinking
- Approach to statistics where every analysis begins by specifying a model of the data-generating process. Questions like “is there an effect?” become “does a model with this parameter improve predictions compared to a model without it?” This replaces memorized test procedures with a unified framework: specify, fit, compare, check.
- aggregation
- Statistical compression—discarding individual variation to reveal pattern. The mean discards individual data points; regression discards non-linear relationships. The art is knowing which differences to preserve.
- sampling
- Logic connecting observed data to a broader population. Enables generalization from a small sample to a larger group of interest. Requires being explicit about what population your sample represents.
- overfitting
- When a model captures noise along with signal, failing to generalize to new data. Complex models with many parameters are prone to overfitting. Address via cross-validation or regularization.
- bias-variance tradeoff
- Tension between model simplicity and complexity. Simpler models may miss real patterns (high bias, low variance); more complex models may capture noise (low bias, high variance). Generalization performance depends on finding the right balance for the available data. cross-validation helps navigate this tradeoff.
- cross-validation
- Evaluating model performance on held-out data to assess generalization. K-fold CV partitions data into K folds, training on K-1 and testing on the remaining fold, rotating through all folds.
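A NumPy sketch of 5-fold cross-validation for a linear model (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(8)
n, K = 100, 5
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

folds = np.array_split(rng.permutation(n), K)  # random partition into K folds
mse = []
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]  # fit on K-1 folds
    mse.append(np.mean((y[test] - X[test] @ beta) ** 2))       # score the held-out fold

cv_mse = float(np.mean(mse))  # estimate of out-of-sample error
```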
- Type I error
- False positive—rejecting the null hypothesis when it’s actually true. The significance level (typically 0.05) controls the Type I error rate.
- Type II error
- False negative—failing to reject the null hypothesis when it’s actually false. Statistical power equals 1 − the Type II error rate.
- statistical power
- Probability of correctly rejecting a false null hypothesis: power = 1 − Type II error rate. Depends on effect size, sample size, and significance level. Estimate via simulation with .simulate().
- effect size
- Magnitude of an effect, independent of sample size. Common measures: R-squared, Cohen’s d, odds ratio. Important because statistical significance doesn’t imply practical importance.
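Power-by-simulation is simple to hand-roll; a NumPy sketch of a Monte Carlo estimate for a two-group comparison (sample size and effect size below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)

def significant(n=40, effect=0.5, crit=1.96):
    """One simulated two-group study, analyzed as regression on a 0/1 indicator."""
    g = np.repeat([0.0, 1.0], n // 2)
    y = effect * g + rng.normal(size=n)
    X = np.column_stack([np.ones(n), g])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return abs(beta[1] / se) > crit   # normal approximation to the t cutoff

power = np.mean([significant() for _ in range(1000)])  # roughly 0.3-0.4 here
```

Power is the fraction of simulated studies that reach significance; increase `n` or `effect` in the call to watch it rise.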
See Also
Statistical Thinking Fundamentals — Core concepts in depth
Tests as Linear Models — Classical tests unified
Linear Models Theory — Mathematical foundations