bossanova.expressions

UCSD Psychology

Polars expression helpers for common statistical transforms.

This module provides stateless Polars expressions for data transformations commonly used in statistical modeling. These match the semantics of the formula-based transforms in formula.

Functions:

| Name | Description |
| --- | --- |
| center | Mean-centering: x - mean(x). |
| lag | Lag (shift backward) values by n positions. |
| lead | Lead (shift forward) values by n positions. |
| norm | Normalize magnitude: x / std(x). |
| rank | Rank values using the average method (matches R's default rank()). |
| scale | Gelman scaling: (x - mean(x)) / (2 * std(x)). |
| to_expr | Convert string column name to Polars expression. |
| winsorize | Winsorize values by capping at percentiles. |
| zscore | Z-score standardization: (x - mean(x)) / std(x). |

The scale() transform follows the recommendation of Gelman (2008) to divide by two standard deviations, which makes coefficients for continuous predictors directly comparable to coefficients for binary predictors in regression models.

Note: These are stateless expressions. For zero-variance columns, norm(), zscore(), and scale() divide by a standard deviation of zero and return inf or NaN. Handle these values in your pipeline if needed.
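
One way to handle the zero-variance case is to check the standard deviation before dividing. The sketch below uses plain Python lists and the standard library rather than Polars, so it only illustrates the logic; `safe_zscore` is a hypothetical helper and is not part of bossanova.

```python
import statistics


def safe_zscore(values):
    """Z-score a list, returning None for every entry when the spread is zero.

    Hypothetical helper for illustration; not part of bossanova.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (ddof=1)
    if sd == 0:
        # A constant column has no spread, so a z-score is undefined.
        return [None] * len(values)
    return [(v - mean) / sd for v in values]


constant = safe_zscore([4.0, 4.0, 4.0])  # zero variance: all None
varying = safe_zscore([1.0, 2.0, 3.0])   # ordinary column: finite z-scores
```

Returning a missing value instead of inf or NaN is a design choice; dropping the column or leaving it unscaled may suit some pipelines better.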

Examples:

import polars as pl
from bossanova.expressions import center, zscore

df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df.with_columns(
    center("x").alias("x_centered"),
    zscore("x").alias("x_zscore"),
)
# shape: (5, 3)
# ┌─────┬────────────┬──────────┐
# │ x   ┆ x_centered ┆ x_zscore │
# │ --- ┆ ---        ┆ ---      │
# │ f64 ┆ f64        ┆ f64      │
# ╞═════╪════════════╪══════════╡
# │ 1.0 ┆ -2.0       ┆ -1.26... │
# │ 2.0 ┆ -1.0       ┆ -0.63... │
# │ 3.0 ┆ 0.0        ┆ 0.0      │
# │ 4.0 ┆ 1.0        ┆ 0.63...  │
# │ 5.0 ┆ 2.0        ┆ 1.26...  │
# └─────┴────────────┴──────────┘

Functions

center

center(col: str | pl.Expr) -> pl.Expr

Mean-centering: x - mean(x).

Subtracts the column mean from each value. Useful for interpretable intercepts in regression models.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str \| Expr | Column name or Polars expression. | required |

Returns:

| Type | Description |
| --- | --- |
| Expr | Polars expression computing centered values. |

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [10.0, 20.0, 30.0]})
df.with_columns(expr.center("x").alias("x_centered"))
# shape: (3, 2)
# ┌──────┬────────────┐
# │ x    ┆ x_centered │
# │ ---  ┆ ---        │
# │ f64  ┆ f64        │
# ╞══════╪════════════╡
# │ 10.0 ┆ -10.0      │
# │ 20.0 ┆ 0.0        │
# │ 30.0 ┆ 10.0       │
# └──────┴────────────┘
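
To see why centering gives a more interpretable intercept, the following sketch fits a simple least-squares line before and after centering. It uses only the standard library; the `ols` helper and the `y` values are made up for illustration and are not part of bossanova.

```python
import statistics


def ols(x, y):
    """Return (intercept, slope) of a simple least-squares line."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


x = [10.0, 20.0, 30.0]
y = [5.0, 9.0, 10.0]  # made-up responses for illustration
x_mean = statistics.fmean(x)
x_centered = [v - x_mean for v in x]

raw_intercept, raw_slope = ols(x, y)
centered_intercept, centered_slope = ols(x_centered, y)
```

The slope is the same in both fits. The raw intercept is the prediction at x = 0, which lies outside the data, while the centered intercept equals the mean of y, the prediction at the average x.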

lag

lag(col: str | pl.Expr, n: int = 1) -> pl.Expr

Lag (shift backward) values by n positions.

Shifts values down, inserting nulls at the beginning. Useful for time series analysis and creating lagged predictors.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str \| Expr | Column name or Polars expression. | required |
| n | int | Number of positions to shift. Default 1. | 1 |

Returns:

| Type | Description |
| --- | --- |
| Expr | Polars expression computing lagged values. |

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
df.with_columns(expr.lag("x", 2).alias("x_lag2"))
# shape: (5, 2)
# ┌─────┬────────┐
# │ x   ┆ x_lag2 │
# │ --- ┆ ---    │
# │ i64 ┆ i64    │
# ╞═════╪════════╡
# │ 1   ┆ null   │
# │ 2   ┆ null   │
# │ 3   ┆ 1      │
# │ 4   ┆ 2      │
# │ 5   ┆ 3      │
# └─────┴────────┘

lead

lead(col: str | pl.Expr, n: int = 1) -> pl.Expr

Lead (shift forward) values by n positions.

Shifts values up, inserting nulls at the end. Useful for time series analysis and creating lead predictors.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str \| Expr | Column name or Polars expression. | required |
| n | int | Number of positions to shift. Default 1. | 1 |

Returns:

| Type | Description |
| --- | --- |
| Expr | Polars expression computing lead values. |

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
df.with_columns(expr.lead("x", 2).alias("x_lead2"))
# shape: (5, 2)
# ┌─────┬─────────┐
# │ x   ┆ x_lead2 │
# │ --- ┆ ---     │
# │ i64 ┆ i64     │
# ╞═════╪═════════╡
# │ 1   ┆ 3       │
# │ 2   ┆ 4       │
# │ 3   ┆ 5       │
# │ 4   ┆ null    │
# │ 5   ┆ null    │
# └─────┴─────────┘
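
The lag and lead shifts are mirror images of each other. A minimal sketch on plain Python lists, with None standing in for null, makes the direction of each shift explicit; `shift` is a hypothetical helper, not the library's implementation.

```python
def shift(values, n):
    """Shift a list by n positions, padding with None (standing in for null).

    Positive n lags (values move toward later rows); negative n leads.
    Hypothetical helper for illustration; not part of bossanova.
    """
    if n == 0:
        return list(values)
    size = len(values)
    pad = [None] * min(abs(n), size)
    if n > 0:
        # Drop the last n values and pad at the start.
        return pad + list(values[: max(size - n, 0)])
    # Drop the first |n| values and pad at the end.
    return list(values[-n:]) + pad


x = [1, 2, 3, 4, 5]
x_lag2 = shift(x, 2)    # same values as the x_lag2 column in the lag example
x_lead2 = shift(x, -2)  # same values as the x_lead2 column in the lead example
```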

norm

norm(col: str | pl.Expr) -> pl.Expr

Normalize magnitude: x / std(x).

Divides by the standard deviation without centering. Useful when you want unit variance without shifting values: zero stays at zero, and the mean is rescaled rather than removed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str \| Expr | Column name or Polars expression. | required |

Returns:

| Type | Description |
| --- | --- |
| Expr | Polars expression computing normalized values. |

Note: Returns inf for zero-variance columns. Uses sample std (ddof=1).

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [0.0, 10.0, 20.0]})
df.with_columns(expr.norm("x").alias("x_norm"))
# shape: (3, 2)
# ┌──────┬────────┐
# │ x    ┆ x_norm │
# │ ---  ┆ ---    │
# │ f64  ┆ f64    │
# ╞══════╪════════╡
# │ 0.0  ┆ 0.0    │
# │ 10.0 ┆ 1.0    │
# │ 20.0 ┆ 2.0    │
# └──────┴────────┘
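
The arithmetic behind this example can be checked with the standard library. The sketch below is a plain-Python illustration of x / std(x), not the library's implementation; it shows that the result has unit sample standard deviation while the mean is divided by the same factor.

```python
import statistics

x = [0.0, 10.0, 20.0]
sd = statistics.stdev(x)  # sample std (ddof=1); 10.0 for this data
x_norm = [v / sd for v in x]

mean_before = statistics.fmean(x)      # 10.0
mean_after = statistics.fmean(x_norm)  # mean divided by sd, not removed
sd_after = statistics.stdev(x_norm)    # unit sample standard deviation
```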

rank

rank(col: str | pl.Expr, method: str = 'average') -> pl.Expr

Rank values using the average method (matches R’s default rank()).

Assigns ranks to values, with tied values receiving the average of their ranks. Useful for rank-based regression or non-parametric transforms.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str \| Expr | Column name or Polars expression. | required |
| method | str | Ranking method. One of "average", "min", "max", "dense", "ordinal". The default "average" matches R's rank(). | 'average' |

Returns:

| Type | Description |
| --- | --- |
| Expr | Polars expression computing ranked values. |

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [3.0, 1.0, 2.0, 2.0, 5.0]})
df.with_columns(expr.rank("x").alias("x_rank"))
# shape: (5, 2)
# ┌─────┬────────┐
# │ x   ┆ x_rank │
# │ --- ┆ ---    │
# │ f64 ┆ f64    │
# ╞═════╪════════╡
# │ 3.0 ┆ 4.0    │
# │ 1.0 ┆ 1.0    │
# │ 2.0 ┆ 2.5    │
# │ 2.0 ┆ 2.5    │
# │ 5.0 ┆ 5.0    │
# └─────┴────────┘
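
The average method can be spelled out on a plain list: sort the values, find each run of ties, and give every member of the run the mean of the ranks the run occupies. The sketch below is a hypothetical helper for illustration, not the code bossanova or Polars uses.

```python
def average_rank(values):
    """Rank values from 1, giving tied values the mean of the ranks they span.

    Hypothetical helper for illustration; not part of bossanova.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of values tied with the one at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end hold ranks start+1..end+1; average them.
        shared = (start + 1 + end + 1) / 2
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


x_rank = average_rank([3.0, 1.0, 2.0, 2.0, 5.0])
```

The two tied values of 2.0 occupy ranks 2 and 3, so each receives 2.5, as in the x_rank column above.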

scale

scale(col: str | pl.Expr) -> pl.Expr

Gelman scaling: (x - mean(x)) / (2 * std(x)).

Standardization dividing by 2 standard deviations, following Gelman (2008). This makes continuous predictor coefficients directly comparable to binary predictor coefficients (which span their full range of 0 to 1).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str \| Expr | Column name or Polars expression. | required |

Returns:

| Type | Description |
| --- | --- |
| Expr | Polars expression computing Gelman-scaled values. |

Note: Returns NaN for zero-variance columns. Uses sample std (ddof=1).

Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine, 27(15), 2865-2873.

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df.with_columns(expr.scale("x").alias("x_scaled"))
# shape: (5, 2)
# ┌─────┬───────────┐
# │ x   ┆ x_scaled  │
# │ --- ┆ ---       │
# │ f64 ┆ f64       │
# ╞═════╪═══════════╡
# │ 1.0 ┆ -0.632456 │
# │ 2.0 ┆ -0.316228 │
# │ 3.0 ┆ 0.0       │
# │ 4.0 ┆ 0.316228  │
# │ 5.0 ┆ 0.632456  │
# └─────┴───────────┘
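
The reasoning behind dividing by two standard deviations can be checked numerically. A balanced 0/1 predictor has a population standard deviation of 0.5, so two standard deviations equal its full 0-to-1 step. The sketch below uses only the standard library and illustrates the formula; it is not the library's implementation, and the `binary` data are made up.

```python
import statistics

# A balanced binary predictor has population standard deviation 0.5,
# so two standard deviations equal 1: the size of its 0-to-1 step.
binary = [0, 1, 0, 1]  # made-up data for illustration
two_sd_binary = 2 * statistics.pstdev(binary)

# Gelman scaling of the example column, using the sample std (ddof=1)
# as the note in this section states.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
mean = statistics.fmean(x)
sd = statistics.stdev(x)
x_scaled = [(v - mean) / (2 * sd) for v in x]

# After scaling, the column's sample standard deviation is 0.5, so a
# one-unit change in x_scaled corresponds to two standard deviations of x.
sd_scaled = statistics.stdev(x_scaled)
```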

to_expr

to_expr(col: str | pl.Expr) -> pl.Expr

Convert string column name to Polars expression.

winsorize

winsorize(col: str | pl.Expr, lower: float = 0.01, upper: float = 0.99) -> pl.Expr

Winsorize values by capping at percentiles.

Clips extreme values to specified percentile bounds. Useful for reducing the influence of outliers without removing observations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str \| Expr | Column name or Polars expression. | required |
| lower | float | Lower percentile bound (0-1). Default 0.01 (1st percentile). | 0.01 |
| upper | float | Upper percentile bound (0-1). Default 0.99 (99th percentile). | 0.99 |

Returns:

| Type | Description |
| --- | --- |
| Expr | Polars expression computing winsorized values. |

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
df.with_columns(expr.winsorize("x", lower=0.1, upper=0.9).alias("x_wins"))
# shape: (5, 2)
# ┌───────┬────────┐
# │ x     ┆ x_wins │
# │ ---   ┆ ---    │
# │ f64   ┆ f64    │
# ╞═══════╪════════╡
# │ 1.0   ┆ 1.4    │
# │ 2.0   ┆ 2.0    │
# │ 3.0   ┆ 3.0    │
# │ 4.0   ┆ 4.0    │
# │ 100.0 ┆ 61.6   │
# └───────┴────────┘
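
The bounds 1.4 and 61.6 in this output are consistent with quantiles computed by linear interpolation between neighbouring sorted values. Assuming that rule, the following standard-library sketch reproduces the arithmetic; `linear_quantile` and `winsorize_sketch` are hypothetical names, and the interpolation rule is inferred from the example rather than confirmed by the source.

```python
def linear_quantile(sorted_values, q):
    """Quantile by linear interpolation between neighbouring order statistics."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def winsorize_sketch(values, lower=0.01, upper=0.99):
    """Clip values to the [lower, upper] quantile bounds (illustration only)."""
    ordered = sorted(values)
    lo = linear_quantile(ordered, lower)
    hi = linear_quantile(ordered, upper)
    return [min(max(v, lo), hi) for v in values]


x = [1.0, 2.0, 3.0, 4.0, 100.0]
lo_bound = linear_quantile(sorted(x), 0.1)  # 1 + 0.4 * (2 - 1)
hi_bound = linear_quantile(sorted(x), 0.9)  # 4 + 0.6 * (100 - 4)
x_wins = winsorize_sketch(x, lower=0.1, upper=0.9)
```

Only the two extreme values change: 1.0 is raised to the lower bound and 100.0 is lowered to the upper bound, while the middle values pass through untouched.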

zscore

zscore(col: str | pl.Expr) -> pl.Expr

Z-score standardization: (x - mean(x)) / std(x).

Traditional standardization producing values with mean=0 and std=1. Coefficients represent the effect of a 1 standard deviation change.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str \| Expr | Column name or Polars expression. | required |

Returns:

| Type | Description |
| --- | --- |
| Expr | Polars expression computing z-scored values. |

Note: Returns NaN for zero-variance columns. Uses sample std (ddof=1).

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df.with_columns(expr.zscore("x").alias("x_z"))
# shape: (5, 2)
# ┌─────┬───────────┐
# │ x   ┆ x_z       │
# │ --- ┆ ---       │
# │ f64 ┆ f64       │
# ╞═════╪═══════════╡
# │ 1.0 ┆ -1.264911 │
# │ 2.0 ┆ -0.632456 │
# │ 3.0 ┆ 0.0       │
# │ 4.0 ┆ 0.632456  │
# │ 5.0 ┆ 1.264911  │
# └─────┴───────────┘
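
Because zscore() and scale() differ only in the divisor, each z-score is exactly twice the corresponding Gelman-scaled value. The standard-library sketch below checks that relationship and the mean and spread of the z-scored column; it illustrates the formulas only and is not the library's implementation.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
mean = statistics.fmean(x)
sd = statistics.stdev(x)  # sample std (ddof=1)

x_z = [(v - mean) / sd for v in x]
x_scaled = [(v - mean) / (2 * sd) for v in x]

# Properties of the z-scored column: mean 0 and sample std 1.
z_mean = statistics.fmean(x_z)
z_sd = statistics.stdev(x_z)

# Each z-score is twice the corresponding Gelman-scaled value.
ratios_ok = all(abs(z - 2 * s) < 1e-12 for z, s in zip(x_z, x_scaled))
```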