bossanova.expressions

Polars expression helpers for common statistical transforms.

This module provides stateless Polars expressions for data transformations commonly used in statistical modeling. They match the semantics of the formula-based transforms in the formula module.

Functions:

Name	Description
`center`	Mean-centering: x - mean(x).
`lag`	Lag (shift backward) values by n positions.
`lead`	Lead (shift forward) values by n positions.
`norm`	Normalize magnitude: x / std(x).
`rank`	Rank values using the average method (matches R’s default rank()).
`scale`	Gelman scaling: (x - mean(x)) / (2 * std(x)).
`to_expr`	Convert string column name to Polars expression.
`winsorize`	Winsorize values by capping at percentiles.
`zscore`	Z-score standardization: (x - mean(x)) / std(x).

Core Transforms:

center(col) — subtract mean only
norm(col) — divide by std only
zscore(col) — traditional z-score (1 SD)
scale(col) — Gelman scaling (2 SD)

Additional Helpers:

rank(col, method) — average-method rank transform
winsorize(col, lower, upper) — percentile capping
lag(col, n) — lagged values
lead(col, n) — lead values

The scale() transform follows Gelman's (2008) recommendation to divide by two standard deviations, which makes coefficients for continuous predictors directly comparable to coefficients for binary predictors in regression models.
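
To make the four formulas concrete, the sketch below recomputes them with the Python standard library instead of Polars. It is an arithmetic check only, not bossanova's implementation, and it assumes the sample standard deviation (ddof=1) that the function notes on this page mention.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
mean = statistics.fmean(x)  # 3.0
sd = statistics.stdev(x)    # sample SD (ddof=1), about 1.581139

centered = [v - mean for v in x]             # center: x - mean(x)
normed = [v / sd for v in x]                 # norm: x / std(x)
z = [(v - mean) / sd for v in x]             # zscore: (x - mean(x)) / std(x)
scaled = [(v - mean) / (2 * sd) for v in x]  # scale: (x - mean(x)) / (2 * std(x))

print(centered)                       # [-2.0, -1.0, 0.0, 1.0, 2.0]
print([round(v, 6) for v in z])       # [-1.264911, -0.632456, 0.0, 0.632456, 1.264911]
print([round(v, 6) for v in scaled])  # [-0.632456, -0.316228, 0.0, 0.316228, 0.632456]
```

Every scale() value is exactly half the corresponding zscore() value, which is the only difference between the two transforms.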

Note: These are stateless expressions. For zero-variance columns, norm(), zscore(), and scale() will produce inf/NaN values. Handle these in your pipeline if needed.
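
The inf/NaN results follow from floating-point division: a nonzero value divided by a zero standard deviation gives inf, and 0 / 0 gives NaN. One way to handle this is to check the standard deviation before dividing. The sketch below shows that guard in plain Python; `safe_zscore` is a hypothetical helper for illustration, not part of bossanova. Plain Python raises ZeroDivisionError where Polars floats would yield inf/NaN, so only the guard logic carries over.

```python
import statistics


def safe_zscore(values):
    """Z-score a list, returning None for every row when the SD is zero.

    Hypothetical helper for illustration; it is not part of bossanova.
    """
    sd = statistics.stdev(values)  # sample SD (ddof=1)
    if sd == 0:
        # Zero variance: (x - mean) / sd would be 0 / 0, so flag it instead.
        return [None] * len(values)
    mean = statistics.fmean(values)
    return [(v - mean) / sd for v in values]


constant = safe_zscore([2.0, 2.0, 2.0])
varying = safe_zscore([1.0, 2.0, 3.0])
print(constant)  # [None, None, None]
print(varying)   # [-1.0, 0.0, 1.0]
```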

Examples:

import polars as pl
from bossanova.expressions import center, zscore

df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df.with_columns(
    center("x").alias("x_centered"),
    zscore("x").alias("x_zscore"),
)
# shape: (5, 3)
# ┌─────┬────────────┬──────────┐
# │ x   ┆ x_centered ┆ x_zscore │
# │ --- ┆ ---        ┆ ---      │
# │ f64 ┆ f64        ┆ f64      │
# ╞═════╪════════════╪══════════╡
# │ 1.0 ┆ -2.0       ┆ -1.26... │
# │ 2.0 ┆ -1.0       ┆ -0.63... │
# │ 3.0 ┆ 0.0        ┆ 0.0      │
# │ 4.0 ┆ 1.0        ┆ 0.63...  │
# │ 5.0 ┆ 2.0        ┆ 1.26...  │
# └─────┴────────────┴──────────┘

Functions¶

center¶

center(col: str | pl.Expr) -> pl.Expr

Mean-centering: x - mean(x).

Subtracts the column mean from each value. Useful for interpretable intercepts in regression models.

Parameters:

Name	Type	Description	Default
`col`	`str \| Expr`	Column name or Polars expression.	required

Returns:

Type	Description
`Expr`	Polars expression computing centered values.

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [10.0, 20.0, 30.0]})
df.with_columns(expr.center("x").alias("x_centered"))
# shape: (3, 2)
# ┌──────┬────────────┐
# │ x    ┆ x_centered │
# │ ---  ┆ ---        │
# │ f64  ┆ f64        │
# ╞══════╪════════════╡
# │ 10.0 ┆ -10.0      │
# │ 20.0 ┆ 0.0        │
# │ 30.0 ┆ 10.0       │
# └──────┴────────────┘
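
The claim about interpretable intercepts can be checked with a small least-squares fit. The sketch below uses only the standard library; `fit_line` and the `y` values are made up for illustration. After centering, the intercept equals the mean of the outcome (the prediction at the average x) while the slope is unchanged.

```python
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


x = [10.0, 20.0, 30.0]
y = [12.0, 19.0, 32.0]  # made-up outcome values for illustration
x_centered = [v - statistics.fmean(x) for v in x]  # [-10.0, 0.0, 10.0]

raw_intercept, raw_slope = fit_line(x, y)
ctr_intercept, ctr_slope = fit_line(x_centered, y)
print(raw_intercept, raw_slope)  # 1.0 1.0
print(ctr_intercept, ctr_slope)  # 21.0 1.0
```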

lag¶

lag(col: str | pl.Expr, n: int = 1) -> pl.Expr

Lag (shift backward) values by n positions.

Shifts values down, inserting nulls at the beginning. Useful for time series analysis and creating lagged predictors.

Parameters:

Name	Type	Description	Default
`col`	`str \| Expr`	Column name or Polars expression.	required
`n`	`int`	Number of positions to shift. Default 1.	`1`

Returns:

Type	Description
`Expr`	Polars expression computing lagged values.

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
df.with_columns(expr.lag("x", 2).alias("x_lag2"))
# shape: (5, 2)
# ┌─────┬────────┐
# │ x   ┆ x_lag2 │
# │ --- ┆ ---    │
# │ i64 ┆ i64    │
# ╞═════╪════════╡
# │ 1   ┆ null   │
# │ 2   ┆ null   │
# │ 3   ┆ 1      │
# │ 4   ┆ 2      │
# │ 5   ┆ 3      │
# └─────┴────────┘
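
For readers who want the shifting rule spelled out, here is a plain-Python equivalent on a list. `lag_list` is a hypothetical stand-in used only to illustrate the semantics (None plays the role of null); it says nothing about how bossanova implements lag internally.

```python
def lag_list(values, n=1):
    """Shift values down by n positions, padding the start with None."""
    if n < 0:
        raise ValueError("n must be non-negative")
    n = min(n, len(values))  # shifting past the end leaves only None
    return [None] * n + list(values[: len(values) - n])


x = [1, 2, 3, 4, 5]
print(lag_list(x, 2))  # [None, None, 1, 2, 3]
print(lag_list(x))     # [None, 1, 2, 3, 4]
```

Row i of the result holds the value from row i - n, so the first n rows have nothing to look back to.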

lead¶

lead(col: str | pl.Expr, n: int = 1) -> pl.Expr

Lead (shift forward) values by n positions.

Shifts values up, inserting nulls at the end. Useful for time series analysis and creating lead predictors.

Parameters:

Name	Type	Description	Default
`col`	`str \| Expr`	Column name or Polars expression.	required
`n`	`int`	Number of positions to shift. Default 1.	`1`

Returns:

Type	Description
`Expr`	Polars expression computing lead values.

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
df.with_columns(expr.lead("x", 2).alias("x_lead2"))
# shape: (5, 2)
# ┌─────┬─────────┐
# │ x   ┆ x_lead2 │
# │ --- ┆ ---     │
# │ i64 ┆ i64     │
# ╞═════╪═════════╡
# │ 1   ┆ 3       │
# │ 2   ┆ 4       │
# │ 3   ┆ 5       │
# │ 4   ┆ null    │
# │ 5   ┆ null    │
# └─────┴─────────┘
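
A common use of a lead column is computing the change over the next n steps. The sketch below shows this on a plain list; `lead_list` is a hypothetical stand-in that mirrors the example above, with None in place of null, and is not bossanova code.

```python
def lead_list(values, n=1):
    """Shift values up by n positions, padding the end with None."""
    if n < 0:
        raise ValueError("n must be non-negative")
    n = min(n, len(values))
    return list(values[n:]) + [None] * n


x = [1, 2, 3, 4, 5]
x_lead2 = lead_list(x, 2)
# Change over the next two steps; undefined where no future value exists.
change = [None if nxt is None else nxt - cur for cur, nxt in zip(x, x_lead2)]
print(x_lead2)  # [3, 4, 5, None, None]
print(change)   # [2, 2, 2, None, None]
```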

norm¶

norm(col: str | pl.Expr) -> pl.Expr

Normalize magnitude: x / std(x).

Divides by the standard deviation without centering. Useful when you want to keep zero fixed (no shift in location) while normalizing variance; note that the mean is rescaled by 1 / std(x) rather than preserved.

Parameters:

Name	Type	Description	Default
`col`	`str \| Expr`	Column name or Polars expression.	required

Returns:

Type	Description
`Expr`	Polars expression computing normalized values.

Note: Returns inf for zero-variance columns. Uses sample std (ddof=1).

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [0.0, 10.0, 20.0]})
df.with_columns(expr.norm("x").alias("x_norm"))
# shape: (3, 2)
# ┌──────┬────────┐
# │ x    ┆ x_norm │
# │ ---  ┆ ---    │
# │ f64  ┆ f64    │
# ╞══════╪════════╡
# │ 0.0  ┆ 0.0    │
# │ 10.0 ┆ 1.0    │
# │ 20.0 ┆ 2.0    │
# └──────┴────────┘
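
As a quick standard-library check of what dividing without centering does to the summary statistics, the sketch below reuses the data from the example above. It is an arithmetic illustration only and assumes the sample standard deviation (ddof=1) stated in the note.

```python
import statistics

x = [0.0, 10.0, 20.0]
sd = statistics.stdev(x)      # sample SD (ddof=1): 10.0
normed = [v / sd for v in x]  # zeros stay zero, signs are unchanged

print(normed)                     # [0.0, 1.0, 2.0]
print(statistics.fmean(normed))   # 1.0, i.e. mean(x) / std(x)
print(statistics.stdev(normed))   # 1.0
```

The result has unit standard deviation, and its mean is the original mean divided by the standard deviation.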

rank¶

rank(col: str | pl.Expr, method: str = 'average') -> pl.Expr

Rank values using the average method (matches R’s default rank()).

Assigns ranks to values, with tied values receiving the average of their ranks. Useful for rank-based regression or non-parametric transforms.

Parameters:

Name	Type	Description	Default
`col`	`str \| Expr`	Column name or Polars expression.	required
`method`	`str`	Ranking method. One of "average", "min", "max", "dense", "ordinal". The default, "average", matches R's rank().	`'average'`

Returns:

Type	Description
`Expr`	Polars expression computing ranked values.

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [3.0, 1.0, 2.0, 2.0, 5.0]})
df.with_columns(expr.rank("x").alias("x_rank"))
# shape: (5, 2)
# ┌─────┬────────┐
# │ x   ┆ x_rank │
# │ --- ┆ ---    │
# │ f64 ┆ f64    │
# ╞═════╪════════╡
# │ 3.0 ┆ 4.0    │
# │ 1.0 ┆ 1.0    │
# │ 2.0 ┆ 2.5    │
# │ 2.0 ┆ 2.5    │
# │ 5.0 ┆ 5.0    │
# └─────┴────────┘
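
The tie-handling rules are easier to compare side by side. The sketch below is a plain-Python illustration of the "average", "min", "max", and "dense" methods on a list; `rank_list` is a hypothetical helper, it omits "ordinal", and it is not how bossanova or Polars computes ranks internally.

```python
def rank_list(values, method="average"):
    """Rank a list with 1-based ranks; ties are resolved according to `method`."""
    if method not in {"average", "min", "max", "dense"}:
        raise ValueError(f"unsupported method: {method}")
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0  # position in `order` where the current tie group begins
    group = 0  # number of distinct values seen so far (used by "dense")
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        group += 1
        lo, hi = start + 1, end + 1  # 1-based positions covered by the tie group
        tied_rank = {
            "average": (lo + hi) / 2,
            "min": float(lo),
            "max": float(hi),
            "dense": float(group),
        }[method]
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


x = [3.0, 1.0, 2.0, 2.0, 5.0]
print(rank_list(x))           # [4.0, 1.0, 2.5, 2.5, 5.0]
print(rank_list(x, "min"))    # [4.0, 1.0, 2.0, 2.0, 5.0]
print(rank_list(x, "dense"))  # [3.0, 1.0, 2.0, 2.0, 4.0]
```

The two tied values occupy positions 2 and 3, so "average" assigns both 2.5, "min" assigns 2, and "dense" leaves no gap after the tie.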

scale¶

scale(col: str | pl.Expr) -> pl.Expr

Gelman scaling: (x - mean(x)) / (2 * std(x)).

Standardization dividing by 2 standard deviations, following Gelman (2008). This makes continuous predictor coefficients directly comparable to binary predictor coefficients (which span their full range of 0 to 1).

Parameters:

Name	Type	Description	Default
`col`	`str \| Expr`	Column name or Polars expression.	required

Returns:

Type	Description
`Expr`	Polars expression computing Gelman-scaled values.

Note: Returns NaN for zero-variance columns. Uses sample std (ddof=1).

Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine, 27(15), 2865-2873.

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df.with_columns(expr.scale("x").alias("x_scaled"))
# shape: (5, 2)
# ┌─────┬───────────┐
# │ x   ┆ x_scaled  │
# │ --- ┆ ---       │
# │ f64 ┆ f64       │
# ╞═════╪═══════════╡
# │ 1.0 ┆ -0.632456 │
# │ 2.0 ┆ -0.316228 │
# │ 3.0 ┆ 0.0       │
# │ 4.0 ┆ 0.316228  │
# │ 5.0 ┆ 0.632456  │
# └─────┴───────────┘
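
The comparability argument rests on a simple fact: a balanced 0/1 predictor has a standard deviation of 0.5, and a Gelman-scaled column also has a standard deviation of 0.5. The standard-library sketch below checks both numbers; the `binary` column is made-up data for illustration.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
mean = statistics.fmean(x)
sd = statistics.stdev(x)  # sample SD (ddof=1)
scaled = [(v - mean) / (2 * sd) for v in x]

binary = [0, 1] * 50  # balanced binary predictor, made up for illustration

sd_scaled = statistics.stdev(scaled)   # 0.5 up to floating-point rounding
sd_binary = statistics.pstdev(binary)  # population SD: exactly 0.5
print(round(sd_scaled, 6))  # 0.5
print(sd_binary)            # 0.5
```

A one-unit change in the scaled column therefore corresponds to two standard deviations of the original variable, which is roughly what moving a balanced binary predictor from 0 to 1 represents. The sample SD of the binary column at n = 100 is about 0.5025, so the match is approximate rather than exact.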

to_expr¶

to_expr(col: str | pl.Expr) -> pl.Expr

Convert string column name to Polars expression.

winsorize¶

winsorize(col: str | pl.Expr, lower: float = 0.01, upper: float = 0.99) -> pl.Expr

Winsorize values by capping at percentiles.

Clips extreme values to specified percentile bounds. Useful for reducing the influence of outliers without removing observations.

Parameters:

Name	Type	Description	Default
`col`	`str \| Expr`	Column name or Polars expression.	required
`lower`	`float`	Lower percentile bound (0-1). Default 0.01 (1st percentile).	`0.01`
`upper`	`float`	Upper percentile bound (0-1). Default 0.99 (99th percentile).	`0.99`

Returns:

Type	Description
`Expr`	Polars expression computing winsorized values.

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
df.with_columns(expr.winsorize("x", lower=0.1, upper=0.9).alias("x_wins"))
# shape: (5, 2)
# ┌───────┬────────┐
# │ x     ┆ x_wins │
# │ ---   ┆ ---    │
# │ f64   ┆ f64    │
# ╞═══════╪════════╡
# │ 1.0   ┆ 1.4    │
# │ 2.0   ┆ 2.0    │
# │ 3.0   ┆ 3.0    │
# │ 4.0   ┆ 4.0    │
# │ 100.0 ┆ 61.6   │
# └───────┴────────┘
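
The bounds in the example above (1.4 and 61.6) are consistent with quantiles computed by linear interpolation between neighboring sorted values. The sketch below reproduces them in plain Python under that assumption; `quantile_linear` and `winsorize_list` are hypothetical helpers, and the interpolation rule is inferred from the example output rather than from bossanova's source.

```python
import math


def quantile_linear(values, q):
    """Quantile with linear interpolation between neighboring sorted values."""
    s = sorted(values)
    pos = q * (len(s) - 1)  # fractional index into the sorted data
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def winsorize_list(values, lower=0.01, upper=0.99):
    """Clip each value to the [lower, upper] quantile bounds of the data."""
    lo_bound = quantile_linear(values, lower)
    hi_bound = quantile_linear(values, upper)
    return [min(max(v, lo_bound), hi_bound) for v in values]


x = [1.0, 2.0, 3.0, 4.0, 100.0]
x_wins = winsorize_list(x, lower=0.1, upper=0.9)
print([round(v, 6) for v in x_wins])  # [1.4, 2.0, 3.0, 4.0, 61.6]
```

Only the two extreme rows change; the outlier at 100.0 is pulled down to the 90th-percentile bound rather than dropped.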

zscore¶

zscore(col: str | pl.Expr) -> pl.Expr

Z-score standardization: (x - mean(x)) / std(x).

Traditional standardization producing values with mean=0 and std=1. Coefficients represent the effect of a 1 standard deviation change.

Parameters:

Name	Type	Description	Default
`col`	`str \| Expr`	Column name or Polars expression.	required

Returns:

Type	Description
`Expr`	Polars expression computing z-scored values.

Note: Returns NaN for zero-variance columns. Uses sample std (ddof=1).

Examples:

import polars as pl
from bossanova import expressions as expr
df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df.with_columns(expr.zscore("x").alias("x_z"))
# shape: (5, 2)
# ┌─────┬───────────┐
# │ x   ┆ x_z       │
# │ --- ┆ ---       │
# │ f64 ┆ f64       │
# ╞═════╪═══════════╡
# │ 1.0 ┆ -1.264911 │
# │ 2.0 ┆ -0.632456 │
# │ 3.0 ┆ 0.0       │
# │ 4.0 ┆ 0.632456  │
# │ 5.0 ┆ 1.264911  │
# └─────┴───────────┘
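
The ddof=1 choice matters when comparing results against tools that default to the population standard deviation (ddof=0), such as NumPy's std. The standard-library sketch below shows the size of the difference on the example data; it illustrates the arithmetic only.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
mean = statistics.fmean(x)

sample_sd = statistics.stdev(x)  # ddof=1: sqrt(10 / 4)
pop_sd = statistics.pstdev(x)    # ddof=0: sqrt(10 / 5)

z_sample = [(v - mean) / sample_sd for v in x]
z_pop = [(v - mean) / pop_sd for v in x]

print(round(sample_sd, 6), round(pop_sd, 6))      # 1.581139 1.414214
print(round(z_sample[0], 6), round(z_pop[0], 6))  # -1.264911 -1.414214
```

The ddof=1 values match the x_z column above; the ddof=0 values are larger in magnitude by a factor of sqrt(n / (n - 1)).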