Polars expression helpers for common statistical transforms.
This module provides stateless Polars expressions for data transformations
commonly used in statistical modeling. These match the semantics of the
formula-based transforms in formula.
Functions:
| Name | Description |
|---|---|
| center | Mean-centering: x - mean(x). |
| lag | Lag (shift backward) values by n positions. |
| lead | Lead (shift forward) values by n positions. |
| norm | Normalize magnitude: x / std(x). |
| rank | Rank values using the average method (matches R’s default rank()). |
| scale | Gelman scaling: (x - mean(x)) / (2 * std(x)). |
| to_expr | Convert string column name to Polars expression. |
| winsorize | Winsorize values by capping at percentiles. |
| zscore | Z-score standardization: (x - mean(x)) / std(x). |
Core Transforms:
- center(col): subtract mean only
- norm(col): divide by std only
- zscore(col): traditional z-score (1 SD)
- scale(col): Gelman scaling (2 SD)
Additional Helpers:
- rank(col, method): average-method rank transform
- winsorize(col, lower, upper): percentile capping
- lag(col, n): lagged values
- lead(col, n): lead values
The scale() transform follows the Gelman (2008) recommendation to divide by two standard deviations, which makes coefficients of continuous predictors directly comparable to coefficients of binary predictors in regression models.
Note: These are stateless expressions. For zero-variance columns, norm(), zscore(), and scale() will produce inf/NaN values. Handle these in your pipeline if needed.
Examples:
```python
import polars as pl
from bossanova.expressions import center, zscore

df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df.with_columns(
    center("x").alias("x_centered"),
    zscore("x").alias("x_zscore"),
)
# shape: (5, 3)
# ┌─────┬────────────┬──────────┐
# │ x   ┆ x_centered ┆ x_zscore │
# │ --- ┆ ---        ┆ ---      │
# │ f64 ┆ f64        ┆ f64      │
# ╞═════╪════════════╪══════════╡
# │ 1.0 ┆ -2.0       ┆ -1.26... │
# │ 2.0 ┆ -1.0       ┆ -0.63... │
# │ 3.0 ┆ 0.0        ┆ 0.0      │
# │ 4.0 ┆ 1.0        ┆ 0.63...  │
# │ 5.0 ┆ 2.0        ┆ 1.26...  │
# └─────┴────────────┴──────────┘
```

## Functions
### center

```python
center(col: str | pl.Expr) -> pl.Expr
```

Mean-centering: x - mean(x).

Subtracts the column mean from each value. Useful for interpretable intercepts in regression models.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str \| Expr | Column name or Polars expression. | required |

Returns:

| Type | Description |
|---|---|
| Expr | Polars expression computing centered values. |

Examples:
```python
import polars as pl
from bossanova import expressions as expr

df = pl.DataFrame({"x": [10.0, 20.0, 30.0]})
df.with_columns(expr.center("x").alias("x_centered"))
# shape: (3, 2)
# ┌──────┬────────────┐
# │ x    ┆ x_centered │
# │ ---  ┆ ---        │
# │ f64  ┆ f64        │
# ╞══════╪════════════╡
# │ 10.0 ┆ -10.0      │
# │ 20.0 ┆ 0.0        │
# │ 30.0 ┆ 10.0       │
# └──────┴────────────┘
```

### lag
```python
lag(col: str | pl.Expr, n: int = 1) -> pl.Expr
```

Lag (shift backward) values by n positions.

Shifts values down, inserting nulls at the beginning. Useful for time series analysis and creating lagged predictors.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str \| Expr | Column name or Polars expression. | required |
| n | int | Number of positions to shift. Default 1. | 1 |

Returns:

| Type | Description |
|---|---|
| Expr | Polars expression computing lagged values. |

Examples:
```python
import polars as pl
from bossanova import expressions as expr

df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
df.with_columns(expr.lag("x", 2).alias("x_lag2"))
# shape: (5, 2)
# ┌─────┬────────┐
# │ x   ┆ x_lag2 │
# │ --- ┆ ---    │
# │ i64 ┆ i64    │
# ╞═════╪════════╡
# │ 1   ┆ null   │
# │ 2   ┆ null   │
# │ 3   ┆ 1      │
# │ 4   ┆ 2      │
# │ 5   ┆ 3      │
# └─────┴────────┘
```

### lead
```python
lead(col: str | pl.Expr, n: int = 1) -> pl.Expr
```

Lead (shift forward) values by n positions.

Shifts values up, inserting nulls at the end. Useful for time series analysis and creating lead predictors.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str \| Expr | Column name or Polars expression. | required |
| n | int | Number of positions to shift. Default 1. | 1 |

Returns:

| Type | Description |
|---|---|
| Expr | Polars expression computing lead values. |

Examples:
```python
import polars as pl
from bossanova import expressions as expr

df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
df.with_columns(expr.lead("x", 2).alias("x_lead2"))
# shape: (5, 2)
# ┌─────┬─────────┐
# │ x   ┆ x_lead2 │
# │ --- ┆ ---     │
# │ i64 ┆ i64     │
# ╞═════╪═════════╡
# │ 1   ┆ 3       │
# │ 2   ┆ 4       │
# │ 3   ┆ 5       │
# │ 4   ┆ null    │
# │ 5   ┆ null    │
# └─────┴─────────┘
```

### norm

```python
norm(col: str | pl.Expr) -> pl.Expr
```

Normalize magnitude: x / std(x).

Divides by the standard deviation without centering. Useful when you want unit variance without shifting the zero point; the mean is rescaled by 1 / std(x) rather than removed.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str \| Expr | Column name or Polars expression. | required |

Returns:

| Type | Description |
|---|---|
| Expr | Polars expression computing normalized values. |

Note: Returns inf for zero-variance columns. Uses sample std (ddof=1).
Examples:
```python
import polars as pl
from bossanova import expressions as expr

df = pl.DataFrame({"x": [0.0, 10.0, 20.0]})
df.with_columns(expr.norm("x").alias("x_norm"))
# shape: (3, 2)
# ┌──────┬────────┐
# │ x    ┆ x_norm │
# │ ---  ┆ ---    │
# │ f64  ┆ f64    │
# ╞══════╪════════╡
# │ 0.0  ┆ 0.0    │
# │ 10.0 ┆ 1.0    │
# │ 20.0 ┆ 2.0    │
# └──────┴────────┘
```

### rank
```python
rank(col: str | pl.Expr, method: str = 'average') -> pl.Expr
```

Rank values using the average method (matches R’s default rank()).

Assigns ranks to values, with tied values receiving the average of their ranks. Useful for rank-based regression or non-parametric transforms.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str \| Expr | Column name or Polars expression. | required |
| method | str | Ranking method. One of "average" (default), "min", "max", "dense", "ordinal". Default "average" matches R’s rank(). | 'average' |

Returns:

| Type | Description |
|---|---|
| Expr | Polars expression computing ranked values. |

Examples:
```python
import polars as pl
from bossanova import expressions as expr

df = pl.DataFrame({"x": [3.0, 1.0, 2.0, 2.0, 5.0]})
df.with_columns(expr.rank("x").alias("x_rank"))
# shape: (5, 2)
# ┌─────┬────────┐
# │ x   ┆ x_rank │
# │ --- ┆ ---    │
# │ f64 ┆ f64    │
# ╞═════╪════════╡
# │ 3.0 ┆ 4.0    │
# │ 1.0 ┆ 1.0    │
# │ 2.0 ┆ 2.5    │
# │ 2.0 ┆ 2.5    │
# │ 5.0 ┆ 5.0    │
# └─────┴────────┘
```

### scale
```python
scale(col: str | pl.Expr) -> pl.Expr
```

Gelman scaling: (x - mean(x)) / (2 * std(x)).

Standardization dividing by 2 standard deviations, following Gelman (2008). This makes continuous predictor coefficients directly comparable to binary predictor coefficients (which span their full range of 0 to 1).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str \| Expr | Column name or Polars expression. | required |

Returns:

| Type | Description |
|---|---|
| Expr | Polars expression computing Gelman-scaled values. |

Note: Returns NaN for zero-variance columns. Uses sample std (ddof=1).

Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine, 27(15), 2865-2873.

Examples:
```python
import polars as pl
from bossanova import expressions as expr

df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df.with_columns(expr.scale("x").alias("x_scaled"))
# shape: (5, 2)
# ┌─────┬───────────┐
# │ x   ┆ x_scaled  │
# │ --- ┆ ---       │
# │ f64 ┆ f64       │
# ╞═════╪═══════════╡
# │ 1.0 ┆ -0.632456 │
# │ 2.0 ┆ -0.316228 │
# │ 3.0 ┆ 0.0       │
# │ 4.0 ┆ 0.316228  │
# │ 5.0 ┆ 0.632456  │
# └─────┴───────────┘
```

### to_expr
```python
to_expr(col: str | pl.Expr) -> pl.Expr
```

Convert string column name to Polars expression.

### winsorize

```python
winsorize(col: str | pl.Expr, lower: float = 0.01, upper: float = 0.99) -> pl.Expr
```

Winsorize values by capping at percentiles.

Clips extreme values to specified percentile bounds. Useful for reducing the influence of outliers without removing observations.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str \| Expr | Column name or Polars expression. | required |
| lower | float | Lower percentile bound (0-1). Default 0.01 (1st percentile). | 0.01 |
| upper | float | Upper percentile bound (0-1). Default 0.99 (99th percentile). | 0.99 |

Returns:

| Type | Description |
|---|---|
| Expr | Polars expression computing winsorized values. |

Examples:
```python
import polars as pl
from bossanova import expressions as expr

df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
df.with_columns(expr.winsorize("x", lower=0.1, upper=0.9).alias("x_wins"))
# shape: (5, 2)
# ┌───────┬────────┐
# │ x     ┆ x_wins │
# │ ---   ┆ ---    │
# │ f64   ┆ f64    │
# ╞═══════╪════════╡
# │ 1.0   ┆ 1.4    │
# │ 2.0   ┆ 2.0    │
# │ 3.0   ┆ 3.0    │
# │ 4.0   ┆ 4.0    │
# │ 100.0 ┆ 61.6   │
# └───────┴────────┘
```

### zscore
```python
zscore(col: str | pl.Expr) -> pl.Expr
```

Z-score standardization: (x - mean(x)) / std(x).

Traditional standardization producing values with mean=0 and std=1. Coefficients represent the effect of a 1 standard deviation change.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str \| Expr | Column name or Polars expression. | required |

Returns:

| Type | Description |
|---|---|
| Expr | Polars expression computing z-scored values. |

Note: Returns NaN for zero-variance columns. Uses sample std (ddof=1).

Examples:
```python
import polars as pl
from bossanova import expressions as expr

df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df.with_columns(expr.zscore("x").alias("x_z"))
# shape: (5, 2)
# ┌─────┬───────────┐
# │ x   ┆ x_z       │
# │ --- ┆ ---       │
# │ f64 ┆ f64       │
# ╞═════╪═══════════╡
# │ 1.0 ┆ -1.264911 │
# │ 2.0 ┆ -0.632456 │
# │ 3.0 ┆ 0.0       │
# │ 4.0 ┆ 0.632456  │
# │ 5.0 ┆ 1.264911  │
# └─────┴───────────┘
```