
Engineering

How bossanova is built

UCSD Psychology

bossanova is as much an engineering experiment as a statistical library. This section documents how and why the codebase is structured the way it is — the architecture, testing philosophy, and development practices that keep it correct and maintainable.


Architecture for Python Developers

If you’ve worked with scikit-learn, statsmodels, or PyTorch, you’re used to a particular way of organizing statistical software: classes that own their data and their behavior. A LinearRegression has a .fit() method that writes to self.coef_; a nn.Module holds both parameters and forward logic. This is the OOP pattern — objects are bundles of state and behavior.

bossanova works differently. It follows an architecture inspired by Entity-Component-System (ECS), a pattern from game engines and data-oriented programming. The core idea is simple: separate what something is from what you do with it.

Data and logic live apart

In a typical Python ML library, a model class might look like:

# OOP pattern (scikit-learn style)
class LinearRegression:
    def fit(self, X, y):
        self.coef_ = solve(X, y)        # data + logic in one place
        self.residuals_ = y - X @ self.coef_
        return self

    def predict(self, X):
        return X @ self.coef_

    def score(self, X, y):              # yet more logic on the same class
        ...

The model owns the math. Adding a new capability (bootstrap, marginal effects, cross-validation) means adding methods and internal state to this class. Over time, model classes accumulate dozens of methods, hundreds of lines, and tangled internal dependencies.

bossanova inverts this. All data lives in containers — frozen, immutable structs with no behavior beyond validation. All logic lives in operations — pure functions that accept containers and return containers:

# bossanova's ECS-inspired pattern
@frozen
class FitState:                          # Container: just data
    coef: np.ndarray
    vcov: np.ndarray
    residuals: np.ndarray
    ...

def fit_model(spec: ModelSpec, data: DataBundle) -> FitState:    # Operation: just logic
    ...

def compute_emm(spec: ModelSpec, data: DataBundle, fit: FitState, ...) -> MeeState:
    ...

Adding a new analysis technique means writing a new function in the appropriate internal/ domain module, not touching the model class.

The model is a facade

Users interact with a single model() class that looks conventional:

m = model("y ~ x + treatment", data).fit().infer()
m.params

But the model class is a facade — a thin orchestration layer that dispatches to internal operations and stores results. It contains no math, no algorithms, no data manipulation. Strip the logic from any model method and what remains is:

check preconditions → call operation → store result → return self

This is a deliberate inversion of the scikit-learn pattern. In scikit-learn, the estimator is the implementation. In bossanova, the model is a coordinator. It knows what to call, not how to compute.
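The facade shape can be sketched in miniature. This is a hypothetical illustration, not bossanova's actual internals — `Model`, `FitState`, and `fit_model` here are stand-in names, with plain NumPy least squares as the "operation":

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class FitState:                       # container: just data
    coef: np.ndarray
    residuals: np.ndarray

def fit_model(X: np.ndarray, y: np.ndarray) -> FitState:
    # operation: pure function — all of the math lives here
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return FitState(coef=coef, residuals=y - X @ coef)

class Model:
    """Facade: no math, only orchestration."""

    def __init__(self, X: np.ndarray, y: np.ndarray):
        self._X, self._y = X, y
        self._fit = None

    def fit(self):
        if self._fit is not None:                # check preconditions
            raise RuntimeError("model is already fitted")
        self._fit = fit_model(self._X, self._y)  # call operation, store result
        return self                              # return self

    @property
    def params(self) -> np.ndarray:
        if self._fit is None:
            raise RuntimeError("call .fit() first")
        return self._fit.coef
```

Every public method follows the same four-step shape; the facade stays thin because anything resembling math is pushed into an operation.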

Composition over inheritance

Most Python statistical libraries use inheritance to share behavior. Scikit-learn has BaseEstimator → LinearModel → LinearRegression. statsmodels has Model → GenericLikelihoodModel → specific models. This works, but creates coupling: changing a base class method ripples through every subclass.

bossanova uses a single model() class for all model types (lm, glm, lmer, glmer). Type is inferred from the formula and family, not from class identity. Instead of inheriting shared behavior from base classes, all four model types share the same operations:

| What changes across model types | Where it lives |
| --- | --- |
| Solver algorithm (QR, IRLS, PLS, PIRLS) | internal/maths/solvers/ |
| Family/link math (Gaussian, Binomial, ...) | internal/maths/family/ |
| Everything else (inference, marginal effects, grids, ...) | Shared across all types |

When the inference code improves, all four model types benefit — not because they inherit from a common base, but because they all call the same functions with the same container types.
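Type-from-formula dispatch might be sketched like this. This is a hypothetical illustration of the idea, not bossanova's actual inference logic — `infer_model_type` and its rules are assumptions based on the four model types named above:

```python
from typing import Optional

def infer_model_type(formula: str, family: Optional[str]) -> str:
    """Hypothetical sketch: dispatch on formula + family, not on class identity."""
    has_random = "|" in formula                  # e.g. "(1 | subject)" → mixed model
    non_gaussian = family is not None and family != "gaussian"
    if has_random and non_gaussian:
        return "glmer"
    if has_random:
        return "lmer"
    if non_gaussian:
        return "glm"
    return "lm"
```

One function replaces a four-class hierarchy: the model type is a value computed from the inputs, not an identity baked into a subclass.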

Why ECS?

The ECS pattern comes from game engines (Unity, Bevy, Amethyst), where it solves a different problem: managing thousands of game entities with varying capabilities without class explosion. The statistical analogy is apt: a model has phases (unfitted → fitted → inferred), optional capabilities (marginal effects, prediction, simulation), and data that grows over time (coefficients → standard errors → confidence intervals). An inheritance hierarchy for all combinations would be unwieldy.

ECS gives bossanova three practical properties:

  1. Testability without mocking. Operations are pure functions: construct a container, call the function, assert on the output. No mock.patch, no fixture factories, no dependency injection.

  2. Debuggability. When something goes wrong, inspect the container at each step. Reproduce any bug by constructing the right FitState and calling the right function — no need to set up a full model pipeline.

  3. Extensibility without modification. New analysis techniques are new functions, not new methods or subclasses. The model class doesn’t grow; internal/ domain modules do.
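Property 1 in practice: because operations are pure functions of containers, a test is just "build a container, call the function, assert." The names below (`FitState`, `standard_errors`) are illustrative, not bossanova's real API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class FitState:                 # hypothetical container for this sketch
    coef: np.ndarray
    vcov: np.ndarray

def standard_errors(fit: FitState) -> np.ndarray:
    # operation under test: a pure function of its input container
    return np.sqrt(np.diag(fit.vcov))

def test_standard_errors_are_sqrt_of_vcov_diagonal():
    # no mock.patch, no fixture factories: construct the container by hand
    fit = FitState(coef=np.array([1.0, 2.0]),
                   vcov=np.array([[4.0, 0.0],
                                  [0.0, 9.0]]))
    assert np.allclose(standard_errors(fit), [2.0, 3.0])
```

No pipeline setup is needed to reach the code under test — the container *is* the setup.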

A mental model for contributors

If you’re contributing to bossanova, the key shift is: think about the data shape first, not the method signature. Before writing any operation, ask which containers it consumes and which container it produces.

The answers define your function signature. The implementation follows from there. This is the “container-first” workflow described in the Developer Guide.


Human-AI Collaborative Development

bossanova is an experiment in modern AI-assisted software development. Despite the hype, AI-assisted development is no less principled than traditional engineering when done carefully (Ozkaya et al., 2024; Peng et al., 2023).

  1. AI works best as a collaborative partner, not an autonomous agent. Human expertise supplies critical judgment — statistical requirements, API design, numerical tolerances. AI assists with implementation details that can be verified against those requirements.

  2. Dialogue is Planning. Planning is Dialogue. The planning process matters more for the human than the AI. It forces declarative thinking: What are you trying to achieve? What are your success criteria? How would you verify them? Vague instructions produce vague code. Precise specifications produce testable implementations.

  3. The Independence Test. If tomorrow AI didn’t exist, could a human easily understand how to work with what was built? Is the tool tethered to AI such that development would break without it? If it is, something has gone wrong. The codebase must be self-explanatory to a reader with no AI assistance.

Artifacts as Knowledge

Why invoke anthropology and cognitive science for a Python library? Because the challenge of software development is fundamentally about knowledge artifacts—objects that carry meaning across contexts and time.

Distributed Cognition (Hutchins, 1995): Edwin Hutchins studied navigation teams on naval ships and showed that cognition isn’t just “in the head”—it’s distributed across people, tools, and procedures. A navigation chart isn’t just paper; it’s an active participant in the cognitive system. Similarly, bossanova’s test suite isn’t just verification—it’s part of how the system “knows” what correct behavior looks like.

Boundary Objects (Star & Griesemer, 1989): Susan Leigh Star and James Griesemer introduced the concept of objects that maintain identity across different communities while being adaptable to local needs. The parity test specifications are exactly this: they mean something specific to the R generator, something different to the Python comparator, and something else to a human reviewer—yet it’s the same artifact enabling coordination.

Making as Knowing (Ingold, 2013): Tim Ingold argues that artifacts aren’t inert products but processes of growth and engagement. Code that passes tests isn’t “done”—it’s a living artifact that changes how we understand the problem. Each parity test failure teaches us something about the gap between our mental model and R’s actual implementation.

The implication: tests aren’t just verification. They’re executable knowledge—artifacts that encode understanding and enable its transmission across time and collaborators.

Codebase Layout

The ECS-inspired architecture materializes as a hard boundary between internal implementation and user-facing API.

The boundary rule

internal/ owns all implementation. Everything outside is user-facing glue.

bossanova/
├── internal/                  # ALL implementation lives here
│   ├── containers/            #   "Entities" — frozen data structs
│   │   ├── structs/           #     Pure frozen data classes
│   │   ├── builders/          #     Smart constructors
│   │   └── validators.py      #     Shared validators/converters
│   ├── design/                #   Design matrix coding (treatment, sum, etc.)
│   ├── formula/               #   Formula parsing, design matrices
│   ├── marginal/              #   EMMs, slopes, contrasts
│   ├── fit/                   #   Model fitting, diagnostics, convergence
│   ├── infer/                 #   Inference (bootstrap, permutation, CV, resample/)
│   ├── compare/               #   Model comparison (LRT, AIC)
│   ├── simulation/            #   Data simulation
│   ├── rendering/             #   Summary display
│   ├── maths/                 #   Pure math (backend-aware)
│   │   ├── backend/           #     JAX/NumPy dispatch (ArrayOps)
│   │   ├── solvers/           #     QR, IRLS, PLS, PIRLS
│   │   ├── linalg/            #     Linear algebra primitives
│   │   ├── inference/         #     Statistical inference math
│   │   ├── family/            #     GLM families and link functions
│   │   ├── distributions/     #     Distribution utilities
│   │   └── rng.py             #     Backend-agnostic RNG
│   └── viz/                   #   Plotting implementations
│
├── model/                     # User-facing — thin orchestrator
├── distributions/             # User-facing — distribution factories
├── data/                      # User-facing — dataset loading
├── expressions.py             # User-facing — formula transforms
└── __init__.py                # Re-exports: compare, viz, lrt

Layer dependencies

Dependencies flow in one direction:

containers  ←  domain modules  ←  model (user-facing)
     ↑              ↑
     └──── maths ───┘

Hard rule: dependencies only point leftward in this diagram — containers and maths import no domain modules, and nothing inside internal/ imports from the user-facing layer.


Three-Layer Testing

Test-driven development is non-negotiable. bossanova uses three complementary verification layers:

| Layer | What it catches | Example |
| --- | --- | --- |
| Unit tests | Logic errors, regressions | test_lmer_fit_returns_params() |
| Parity tests | Numerical drift from reference | Coefficients match lme4 to 1e-6 |
| Executable docs | API usability, workflow gaps | Tutorial code actually runs |

Unit tests verify internal correctness — that functions do what they claim. The most valuable unit tests target internal/ domain modules and internal/maths/ directly: pure functions with clear inputs and outputs.

Parity tests verify external correctness — that results match authoritative implementations. When bossanova and R disagree, we investigate whether it’s a bug, a difference in defaults, or a genuine methodological choice, and document accordingly.

Executable documentation ensures that every tutorial and guide actually works. If the code in the docs doesn’t run, the documentation build fails. This catches API drift, missing imports, and workflow gaps that unit tests miss.

This approach is inspired by the concept of a development harness or eval harness from AI engineering (Gao et al., 2023). But we find it more helpful to think of this as verifiable shared knowledge—artifacts that carry their own evidence of correctness.


Theorem-Driven Development

Beyond parity testing, bossanova verifies mathematical properties through property-based testing with Hypothesis. Each test encodes a theorem that must hold for any valid input, verified automatically across thousands of random cases.

The Theorem Reference documents 30+ verified properties organized by domain.

Each theorem includes a formal statement, an intuition explaining why the property matters, dependencies (which theorems it builds on), and enables (which downstream theorems depend on it).

The dependency graph creates a theorem chain — a formal structure showing how foundational properties (like SVD decomposition) enable higher-level guarantees (like correct standard errors). When a theorem fails, the dependency chain identifies which assumptions broke.


R Parity as Validation

bossanova’s numerical results match R’s lme4 within floating-point tolerance (~1e-10 to 1e-6 depending on method). Our goal is simple: when two independent implementations, written in different languages by different people, agree to ten decimal places, you can trust both. Parity testing catches subtle errors that unit tests miss: wrong degrees of freedom, incorrect variance component extraction, numerical instability in edge cases.

R is authoritative because it’s been battle-tested for decades by statisticians who care about correctness. The parity test suite covers all four model types:

pixi run parity            # All fast R parity tests
pixi run parity-lm-r       # LM vs R
pixi run parity-glm-r      # GLM vs R
pixi run parity-lmer-r     # LMER vs R
pixi run parity-glmer-r    # GLMER vs R
pixi run parity-emmeans-r  # EMMs vs R
pixi run parity-compare-r  # Compare vs R
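The parity idea in miniature: two independent routes to the same answer must agree to floating-point tolerance. This sketch compares NumPy's QR-based solver against explicit normal equations rather than against R, but the assertion pattern is the same:

```python
import numpy as np

# A small OLS problem with a known closed-form answer
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.1, 2.9, 5.2, 6.8])

coef_qr, *_ = np.linalg.lstsq(X, y, rcond=None)   # route 1: QR/SVD solver
coef_ne = np.linalg.solve(X.T @ X, X.T @ y)       # route 2: normal equations

# Parity assertion: independent implementations agree to tight tolerance
assert np.allclose(coef_qr, coef_ne, atol=1e-10)
```

When two routes disagree beyond tolerance, that disagreement localizes the bug (or the genuine methodological difference) far more sharply than a failing end-to-end test.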

Dual Environment Support

bossanova runs in two environments:

| Environment | Backend | Use case |
| --- | --- | --- |
| Native Python | JAX (default) or NumPy | Local development, performance |
| Browser/Pyodide | NumPy only | JupyterLite, marimo, education |

All code, architecture, and feature decisions must consider both environments. The backend abstraction (internal/maths/backend/) provides a unified API via ArrayOps. Code using ops.np, ops.qr(), ops.jit() works in both environments automatically.

JAX provides JIT compilation and hardware acceleration for hot loops (IRLS, resampling) where compilation provides 2-4x speedup. But it’s never used for the entire workflow — compilation overhead (~100ms) makes it counterproductive for single-call operations.

NumPy provides Pyodide compatibility for browser-based environments. The RNG abstraction (internal/maths/rng.py) ensures reproducibility across both backends. JAX is never imported unconditionally — all JAX imports use try/except with NumPy fallbacks.
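That import discipline can be sketched as follows. The names here (`xp`, the no-op `jit`, `weighted_norm`) are illustrative stand-ins, not bossanova's actual ArrayOps API:

```python
# Guarded JAX import with a NumPy fallback: JAX is never a hard dependency.
try:
    import jax.numpy as xp            # JAX backend: JIT + hardware acceleration
    from jax import jit
    HAS_JAX = True
except ImportError:                   # Pyodide / browser: NumPy only
    import numpy as xp
    HAS_JAX = False

    def jit(fn, **kwargs):            # no-op fallback keeps call sites uniform
        return fn

def weighted_norm(w, r):
    # The same source runs under either backend via the shared `xp` namespace
    return xp.sqrt(xp.sum(w * r * r))

fast_norm = jit(weighted_norm)        # compiled under JAX, plain call under NumPy
```

Call sites never branch on the backend: they write against `xp` and `jit` once, and the environment decides what those names mean.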

The dual-backend constraint is one of the most impactful architectural decisions in the codebase. It forces clean separation between algorithm structure and array operations, which in turn makes the code more portable and testable.

References
  1. Ozkaya, I., Walkinshaw, N., & Charitsis, C. (2024). Human AI Collaboration in Software Engineering: Lessons Learned from a Hands On Workshop. Proceedings of the ACM/IEEE International Workshop on Software-Intensive Business. 10.1145/3643690.3648236
  2. Peng, S., Kalliamvakou, E., Cihon, P., & Demirer, M. (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv Preprint.
  3. Hutchins, E. (1995). Cognition in the Wild. MIT Press. 10.7551/mitpress/1881.001.0001
  4. Star, S. L., & Griesemer, J. R. (1989). Institutional Ecology, ‘Translations,’ and Boundary Objects: Amateurs and Professionals in Berkeley’s Museum of Vertebrate Zoology, 1907–39. Social Studies of Science, 19(3), 387–420. 10.1177/030631289019003001
  5. Ingold, T. (2013). Making: Anthropology, Archaeology, Art and Architecture. Routledge.
  6. Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., … Zou, A. (2023). A Framework for Few-shot Language Model Evaluation. Zenodo. 10.5281/zenodo.10256836