Analysis: Data and Formulas

This page covers the front half of the analysis stack: feeding dataframes into models, declaring a model with a brms-style formula, and inspecting or emitting the QVR source that the formula compiles to. The back half (fitting, diagnostics, algebra-guided training tooling) lives in Fitting and Diagnostics.

Architecture

Four small subpackages, each consumable independently:

flowchart TB
    F["quivers.formulas<br/>brms-style formula to typed AST to QVR program"]
    D["quivers.data<br/>DataFrame to object cardinalities + observations"]
    G["quivers.diagnostics<br/>MCMCResult to ArviZ DataTree, compare, PPC"]
    E["quivers.dsl.emit<br/>Module AST to canonical .qvr source"]
    F --> D
    F --> E
    F --> G
    D --> G

Each subpackage is gated behind an optional dependency extra, so a user who only wants the DSL and inference doesn't pull in pandas, polars, arviz, or formulae. Install everything together via pip install "quivers[analysis]".
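The gating described above is commonly implemented with a guarded import. The following is a minimal sketch of that general pattern, not the library's actual mechanism; the helper name `require` is hypothetical, and only the extra name `analysis` comes from this page.

```python
import importlib
import importlib.util


def require(module_name: str, extra: str = "analysis"):
    """Import an optional top-level dependency, or raise with an install hint.

    Hypothetical helper for illustration; quivers may gate imports differently.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name!r} is not installed; "
            f'try: pip install "quivers[{extra}]"'
        )
    return importlib.import_module(module_name)


# A stdlib module stands in for an optional dependency such as pandas.
json_module = require("json")
```

With a guard like this, the import error surfaces only when the optional feature is first used, and the message points at the extra that provides it.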

Dataframes: quivers.data

DatasetSchema is a typed didactic.api.Model that maps dataframe columns to QVR-program artifacts. It accepts pandas, polars, or any other Narwhals-compatible dataframe.

import pandas as pd
from quivers.data import DatasetSchema, compose

df = pd.DataFrame({
    "verb": ["eat", "drink", "run", "eat", ...],
    "subject": ["s1", "s2", "s1", "s3", ...],
    "rt": [0.31, 0.42, 0.28, 0.55, ...],
    "response": [1, 0, 1, 1, ...],
})

schema = DatasetSchema(
    df=df,
    objects={"verb": "Verb", "subject": "Subject"},
    plate_indices={"verb": "verb_idx", "subject": "subj_idx"},
    covariates={"rt": "rt"},
    observations={"response": "y"},
)

print(schema.declarations())          # object Verb : FinSet 17 / object Subject : FinSet 50
print(schema.cardinalities)           # {"Verb": 17, "Subject": 50}
obs = schema.observations_dict()      # {"verb_idx": tensor, "subj_idx": tensor, ...}

Two artifacts come out:

  • declarations() emits a .qvr prelude with one object X : N line per declared object axis. The cardinality is inferred from df[col].n_unique(); canonical category ordering is the column's sorted unique non-null values so plate indices are reproducible across reruns.
  • observations_dict() packs the per-row tensors that inference consumes (response, plate indices, numeric covariates), ready to pass into SVI.step or MCMC.run.
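To make the reproducibility claim concrete, here is a minimal pure-Python sketch of encoding a column by its sorted unique non-null values. The function name `encode_categories` is hypothetical and the code does not call into quivers.data; it only illustrates why sorting makes the indices independent of row order.

```python
def encode_categories(values):
    """Return (categories, indices) using sorted unique non-null values.

    Illustrative only: None is passed through unchanged, since how missing
    rows are treated is a separate policy decision.
    """
    categories = sorted({v for v in values if v is not None})
    lookup = {cat: i for i, cat in enumerate(categories)}
    indices = [lookup[v] if v is not None else None for v in values]
    return categories, indices


verbs = ["eat", "drink", "run", "eat"]
categories, verb_idx = encode_categories(verbs)

# Reordering the rows changes the index sequence but not the category order,
# so "eat" maps to the same integer in both runs.
shuffled_categories, _ = encode_categories(["run", "eat", "eat", "drink"])
```

The cardinality of the axis is then simply `len(categories)`.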

The companion compose(qvr_body, schema) prepends the schema's declarations to a user's .qvr body before compiling, so the user writes only the program body and the cardinalities come from the data.
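Conceptually, composing amounts to rendering one declaration line per object axis and placing those lines ahead of the program body. The sketch below is a stand-in, not the real compose: the function names are hypothetical, the declaration text copies the form shown in the print comment earlier on this page, and the body is a placeholder string rather than valid QVR.

```python
def declarations_sketch(cardinalities):
    """Render one declaration line per object axis (assumed line format)."""
    return "\n".join(
        f"object {name} : FinSet {size}" for name, size in cardinalities.items()
    )


def compose_sketch(qvr_body, cardinalities):
    """Prepend the declarations to a program body, separated by a blank line."""
    return declarations_sketch(cardinalities) + "\n\n" + qvr_body


# "<program body>" is a placeholder, not QVR syntax.
source = compose_sketch("<program body>", {"Verb": 17, "Subject": 50})
```

The point of the design is that the cardinalities live in one place (the data), so the hand-written body never hard-codes them.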

Missing-data handling is configurable per schema via MissingPolicy: RAISE (default), DROP, IMPUTE, or MASK.
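The four policy names suggest the following behaviours, sketched here in pure Python. This is an interpretation of the names only: the enum `MissingPolicySketch`, the function `apply_policy`, and the `fill` parameter are all hypothetical, and the real MissingPolicy may differ in details such as the imputation value or how the mask is exposed.

```python
from enum import Enum


class MissingPolicySketch(Enum):
    RAISE = "raise"    # refuse data that contains missing values
    DROP = "drop"      # remove rows with missing values
    IMPUTE = "impute"  # replace missing values with a fill value
    MASK = "mask"      # keep rows, but flag which ones were observed


def apply_policy(values, policy, fill=0.0):
    """Return (values, observed), where observed[i] is True for real data."""
    observed = [v is not None for v in values]
    if policy is MissingPolicySketch.RAISE:
        if not all(observed):
            raise ValueError("column contains missing values")
        return list(values), observed
    if policy is MissingPolicySketch.DROP:
        kept = [v for v in values if v is not None]
        return kept, [True] * len(kept)
    filled = [fill if v is None else v for v in values]
    if policy is MissingPolicySketch.IMPUTE:
        # Imputed entries are treated as if they had been observed.
        return filled, [True] * len(filled)
    # MASK: the fill value is only a placeholder; the mask marks it as unobserved.
    return filled, observed


rt = [0.31, None, 0.28]
masked_values, masked_flags = apply_policy(rt, MissingPolicySketch.MASK)
```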

Formulas: quivers.formulas

The formula frontend compiles a brms / lme4-style formula into a typed QVR Module AST. It does not concatenate source strings: the translation, FormulaToQVRModule, is a didactic.api.Lens from Formula to Module, mirroring the existing resolution-lens pattern in quivers.dsl.resolution. Formula syntax is parsed by the formulae library (the Bambi team's pure-Python brms-style parser), then lifted into a typed Formula record.

Inspect or dump the generated QVR

from quivers.formulas import formula_to_qvr

src = formula_to_qvr("y ~ poly(x, 2) + (1 | g)", data=df)
print(src)                                  # canonical .qvr source

Emission goes through quivers.dsl.emit.module_to_source, which walks the Module AST and produces canonical .qvr source. The emitted source re-parses through quivers.dsl.loads into a Module that compiles to the same program; this round-trip is exercised on every formula in the test suite.

R / brms behaviour, exactly

  • Orthogonal polynomials by default. poly(x, k) produces \(k\) orthonormal centred columns, matching R's stats::poly. Raw monomials remain available via I(x**k).
  • One coefficient per design-matrix column (matches brms display). poly(x, 2) produces two named coefficients beta_poly_x_2_1 and beta_poly_x_2_2; x*z produces three named coefficients (beta_x, beta_z, beta_x_z). The per-column data flows in as a free variable via the host-data channel (see the conditioning surface).
  • R-style transforms preloaded into the formulae evaluation namespace: log, exp, sqrt, abs, sin, cos, tan, log10, log2, log1p, expm1, asin, acos, atan, sinh, cosh, tanh. No registration required.
  • Random-effect groups (1 | g), (1 + x | g), (x | g), (0 + x | g) parse identically to brms / lme4. Multiple slopes per group emit independent random-effect terms (the lme4 (... || g) uncorrelated semantics); correlated LKJ-prior slopes are future scope.
  • Interactions x:z (elementwise product, one coefficient) and x*z (expands to x + z + x:z, three coefficients).
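As an illustration of the first bullet, the sketch below builds centred orthonormal polynomial columns with Gram-Schmidt in pure Python. It is not the library's implementation and makes no claim about how quivers or formulae compute poly; the function name `poly_sketch` is hypothetical. It assumes the input has more distinct values than the requested degree.

```python
import math


def poly_sketch(x, degree):
    """Return `degree` columns that are centred, unit-norm and mutually orthogonal."""
    n = len(x)
    mean = sum(x) / n
    centred = [xi - mean for xi in x]
    # Start from the normalised constant column so later columns sum to zero.
    basis = [[1.0 / math.sqrt(n)] * n]
    for power in range(1, degree + 1):
        col = [c ** power for c in centred]
        # Remove the components along every earlier basis column.
        for b in basis:
            dot = sum(ci * bi for ci, bi in zip(col, b))
            col = [ci - dot * bi for ci, bi in zip(col, b)]
        norm = math.sqrt(sum(ci * ci for ci in col))
        basis.append([ci / norm for ci in col])
    return basis[1:]  # the constant column is not part of the design columns


linear, quadratic = poly_sketch([1.0, 2.0, 3.0, 4.0, 5.0], 2)
```

Each returned column would receive its own coefficient, which is why poly(x, 2) yields two named coefficients rather than one.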

Family registry

fit(..., family=...) accepts a string name or a Family value. The ten brms-canonical families:

| Family | Link (inverse) | Auxiliary parameters |
|---|---|---|
| gaussian | identity | sigma ~ HalfCauchy(2.0) |
| bernoulli, binomial | logit (sigmoid) | |
| categorical | softmax | |
| poisson | log (exp) | |
| negative_binomial | log (exp) | disp ~ Gamma(2.0, 2.0) |
| gamma | log (exp) | shape ~ Gamma(2.0, 2.0) |
| beta | logit (sigmoid) | phi ~ HalfCauchy(2.0) |
| student_t | identity | nu ~ Gamma(2.0, 0.1), sigma ~ HalfCauchy(2.0) |
| cumulative | identity | |
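The inverse links named in the table are standard functions that map the linear predictor onto the parameter scale of the family. A minimal pure-Python sketch follows; the dictionary name `INVERSE_LINKS` and its keys are illustrative and are not part of the quivers API.

```python
import math


def identity(eta):
    return eta


def sigmoid(eta):
    """Inverse of logit, written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def softmax(etas):
    """Map a list of scores to probabilities that sum to one."""
    shift = max(etas)  # subtracting the max keeps exp() in range
    exps = [math.exp(e - shift) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative lookup keyed by link name.
INVERSE_LINKS = {
    "identity": identity,
    "logit": sigmoid,
    "log": math.exp,
    "softmax": softmax,
}

rate = INVERSE_LINKS["log"](0.0)           # a Poisson rate from eta = 0
probability = INVERSE_LINKS["logit"](0.0)  # a Bernoulli probability from eta = 0
```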

Custom families are pluggable: subclass Family and register your own observe kernel and link.

Prior overrides

Prior overrides are keyed by the latent's name in the emitted QVR program (which formula_to_qvr lets you inspect upfront). The prior template is a brms-style Family(arg, arg, ...) call; numeric args become floats, identifier args stay as references to other latents in the program. The full call shape lives in Fitting and Diagnostics.
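The rule that numeric arguments become floats while identifiers stay as references can be sketched with Python's ast module. This is an illustration of the rule as stated, not the parser quivers uses; the function `parse_prior_template` is hypothetical, and `tau` in the example is a made-up latent name.

```python
import ast


def parse_prior_template(template):
    """Split 'Family(arg, ...)' into (family_name, args).

    Numeric literals are converted to float; bare identifiers are kept as
    strings so they can be resolved against other latents later.
    """
    node = ast.parse(template, mode="eval").body
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError(f"expected Family(arg, ...), got {template!r}")
    if node.keywords:
        raise ValueError("keyword arguments are not handled in this sketch")
    args = []
    for arg in node.args:
        if isinstance(arg, ast.Name):
            args.append(arg.id)  # reference to another latent
            continue
        value = ast.literal_eval(arg)  # accepts literals such as 2, 0.1, -1.5
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"unsupported argument in {template!r}")
        args.append(float(value))
    return node.func.id, args


family, args = parse_prior_template("Gamma(2, 0.1)")
ref_family, ref_args = parse_prior_template("HalfCauchy(tau)")  # tau is hypothetical
```

Keeping identifiers as names rather than evaluating them is what lets one latent's prior depend on another latent in the same program.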

See also