quivers.formulas.formula

Parsed-formula IR: Formula, FixedColumn, RandomTerm, plus the formula_from_data adapter over formulae.design_matrices.

formula

Parsed-formula IR: a typed didactic.api.Model wrapping the raw formulae.matrices.DesignMatrices so the rest of the formula frontend operates on typed values.

The Formula IR is the canonical source representation of the formula→QVR lens. Future versions can register it as a panproto protocol so the lens machinery applies; for now the compiler walks this IR directly.

Convention

Each fixed-effect term may produce one or more design-matrix columns (a single column for x, two for poly(x, 2), K for an unordered factor with K + 1 levels, etc.). R / brms assign one coefficient per column; this IR follows the same convention by exploding each term into a tuple of FixedColumn records. Multi-column terms thus produce multiple named scalar latents downstream, with deterministic naming {term}_1, {term}_2, ... that mirrors R's poly(x, 2)1 / poly(x, 2)2 display.
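The naming rule can be sketched in a few lines. `column_names` is a hypothetical helper written for illustration, not a function exported by quivers; it only encodes the convention described above.

```python
def column_names(term: str, n_columns: int) -> tuple[str, ...]:
    """Return the per-column labels for a term spanning `n_columns` columns."""
    if n_columns < 1:
        raise ValueError("a term must produce at least one column")
    if n_columns == 1:
        # Single-column terms keep the bare term name.
        return (term,)
    # Multi-column terms get a 1-indexed suffix, mirroring R's display order.
    return tuple(f"{term}_{k + 1}" for k in range(n_columns))


single = column_names("x", 1)
double = column_names("poly(x, 2)", 2)
```

Here `single` holds one label, the term itself, and `double` holds one suffixed label per column of `poly(x, 2)`.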

Polynomial default: the poly transform in formulae.design_matrices is orthogonal by default, matching R's stats::poly. Raw monomials are available via I(x^2) / I(x**2). The transforms log, exp, sqrt, abs, sin, cos, tan, log10, log2, log1p, expm1, asin, acos, atan, sinh, cosh and tanh are wired through the formulae evaluation namespace so that users coming from R get the expected base R behaviour.

FixedColumn

Bases: Model

One column of the fixed-effects design matrix.

ATTRIBUTE DESCRIPTION
term

Originating term name (e.g. "poly(x, 2)" or "x").

TYPE: str

name

Per-column label, equal to term for single-column terms and f"{term}_{k+1}" (1-indexed, matching R's display) for multi-column terms like poly(x, 2).

TYPE: str

qvr_name

QVR-legal identifier derived from name (alnum / _ only); used as the variable name in the emitted program.

TYPE: str

is_intercept

True for the constant-1 column.

TYPE: bool
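As an illustration of how qvr_name relates to name, the sketch below replaces every character outside letters, digits and underscores with an underscore. `qvr_name_sketch` is a hypothetical function; the library's own normalizer may apply further rules (for example around leading digits) that are not shown here.

```python
import re


def qvr_name_sketch(name: str) -> str:
    """Replace every character outside [0-9A-Za-z_] with an underscore."""
    return re.sub(r"[^0-9A-Za-z_]", "_", name)


plain = qvr_name_sketch("x")
poly_col = qvr_name_sketch("poly(x, 2)_1")
```

Under this assumed rule a single-column name such as `x` is unchanged, while the parentheses, comma and space in `poly(x, 2)_1` each become an underscore.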

RandomTerm

Bases: Model

One random-effect group, e.g. (1 | g) or (x | g).

ATTRIBUTE DESCRIPTION
slope

"Intercept" for (1 | g); otherwise the slope variable name.

TYPE: str

group

Grouping factor name.

TYPE: str
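The mapping from a random-effect term name to the slope and group fields follows the normalization visible in the formula_from_data source: split on the first `|`, strip whitespace, and rename a slope of `1` to `"Intercept"`. The sketch below isolates that step; `split_random_term` is a hypothetical name used only for illustration.

```python
def split_random_term(term_name: str) -> tuple[str, str]:
    """Split a `slope|group` term name into its (slope, group) pair."""
    if "|" not in term_name:
        raise ValueError(f"expected `slope | group` syntax, got {term_name!r}")
    # Split on the first bar only, then trim surrounding whitespace.
    slope, group = (part.strip() for part in term_name.split("|", 1))
    if slope == "1":
        # A random intercept is stored under a readable label.
        slope = "Intercept"
    return slope, group


intercept_term = split_random_term("1|g")
slope_term = split_random_term("x | subject")
```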

Formula

Bases: Model

A parsed regression formula plus the data it was parsed against.

ATTRIBUTE DESCRIPTION
formula

Original formula string.

TYPE: str

response_name

Name of the response column.

TYPE: str

fixed_columns

One entry per design-matrix column (matches R/brms's one-coefficient-per-column convention).

TYPE: tuple[FixedColumn, ...]

random_terms

Random-effect group specifications.

TYPE: tuple[RandomTerm, ...]

response_values

Response column values, shape (N,).

TYPE: ndarray

group_levels

Canonical level ordering per grouping factor, used to derive deterministic plate-index tensors.

TYPE: Mapping[str, tuple[str, ...]]

group_indices

Per-group integer index array, shape (N,).

TYPE: Mapping[str, tuple[int, ...]]
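The relationship between group_levels and group_indices can be shown on a plain list. This is a minimal sketch assuming a grouping column with no missing values; `encode_groups` is a hypothetical helper, not part of the library.

```python
def encode_groups(values):
    """Return (levels, codes) for one grouping column."""
    # Canonical ordering: sorted unique values, stored as strings.
    levels = tuple(str(v) for v in sorted(set(values)))
    index = {level: i for i, level in enumerate(levels)}
    # One integer code per row, pointing into `levels`.
    codes = tuple(index[str(v)] for v in values)
    return levels, codes


levels, codes = encode_groups(["b", "a", "c", "a"])
```

Because the level ordering depends only on the set of values, the same data always yields the same codes regardless of row order, which is what makes the plate indices deterministic.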

FormulaData

Bases: Model

The complement of a Formula under the quivers.formulas.compile.FormulaToQVRModule lens.

The emitted QVR quivers.dsl.ast_nodes.Module carries the structural skeleton of the formula (which columns there are, keyed by their QVR-legal identifier; whether each is an intercept; the random-effect group / slope pairs; the family; the response identifier in its QVR-legal form). It does not carry:

  • the per-row data arrays (those flow through the host-data channel at fit time);
  • the per-column / per-group / response original names (the lens uses _qvr_name to normalize identifiers, which replaces non-alphanumeric characters with underscores and is therefore lossy);
  • the per-column term label (presentation only; it is not part of the lens's forward output);
  • the original formula string (presentation: the lens emits a canonical AST that does not record user whitespace or operator-precedence choices).

Those fields travel in the complement. backward(module, complement) decodes the structural fields from the Module and fuses them with this carrier to reproduce the original Formula verbatim.
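The role of the complement can be illustrated with a toy lens over names alone. The `forward` and `backward` functions below are hypothetical and far simpler than the real lens; they assume an underscore-replacing normalizer and no collisions between normalized names.

```python
import re


def forward(names):
    """Normalize names; return the lossy view plus the complement."""
    view = tuple(re.sub(r"[^0-9A-Za-z_]", "_", n) for n in names)
    # The complement remembers what normalization threw away.
    complement = dict(zip(view, names))
    return view, complement


def backward(view, complement):
    """Recover the original names from the view and the complement."""
    return tuple(complement[v] for v in view)


original = ("log(x)", "group.id")
view, complement = forward(original)
restored = backward(view, complement)
```

The view alone cannot distinguish `group.id` from `group id`, since both normalize to the same identifier; carrying the original names in the complement is what lets the round trip restore them exactly.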

ATTRIBUTE DESCRIPTION
formula

Original formula string.

TYPE: str

response_name

Original (pre-_qvr_name) response column name.

TYPE: str

response_values

Response column values, shape (N,).

TYPE: ndarray

fixed_column_names

Per-column (term, name) keyed by FixedColumn.qvr_name. Lets the decoder recover FixedColumn.term and FixedColumn.name from the qvr-name surfaced in the Module's latent declarations.

TYPE: Mapping[str, tuple[str, str]]

fixed_column_data

Per-row predictor values, keyed by FixedColumn.qvr_name.

TYPE: Mapping[str, ndarray]

group_original_names

Per-group qvr_name → original group name.

TYPE: Mapping[str, str]

group_levels

Canonical per-group level ordering. Needed to populate Formula.group_levels from the integer-coded object G : K declarations the Module records.

TYPE: Mapping[str, tuple[str, ...]]

group_indices

Per-row integer codes for each grouping factor.

TYPE: Mapping[str, tuple[int, ...]]

formula_from_data

formula_from_data(formula: str, data: IntoDataFrame, *, extra_namespace: Mapping[str, object] | None = None) -> Formula

Build a typed Formula IR by lifting formulae.design_matrices over a dataframe.

This is an adapter, not a parser: the brms-style formula syntax is parsed by the formulae library; we lift its formulae.matrices.DesignMatrices result into a typed didactic record, augmented with deterministic per-group level orderings and integer-code arrays derived from the dataframe.

The R-style numeric transforms (log, exp, sqrt, abs, sin, cos, tan, log10, log2, log1p, expm1, asin, acos, atan, sinh, cosh, tanh) are pre-loaded into the formulae evaluation namespace so users coming from R / brms get the expected base R behaviour without explicit registration. Polynomial terms via poly(x, k) are orthogonal by default, matching R's stats::poly.

PARAMETER DESCRIPTION
formula

Formula string in brms / lme4 syntax.

TYPE: str

data

Pandas, polars, or any other Narwhals-compatible dataframe.

TYPE: IntoDataFrame

extra_namespace

Additional names visible inside the formula's expression evaluation, merged on top of the R-style transforms.

TYPE: Mapping[str, object] DEFAULT: None
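The phrase "merged on top of" means that entries in extra_namespace take precedence over the built-in transforms when names clash. The sketch below reproduces that merge order with a small stand-in table; `BASE_TRANSFORMS` and `build_namespace` are hypothetical names, and the real table contains more entries.

```python
import math

# Stand-in for the module's built-in transform table (assumed contents).
BASE_TRANSFORMS = {"log": math.log, "sqrt": math.sqrt}


def build_namespace(extra_namespace=None):
    """Copy the defaults, then let user-supplied names override them."""
    namespace = dict(BASE_TRANSFORMS)  # copy so the defaults are never mutated
    if extra_namespace:
        namespace.update(extra_namespace)  # user entries win on clashes
    return namespace


def center(value):
    return value - 1.0


ns = build_namespace({"log": math.log10, "center": center})
```

Here `log` resolves to the user-supplied base-10 logarithm, `sqrt` keeps its default, and `center` becomes available as a new name; the default table itself is left untouched.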

Source code in src/quivers/formulas/formula.py
def formula_from_data(
    formula: str,
    data: IntoDataFrame,
    *,
    extra_namespace: Mapping[str, object] | None = None,
) -> Formula:
    """Build a typed `Formula` IR by lifting
    `formulae.design_matrices` over a dataframe.

    This is an adapter, not a parser: the brms-style formula syntax
    is parsed by the [`formulae`](https://bambinos.github.io/formulae/)
    library; we lift its `formulae.matrices.DesignMatrices`
    result into a typed didactic record, augmented with deterministic
    per-group level orderings and integer-code arrays derived from
    the dataframe.

    The R-style numeric transforms (``log``, ``exp``, ``sqrt``,
    ``abs``, ``sin``, ``cos``, ``tan``, ``log10``, ``log2``,
    ``log1p``, ``expm1``, ``asin``, ``acos``, ``atan``, ``sinh``,
    ``cosh``, ``tanh``) are pre-loaded into the formulae evaluation
    namespace so users coming from R / brms get the expected base
    R behaviour without explicit registration.  Polynomial terms via
    ``poly(x, k)`` are orthogonal by default, matching R's
    ``stats::poly``.

    Parameters
    ----------
    formula : str
        Formula string in brms / lme4 syntax.
    data : IntoDataFrame
        Pandas, polars, or any other Narwhals-compatible dataframe.
    extra_namespace : Mapping[str, object], optional
        Additional names visible inside the formula's expression
        evaluation, merged on top of the R-style transforms.
    """
    nw_df = nw.from_native(data, eager_only=True)
    pandas_df = nw_df.to_pandas()
    namespace: dict[str, object] = dict(_R_TRANSFORMS)
    if extra_namespace:
        namespace.update(extra_namespace)
    dm = fo.design_matrices(formula, data=pandas_df, extra_namespace=namespace)
    if dm.response is None:
        raise ValueError(
            f"formula_from_data: formula {formula!r} has no response "
            f"variable on the left of `~`"
        )
    response_name = dm.response.name
    n_obs = int(pandas_df.shape[0])

    fixed_columns: list[FixedColumn] = []
    if dm.common is not None:
        for term_name, term in dm.common.terms.items():
            fixed_columns.extend(_explode_term(term_name, term, n_obs))

    random_terms: list[RandomTerm] = []
    group_levels: dict[str, tuple[str, ...]] = {}
    group_indices: dict[str, tuple[int, ...]] = {}
    if dm.group is not None:
        for term_name in dm.group.terms.keys():
            if "|" not in term_name:
                raise ValueError(
                    f"formula_from_data: unexpected random term name "
                    f"{term_name!r}; expected `(slope | group)` syntax"
                )
            slope, group = term_name.split("|", 1)
            slope = slope.strip()
            group = group.strip()
            if slope == "1":
                slope = "Intercept"
            random_terms.append(RandomTerm(slope=slope, group=group))
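            if group not in group_levels:
                # Canonical ordering: sorted unique non-null values, stringified.
                levels = tuple(
                    str(v) for v in nw_df[group].drop_nulls().unique().sort().to_list()
                )
                group_levels[group] = levels
                level_index = {v: i for i, v in enumerate(levels)}
                # One integer code per row. This lookup assumes the grouping
                # column has no nulls, since nulls are excluded from `levels`.
                codes = tuple(level_index[str(v)] for v in nw_df[group].to_list())
                group_indices[group] = codes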

    response_values = (
        np.asarray(dm.response.design_matrix).reshape(-1).astype(np.float64)
    )

    return Formula(
        formula=formula,
        response_name=response_name,
        fixed_columns=tuple(fixed_columns),
        random_terms=tuple(random_terms),
        response_values=response_values,
        group_levels=group_levels,
        group_indices=group_indices,
    )