`quivers.data`

Dataframe-side surface: schema inference, observation packing, and DSL composition helpers. Accepts pandas, polars, or any other Narwhals-compatible backend via the IntoDataFrame shim.

data

Bridges user dataframes (pandas, polars, or any Narwhals-compatible backend) and the QVR DSL. It derives object cardinalities from `df[col].n_unique()`, builds the per-row plate-index tensors from deterministic categorical orderings, and emits the object declarations and the observations dict consumed by inference.

The dataframe library is not a hard dependency. Users install pandas, polars, or any other Narwhals-supported backend; DatasetSchema accepts whichever they hand in.

ColumnRole

Bases: str, Enum

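How a dataframe column participates in a QVR program. The source on this page references four members: `OBJECT`, `PLATE_INDEX`, `OBSERVATION`, and `COVARIATE`.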

MissingPolicy

Bases: str, Enum

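How to handle NaN / null entries when encoding a column. As the `encode_column` source on this page shows, `RAISE` raises a `ValueError` when a column contains nulls; `DROP` also raises at encode time, because the caller is expected to pre-filter the dataframe; `IMPUTE` fills nulls with the column mean (numeric) or the modal value (non-numeric); and `MASK` leaves nulls in place, encoding them as `NaN` for numeric columns and `-1` for categorical codes.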

DatasetSchema

Bases: Model

Mapping from dataframe columns to QVR program artefacts.

| ATTRIBUTE | TYPE | DESCRIPTION |
| --- | --- | --- |
| `df` | `Any` | Source dataframe; pandas, polars, modin, dask, pyarrow, or anything else Narwhals' `from_native` accepts. Stored as an opaque field so the schema can be serialized without depending on a specific dataframe flavour. |
| `objects` | `Mapping[str, str]` | Map from column name to the QVR object name. The object's cardinality is inferred from the column's number of unique values; the canonical ordering is the sorted set of unique values, so plate indices are deterministic across reruns. |
| `observations` | `Mapping[str, str]` | Map from column name to the QVR observe-site name. Categorical columns are encoded to `LongTensor` codes (using either their own object's category ordering, when the column is also listed under `objects`, or a sorted-unique fallback); numeric columns to `FloatTensor`. |
| `plate_indices` | `Mapping[str, str]` | Map from column name (which must also appear under `objects`) to the per-row plate-index variable name. Encoded as `LongTensor` of category codes; one entry per row. |
| `covariates` | `Mapping[str, str]` | Map from numeric column name to the QVR variable name to bind the column's values to (as a `FloatTensor`). |
| `missing_policy` | `MissingPolicy` | Policy applied to every column with nulls. Default `quivers.data.encoding.MissingPolicy.RAISE`. |

cardinalities

cardinalities() -> Mapping[str, int]

Inferred object cardinalities, keyed by QVR object name.

Source code in src/quivers/data/schema.py

@dx.derived
def cardinalities(self) -> Mapping[str, int]:
    """Inferred object cardinalities, keyed by QVR object name."""
    # Touch _nw_df to trigger validation even on schemas that
    # declare no object columns.
    _ = self._nw_df
    return {
        obj_name: len(self._categories[col])
        for col, obj_name in self.objects.items()
    }
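
The counting step can be illustrated without the library. The following is a minimal sketch, assuming plain Python lists stand in for dataframe columns; `sketch_cardinalities` is a hypothetical helper written for this example, not part of `quivers.data`.

```python
# Toy columnar data; plain lists stand in for dataframe columns (an assumption
# of this sketch, the real schema works on a Narwhals-wrapped dataframe).
table = {
    "subject": ["s2", "s1", "s2", "s3"],
    "item": ["b", "a", "b", "a"],
}

# Column name -> QVR object name, mirroring the `objects` attribute.
objects = {"subject": "Subject", "item": "Item"}


def sketch_cardinalities(table, objects):
    """Count distinct non-null values per object column, keyed by object name."""
    return {
        obj_name: len({v for v in table[col] if v is not None})
        for col, obj_name in objects.items()
    }


card = sketch_cardinalities(table, objects)
print(card)
```

Note that the result is keyed by the object name (`Subject`, `Item`), not by the column name.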

categories

categories(column: str) -> tuple[str, ...]

Canonical ordering of values for an object-column.

Codes are assigned as `categories.index(value)`; the ordering is the column's sorted unique non-null values, so the same dataframe always produces the same indices.

Source code in src/quivers/data/schema.py

def categories(self, column: str) -> tuple[str, ...]:
    """Canonical ordering of values for an object-column.

    Codes are assigned as ``categories.index(value)``; the
    ordering is the column's sorted unique non-null values, so
    the same dataframe always produces the same indices.
    """
    if column not in self._categories:
        raise KeyError(
            f"DatasetSchema.categories: column {column!r} is not "
            f"declared as an object column"
        )
    return self._categories[column]
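
To see why a sorted-unique ordering gives reproducible codes, here is a small standalone sketch. It assumes string-valued data held in a plain list; `sketch_categories` and `sketch_codes` are hypothetical names used only for illustration, and a dict lookup replaces `categories.index(value)` with the same result.

```python
import random


def sketch_categories(values):
    """Sorted unique non-null values, as a tuple."""
    return tuple(sorted({str(v) for v in values if v is not None}))


def sketch_codes(values, categories):
    """Map each value to its position in the canonical ordering."""
    index = {c: i for i, c in enumerate(categories)}
    return [index[str(v)] for v in values]


rows = ["pear", "apple", "fig", "apple", "pear"]
cats = sketch_categories(rows)
codes = sketch_codes(rows, cats)

# Reordering the rows does not change the canonical ordering,
# so a given value always receives the same code.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
same_ordering = sketch_categories(shuffled) == cats
```

Because the ordering depends only on the set of values, not on row order, the code for `"fig"` is the same in both the original and the shuffled rows.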

declarations

declarations() -> str

Emit a `.qvr` declaration prelude.

Lines are `object <Name> : FinSet <cardinality>`, sorted by name for reproducibility. Suitable for prepending to a user's `.qvr` source via `compose`.

Source code in src/quivers/data/schema.py

def declarations(self) -> str:
    """Emit a ``.qvr`` declaration prelude.

    Lines are ``object <Name> : FinSet <cardinality>``, sorted
    by name for reproducibility. Suitable for prepending to a
    user's ``.qvr`` source via `compose`.
    """
    sorted_objs = sorted(self.objects.items(), key=lambda kv: kv[1])
    lines = [
        f"object {obj_name} : FinSet {self.cardinalities[obj_name]}"
        for _, obj_name in sorted_objs
    ]
    return "\n".join(lines) + ("\n" if lines else "")

observations_dict

observations_dict() -> dict[str, Tensor]

Build the observations dict for inference.

Contains entries for every observation, plate-index, and covariate column. Categorical observations and plate indices use the canonical ordering returned by `categories`; numeric observations and covariates become `FloatTensor`.

Source code in src/quivers/data/schema.py

def observations_dict(self) -> dict[str, torch.Tensor]:
    """Build the observations dict for inference.

    Contains entries for every observation, plate-index, and
    covariate column.  Categorical observations and plate
    indices use the canonical ordering returned by
    `categories`; numeric observations and covariates
    become ``FloatTensor``.
    """
    result: dict[str, torch.Tensor] = {}

    for col, site in self.observations.items():
        cats: tuple[str, ...] | None = None
        if col in self.objects:
            cats = self._categories[col]
        else:
            dtype = self._nw_df[col].dtype
            if dtype == nw.String:
                cats = tuple(
                    str(v)
                    for v in self._nw_df[col].drop_nulls().unique().sort().to_list()
                )
        result[site] = encode_column(
            self._nw_df,
            col,
            role=ColumnRole.OBSERVATION,
            categories=cats,
            missing_policy=self.missing_policy,
        )

    for col, var in self.plate_indices.items():
        result[var] = encode_column(
            self._nw_df,
            col,
            role=ColumnRole.PLATE_INDEX,
            categories=self._categories[col],
            missing_policy=self.missing_policy,
        )

    for col, var in self.covariates.items():
        result[var] = encode_column(
            self._nw_df,
            col,
            role=ColumnRole.COVARIATE,
            missing_policy=self.missing_policy,
        )

    return result
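
The layout of the resulting dict may be easier to see on a toy table. The sketch below assumes plain lists in place of tensors and no missing values; `sketch_observations` is a hypothetical helper, and the column and site names are made up for the example.

```python
table = {
    "subject": ["s2", "s1", "s2"],
    "rt": [0.41, 0.52, 0.38],
    "age": [31, 27, 31],
}
objects = {"subject": "Subject"}
observations = {"rt": "y"}
plate_indices = {"subject": "subj_idx"}
covariates = {"age": "x_age"}


def sketch_observations(table, objects, observations, plate_indices, covariates):
    """Return site/variable name -> encoded values (lists stand in for tensors)."""
    cats = {col: tuple(sorted(set(table[col]))) for col in objects}
    out = {}
    for col, site in observations.items():
        if col in cats:  # categorical observation sharing an object's ordering
            out[site] = [cats[col].index(v) for v in table[col]]
        else:  # numeric observation
            out[site] = [float(v) for v in table[col]]
    for col, var in plate_indices.items():
        out[var] = [cats[col].index(v) for v in table[col]]
    for col, var in covariates.items():
        out[var] = [float(v) for v in table[col]]
    return out


obs = sketch_observations(table, objects, observations, plate_indices, covariates)
```

The keys are the observe-site and variable names (`y`, `subj_idx`, `x_age`), not the dataframe column names, which is what lets the dict line up with names used in the QVR program.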

encode_column

encode_column(df: DataFrame, column: str, *, role: ColumnRole, categories: tuple[str, ...] | None = None, missing_policy: MissingPolicy = RAISE) -> Tensor

Encode a single column into a `torch.Tensor` ready for QVR inference.

| PARAMETER | TYPE | DEFAULT | DESCRIPTION |
| --- | --- | --- | --- |
| `df` | `DataFrame` | | Narwhals-wrapped dataframe. |
| `column` | `str` | | Column to encode. |
| `role` | `ColumnRole` | | How the column participates in the program. `PLATE_INDEX` and `OBJECT` columns require a categories tuple for reproducible code assignment. |
| `categories` | `tuple[str, ...] or None` | `None` | Canonical ordering of categorical values; if provided, codes are assigned by `categories.index(value)`. Required for `PLATE_INDEX` and for `OBSERVATION` of a non-numeric column. `None` is allowed for numeric `OBSERVATION` / `COVARIATE` columns. |
| `missing_policy` | `MissingPolicy` | `RAISE` | Policy for `NaN` / null handling. |

| RETURNS | DESCRIPTION |
| --- | --- |
| `Tensor` | `LongTensor` for categorical encodings, `FloatTensor` otherwise. |

Source code in src/quivers/data/encoding.py

def encode_column(
    df: nw.DataFrame,
    column: str,
    *,
    role: ColumnRole,
    categories: tuple[str, ...] | None = None,
    missing_policy: MissingPolicy = MissingPolicy.RAISE,
) -> torch.Tensor:
    """Encode a single column into a ``torch.Tensor`` ready for
    QVR inference.

    Parameters
    ----------
    df : nw.DataFrame
        Narwhals-wrapped dataframe.
    column : str
        Column to encode.
    role : ColumnRole
        How the column participates in the program. ``PLATE_INDEX``
        and ``OBJECT`` columns require a categories tuple for
        reproducible code assignment.
    categories : tuple[str, ...] or None
        Canonical ordering of categorical values; if provided, codes
        are assigned by ``categories.index(value)``. Required for
        ``PLATE_INDEX`` and for ``OBSERVATION`` of a non-numeric
        column. ``None`` is allowed for numeric ``OBSERVATION`` /
        ``COVARIATE`` columns.
    missing_policy : MissingPolicy
        Policy for ``NaN`` / null handling.

    Returns
    -------
    torch.Tensor
        ``LongTensor`` for categorical encodings, ``FloatTensor``
        otherwise.
    """
    series = df[column]
    dtype = series.dtype
    is_numeric = _is_numeric_dtype(dtype)
    null_count = series.is_null().sum()

    if null_count > 0:
        if missing_policy == MissingPolicy.RAISE:
            raise ValueError(
                f"column {column!r} has {null_count} missing values "
                f"but missing_policy={MissingPolicy.RAISE.value}"
            )
        if missing_policy == MissingPolicy.DROP:
            raise ValueError(
                f"column {column!r}: MissingPolicy.DROP requires the "
                f"caller to pre-filter the dataframe; this function "
                f"encodes the column as given"
            )
        if missing_policy == MissingPolicy.IMPUTE:
            if is_numeric:
                fill = series.mean()
            else:
                # Modal value: take the value with the highest count.
                counts = series.drop_nulls().value_counts(name="_count_")
                fill = counts.sort("_count_", descending=True)[column][0]
            series = series.fill_null(fill)
        # MASK falls through; NaN -> NaN for numeric, -1 code for
        # categorical (handled below).

    if role in (ColumnRole.PLATE_INDEX, ColumnRole.OBJECT):
        if categories is None:
            raise ValueError(
                f"encode_column: role={role.value} requires a "
                f"categories ordering for column {column!r}"
            )
        cat_index = {c: i for i, c in enumerate(categories)}
        values = series.to_list()
        codes = [cat_index[v] if v is not None else -1 for v in values]
        return torch.tensor(codes, dtype=torch.long)

    if role == ColumnRole.OBSERVATION and not is_numeric:
        if categories is None:
            raise ValueError(
                f"encode_column: non-numeric observation column "
                f"{column!r} requires a categories ordering"
            )
        cat_index = {c: i for i, c in enumerate(categories)}
        values = series.to_list()
        codes = [cat_index[v] if v is not None else -1 for v in values]
        return torch.tensor(codes, dtype=torch.long)

    # Numeric observation or covariate path.
    values = series.to_list()
    return torch.tensor(
        [float("nan") if v is None else float(v) for v in values],
        dtype=torch.float32,
    )
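
The two policies that do not raise, `IMPUTE` and `MASK`, can be sketched on plain lists. This is an illustration only: the `sketch_*` helpers are hypothetical, `None` stands in for a null, and lists stand in for tensors. Tie-breaking between equally frequent values in the modal fill is whatever `Counter.most_common` gives here and may differ from the library's behaviour.

```python
import math
from collections import Counter
from statistics import mean


def sketch_impute(values):
    """Fill nulls with the mean (numeric) or the most frequent value (otherwise)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def sketch_mask_codes(values, categories):
    """Categorical MASK: nulls become the sentinel code -1."""
    index = {c: i for i, c in enumerate(categories)}
    return [-1 if v is None else index[v] for v in values]


def sketch_mask_floats(values):
    """Numeric MASK: nulls become NaN."""
    return [float("nan") if v is None else float(v) for v in values]


numeric = sketch_impute([1.0, None, 3.0])
labels = sketch_impute(["a", None, "b", "a"])
codes = sketch_mask_codes(["a", None, "b"], ("a", "b"))
floats = sketch_mask_floats([0.5, None])
```

Downstream code that consumes `MASK`-encoded columns needs to treat `-1` and `NaN` as missing rather than as valid indices or values.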

compose

compose(qvr_body: str, schema: DatasetSchema, **kwargs)

Compile a `.qvr` body against a dataset schema.

Prepends the schema's `object` declarations to `qvr_body`, then calls `quivers.dsl.loads`. The user writes only the program body (latents, kernels, observations, return); object cardinalities inferred from the dataframe are slotted in automatically. If the body re-declares an object that appears in the schema, the body's declaration wins.

| PARAMETER | TYPE | DEFAULT | DESCRIPTION |
| --- | --- | --- | --- |
| `qvr_body` | `str` | | QVR source without the `object` declarations covered by `schema.objects`. |
| `schema` | `DatasetSchema` | | Dataframe schema providing cardinalities. |
| `**kwargs` | | `{}` | Forwarded to `quivers.dsl.loads` (e.g. `data=...` for `from_data` lookups). |

Source code in src/quivers/data/schema.py

def compose(qvr_body: str, schema: DatasetSchema, **kwargs):
    """Compile a ``.qvr`` body against a dataset schema.

    Prepends the schema's ``object`` declarations to ``qvr_body``,
    then calls `quivers.dsl.loads`.  The user writes only the
    program body (latents, kernels, observations, return); object
    cardinalities inferred from the dataframe are slotted in
    automatically.  If the body re-declares an object that appears
    in the schema, the body's declaration wins.

    Parameters
    ----------
    qvr_body : str
        QVR source without the ``object`` declarations covered by
        ``schema.objects``.
    schema : DatasetSchema
        Dataframe schema providing cardinalities.
    **kwargs
        Forwarded to `quivers.dsl.loads` (e.g. ``data=...`` for
        ``from_data`` lookups).
    """
    body_declares: set[str] = set()
    for line in qvr_body.splitlines():
        stripped = line.strip()
        if stripped.startswith("object "):
            after = stripped[len("object ") :].split(":")[0].split("=")[0]
            body_declares.add(after.strip())

    prelude_lines = []
    for _, obj_name in sorted(schema.objects.items(), key=lambda kv: kv[1]):
        if obj_name in body_declares:
            continue
        prelude_lines.append(
            f"object {obj_name} : FinSet {schema.cardinalities[obj_name]}"
        )
    prelude = "\n".join(prelude_lines)
    if prelude:
        prelude += "\n\n"
    return loads(prelude + qvr_body, **kwargs)
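
The "body's declaration wins" rule comes down to a line scan before the prelude is built. The sketch below reproduces that scan with plain dicts and stops short of calling `loads`; `sketch_body_declares` and `sketch_prelude` are hypothetical names, and the body here contains only an `object` line, whereas a real program body would contain more.

```python
def sketch_body_declares(qvr_body):
    """Collect object names declared on lines starting with `object `."""
    names = set()
    for line in qvr_body.splitlines():
        stripped = line.strip()
        if stripped.startswith("object "):
            after = stripped[len("object "):].split(":")[0].split("=")[0]
            names.add(after.strip())
    return names


def sketch_prelude(qvr_body, objects, cardinalities):
    """Prepend schema declarations, skipping names the body already declares."""
    declared = sketch_body_declares(qvr_body)
    lines = [
        f"object {name} : FinSet {cardinalities[name]}"
        for name in sorted(objects.values())
        if name not in declared
    ]
    prelude = "\n".join(lines)
    return (prelude + "\n\n" if prelude else "") + qvr_body


body = "object Item : FinSet 5\n"
source = sketch_prelude(
    body,
    {"subject": "Subject", "item": "Item"},
    {"Subject": 3, "Item": 2},
)
print(source, end="")
```

Here the schema would have declared `Item` with cardinality 2, but the body's `FinSet 5` is kept and only `Subject` is prepended. The scan is purely textual, so it only recognises declarations that start a line with `object `.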
