quivers.data

Dataframe-side surface: schema inference, observation packing, and DSL composition helpers. Accepts pandas, polars, or any other Narwhals-compatible backend via the IntoDataFrame shim.

Bridges between user dataframes (pandas, polars, or any Narwhals-compatible backend) and the QVR DSL: derives object cardinalities from df[col].n_unique(), builds the per-row plate-index tensors from deterministic categorical orderings, and emits the object declarations + observations dict consumed by inference.

The dataframe library is not a hard dependency. Users install pandas, polars, or any other Narwhals-supported backend; DatasetSchema accepts whichever they hand in.

ColumnRole

Bases: str, Enum

How a dataframe column participates in a QVR program.

MissingPolicy

Bases: str, Enum

How to handle NaN / null entries when encoding a column.

DatasetSchema

Bases: Model

Mapping from dataframe columns to QVR program artefacts.

ATTRIBUTE DESCRIPTION
df

Source dataframe; pandas, polars, modin, dask, pyarrow, or anything else Narwhals' from_native accepts. Stored as an opaque field so the schema can be serialized without depending on a specific dataframe flavour.

TYPE: Any

objects

Map from column name to the QVR object name. The object's cardinality is inferred from the column's number of unique values; the canonical ordering is the sorted set of unique values, so plate indices are deterministic across reruns.

TYPE: Mapping[str, str]

observations

Map from column name to the QVR observe-site name. Categorical columns are encoded to LongTensor codes: a column that is also listed under objects uses that object's category ordering, and any other categorical column falls back to its own sorted unique non-null values. Numeric columns are encoded to FloatTensor.

TYPE: Mapping[str, str]

plate_indices

Map from column name (which must also appear under objects) to the per-row plate-index variable name. Encoded as LongTensor of category codes; one entry per row.

TYPE: Mapping[str, str]

covariates

Map from numeric column name to the QVR variable name to bind the column's values to (as a FloatTensor).

TYPE: Mapping[str, str]

missing_policy

Policy applied to every column with nulls. Default quivers.data.encoding.MissingPolicy.RAISE.

TYPE: MissingPolicy

cardinalities

cardinalities() -> Mapping[str, int]

Inferred object cardinalities, keyed by QVR object name.
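As a rough illustration of the inference described above, the following sketch counts unique non-null values per object column using plain Python lists in place of a dataframe. The `rows` dict and the `infer_cardinalities` helper are hypothetical stand-ins for illustration only; they are not part of quivers.data.

```python
# Hypothetical stand-in for a dataframe: column name -> list of row values.
rows = {
    "subject": ["s2", "s1", "s2", "s3", None],
    "item": ["b", "a", "b", "a", "a"],
}

# Column name -> QVR object name, mirroring the `objects` attribute.
objects = {"subject": "Subject", "item": "Item"}


def infer_cardinalities(rows, objects):
    """Count unique non-null values per object column, keyed by object name."""
    return {
        obj_name: len({v for v in rows[col] if v is not None})
        for col, obj_name in objects.items()
    }


cards = infer_cardinalities(rows, objects)
print(cards)
```

The result is keyed by object name rather than column name, which matches how the declaration prelude refers to objects.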

Source code in src/quivers/data/schema.py
@dx.derived
def cardinalities(self) -> Mapping[str, int]:
    """Inferred object cardinalities, keyed by QVR object name."""
    # Touch _nw_df to trigger validation even on schemas that
    # declare no object columns.
    _ = self._nw_df
    return {
        obj_name: len(self._categories[col])
        for col, obj_name in self.objects.items()
    }

categories

categories(column: str) -> tuple[str, ...]

Canonical ordering of values for an object-column.

Codes are assigned as categories.index(value); the ordering is the column's sorted unique non-null values, so the same dataframe always produces the same indices.
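The determinism claim can be checked with a small sketch that rebuilds the ordering by hand. It assumes string-valued columns held in plain lists; `canonical_categories` and `codes_for` are hypothetical helper names, not library functions.

```python
def canonical_categories(values):
    """Sorted unique non-null values, as a tuple."""
    return tuple(sorted({v for v in values if v is not None}))


def codes_for(values, categories):
    """Assign each value its position in the canonical ordering."""
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]


column = ["pear", "apple", "pear", "fig"]
cats = canonical_categories(column)
codes = codes_for(column, cats)

# Reordering the rows does not change the category ordering.
shuffled_cats = canonical_categories(["fig", "pear", "pear", "apple"])

print(cats)
print(codes)
```

Because the ordering depends only on the set of values, two runs over the same data assign the same code to the same value regardless of row order.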

Source code in src/quivers/data/schema.py
def categories(self, column: str) -> tuple[str, ...]:
    """Canonical ordering of values for an object-column.

    Codes are assigned as ``categories.index(value)``; the
    ordering is the column's sorted unique non-null values, so
    the same dataframe always produces the same indices.
    """
    if column not in self._categories:
        raise KeyError(
            f"DatasetSchema.categories: column {column!r} is not "
            f"declared as an object column"
        )
    return self._categories[column]

declarations

declarations() -> str

Emit a .qvr declaration prelude.

Lines are object <Name> : FinSet <cardinality>, sorted by name for reproducibility. Suitable for prepending to a user's .qvr source via compose.
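To make the shape of the prelude concrete, here is a minimal sketch that formats it from a plain mapping of object name to cardinality. The `declaration_prelude` helper is hypothetical and only mirrors the line format stated above.

```python
def declaration_prelude(cardinalities):
    """Format one `object <Name> : FinSet <n>` line per object, sorted by name."""
    lines = [
        f"object {name} : FinSet {n}"
        for name, n in sorted(cardinalities.items())
    ]
    # A trailing newline is added only when there is at least one line.
    return "\n".join(lines) + ("\n" if lines else "")


prelude = declaration_prelude({"Subject": 3, "Item": 2})
print(prelude, end="")
```

Sorting by object name keeps the emitted text stable even if the mapping was built in a different column order.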

Source code in src/quivers/data/schema.py
def declarations(self) -> str:
    """Emit a ``.qvr`` declaration prelude.

    Lines are ``object <Name> : FinSet <cardinality>``, sorted
    by name for reproducibility. Suitable for prepending to a
    user's ``.qvr`` source via `compose`.
    """
    sorted_objs = sorted(self.objects.items(), key=lambda kv: kv[1])
    lines = [
        f"object {obj_name} : FinSet {self.cardinalities[obj_name]}"
        for _, obj_name in sorted_objs
    ]
    return "\n".join(lines) + ("\n" if lines else "")

observations_dict

observations_dict() -> dict[str, Tensor]

Build the observations dict for inference.

Contains entries for every observation, plate-index, and covariate column. Categorical observations and plate indices use the canonical ordering returned by categories; numeric observations and covariates become FloatTensor.
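The assembly described above can be sketched without torch or a dataframe library. In this simplified version, plain lists stand in for tensors (int lists for LongTensor, float lists for FloatTensor), missing values are not handled, and every helper name is hypothetical.

```python
def sorted_unique(values):
    """Canonical ordering: sorted unique non-null values."""
    return tuple(sorted({v for v in values if v is not None}))


def encode_codes(values, categories):
    """Integer codes by position in the canonical ordering."""
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]


def build_observations(rows, objects, observations, plate_indices, covariates):
    categories = {col: sorted_unique(rows[col]) for col in objects}
    result = {}
    for col, site in observations.items():
        values = rows[col]
        if col in categories:
            # Column is also an object column: reuse its ordering.
            result[site] = encode_codes(values, categories[col])
        elif all(isinstance(v, str) for v in values):
            # String column without an object: sorted-unique fallback.
            result[site] = encode_codes(values, sorted_unique(values))
        else:
            result[site] = [float(v) for v in values]
    for col, var in plate_indices.items():
        result[var] = encode_codes(rows[col], categories[col])
    for col, var in covariates.items():
        result[var] = [float(v) for v in rows[col]]
    return result


rows = {
    "subject": ["s2", "s1", "s2"],
    "response": ["yes", "no", "yes"],
    "rt": [0.5, 0.7, 0.6],
    "age": [30, 41, 30],
}
obs = build_observations(
    rows,
    objects={"subject": "Subject"},
    observations={"response": "y", "rt": "rt_obs"},
    plate_indices={"subject": "subj_idx"},
    covariates={"age": "age"},
)
print(obs)
```

Note that the result is keyed by observe-site and variable names, not by column names.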

Source code in src/quivers/data/schema.py
def observations_dict(self) -> dict[str, torch.Tensor]:
    """Build the observations dict for inference.

    Contains entries for every observation, plate-index, and
    covariate column.  Categorical observations and plate
    indices use the canonical ordering returned by
    `categories`; numeric observations and covariates
    become ``FloatTensor``.
    """
    result: dict[str, torch.Tensor] = {}

    for col, site in self.observations.items():
        cats: tuple[str, ...] | None = None
        if col in self.objects:
            cats = self._categories[col]
        else:
            dtype = self._nw_df[col].dtype
            if dtype == nw.String:
                cats = tuple(
                    str(v)
                    for v in self._nw_df[col].drop_nulls().unique().sort().to_list()
                )
        result[site] = encode_column(
            self._nw_df,
            col,
            role=ColumnRole.OBSERVATION,
            categories=cats,
            missing_policy=self.missing_policy,
        )

    for col, var in self.plate_indices.items():
        result[var] = encode_column(
            self._nw_df,
            col,
            role=ColumnRole.PLATE_INDEX,
            categories=self._categories[col],
            missing_policy=self.missing_policy,
        )

    for col, var in self.covariates.items():
        result[var] = encode_column(
            self._nw_df,
            col,
            role=ColumnRole.COVARIATE,
            missing_policy=self.missing_policy,
        )

    return result

encode_column

encode_column(df: DataFrame, column: str, *, role: ColumnRole, categories: tuple[str, ...] | None = None, missing_policy: MissingPolicy = RAISE) -> Tensor

Encode a single column into a torch.Tensor ready for QVR inference.

PARAMETER DESCRIPTION
df

Narwhals-wrapped dataframe.

TYPE: DataFrame

column

Column to encode.

TYPE: str

role

How the column participates in the program. PLATE_INDEX and OBJECT columns require a categories tuple for reproducible code assignment.

TYPE: ColumnRole

categories

Canonical ordering of categorical values; if provided, codes are assigned by categories.index(value). Required for PLATE_INDEX and OBJECT columns, and for OBSERVATION of a non-numeric column. None is allowed for numeric OBSERVATION / COVARIATE columns.

TYPE: tuple[str, ...] or None DEFAULT: None

missing_policy

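Policy for NaN / null handling, applied only when the column contains missing values. Per the source below: RAISE raises ValueError; DROP also raises ValueError, because the caller is expected to pre-filter the dataframe; IMPUTE fills with the column mean (numeric) or the most frequent value (non-numeric); any other policy (MASK, per the source comment) leaves missing numeric values as NaN and encodes missing categorical values as -1.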

TYPE: MissingPolicy DEFAULT: RAISE

RETURNS DESCRIPTION
Tensor

LongTensor for categorical encodings, FloatTensor otherwise.
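The two non-raising ways of handling missing values can be illustrated on plain lists. This is a sketch under the assumption that nulls appear as None; `impute` and `mask_codes` are hypothetical helpers that imitate the behaviour visible in the source listing, with lists in place of tensors.

```python
from collections import Counter
from statistics import mean


def impute(values, numeric):
    """Fill None with the mean (numeric) or the most frequent value (otherwise)."""
    present = [v for v in values if v is not None]
    if numeric:
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def mask_codes(values, categories):
    """Encode categories by position; missing values become -1."""
    index = {c: i for i, c in enumerate(categories)}
    return [-1 if v is None else index[v] for v in values]


filled_numeric = impute([1.0, None, 3.0], numeric=True)
filled_labels = impute(["a", "b", None, "a"], numeric=False)
masked = mask_codes(["b", None, "a"], ("a", "b"))

print(filled_numeric, filled_labels, masked)
```

A code of -1 is outside the valid range of category indices, so downstream code has to treat it as a sentinel rather than index with it directly.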

Source code in src/quivers/data/encoding.py
def encode_column(
    df: nw.DataFrame,
    column: str,
    *,
    role: ColumnRole,
    categories: tuple[str, ...] | None = None,
    missing_policy: MissingPolicy = MissingPolicy.RAISE,
) -> torch.Tensor:
    """Encode a single column into a ``torch.Tensor`` ready for
    QVR inference.

    Parameters
    ----------
    df : nw.DataFrame
        Narwhals-wrapped dataframe.
    column : str
        Column to encode.
    role : ColumnRole
        How the column participates in the program. ``PLATE_INDEX``
        and ``OBJECT`` columns require a categories tuple for
        reproducible code assignment.
    categories : tuple[str, ...] or None
        Canonical ordering of categorical values; if provided, codes
        are assigned by ``categories.index(value)``. Required for
        ``PLATE_INDEX`` and for ``OBSERVATION`` of a non-numeric
        column. ``None`` is allowed for numeric ``OBSERVATION`` /
        ``COVARIATE`` columns.
    missing_policy : MissingPolicy
        Policy for ``NaN`` / null handling.

    Returns
    -------
    torch.Tensor
        ``LongTensor`` for categorical encodings, ``FloatTensor``
        otherwise.
    """
    series = df[column]
    dtype = series.dtype
    is_numeric = _is_numeric_dtype(dtype)
    null_count = series.is_null().sum()

    if null_count > 0:
        if missing_policy == MissingPolicy.RAISE:
            raise ValueError(
                f"column {column!r} has {null_count} missing values "
                f"but missing_policy={MissingPolicy.RAISE.value}"
            )
        if missing_policy == MissingPolicy.DROP:
            raise ValueError(
                f"column {column!r}: MissingPolicy.DROP requires the "
                f"caller to pre-filter the dataframe; this function "
                f"encodes the column as given"
            )
        if missing_policy == MissingPolicy.IMPUTE:
            if is_numeric:
                fill = series.mean()
            else:
                # Modal value: take the value with the highest count.
                counts = series.drop_nulls().value_counts(name="_count_")
                fill = counts.sort("_count_", descending=True)[column][0]
            series = series.fill_null(fill)
        # MASK falls through; NaN -> NaN for numeric, -1 code for
        # categorical (handled below).

    if role in (ColumnRole.PLATE_INDEX, ColumnRole.OBJECT):
        if categories is None:
            raise ValueError(
                f"encode_column: role={role.value} requires a "
                f"categories ordering for column {column!r}"
            )
        cat_index = {c: i for i, c in enumerate(categories)}
        values = series.to_list()
        codes = [cat_index[v] if v is not None else -1 for v in values]
        return torch.tensor(codes, dtype=torch.long)

    if role == ColumnRole.OBSERVATION and not is_numeric:
        if categories is None:
            raise ValueError(
                f"encode_column: non-numeric observation column "
                f"{column!r} requires a categories ordering"
            )
        cat_index = {c: i for i, c in enumerate(categories)}
        values = series.to_list()
        codes = [cat_index[v] if v is not None else -1 for v in values]
        return torch.tensor(codes, dtype=torch.long)

    # Numeric observation or covariate path.
    values = series.to_list()
    return torch.tensor(
        [float("nan") if v is None else float(v) for v in values],
        dtype=torch.float32,
    )

compose

compose(qvr_body: str, schema: DatasetSchema, **kwargs)

Compile a .qvr body against a dataset schema.

Prepends the schema's object declarations to qvr_body, then calls quivers.dsl.loads. The user writes only the program body (latents, kernels, observations, return); object cardinalities inferred from the dataframe are slotted in automatically. If the body re-declares an object that appears in the schema, the body's declaration wins.

PARAMETER DESCRIPTION
qvr_body

QVR source without the object declarations covered by schema.objects.

TYPE: str

schema

Dataframe schema providing cardinalities.

TYPE: DatasetSchema

**kwargs

Forwarded to quivers.dsl.loads (e.g. data=... for from_data lookups).

DEFAULT: {}
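The override rule, where a declaration in the body wins over the schema, can be sketched as plain string handling. This version stops short of calling quivers.dsl.loads and takes cardinalities as a plain dict; `declared_objects` and `with_prelude` are hypothetical names, and the body text is a placeholder rather than valid QVR.

```python
def declared_objects(qvr_body):
    """Collect names from lines of the form `object <Name> ...`."""
    names = set()
    for line in qvr_body.splitlines():
        stripped = line.strip()
        if stripped.startswith("object "):
            name = stripped[len("object "):].split(":")[0].split("=")[0]
            names.add(name.strip())
    return names


def with_prelude(qvr_body, cardinalities):
    """Prepend declarations for objects the body does not declare itself."""
    skip = declared_objects(qvr_body)
    lines = [
        f"object {name} : FinSet {n}"
        for name, n in sorted(cardinalities.items())
        if name not in skip
    ]
    prelude = "\n".join(lines)
    if prelude:
        prelude += "\n\n"
    return prelude + qvr_body


# Placeholder body: only the `object` line matters for this sketch.
body = "object Item : FinSet 10\n<program body>"
source = with_prelude(body, {"Subject": 3, "Item": 2})
print(source)
```

Here Item keeps the cardinality written in the body, and only Subject is taken from the inferred values.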

Source code in src/quivers/data/schema.py
def compose(qvr_body: str, schema: DatasetSchema, **kwargs):
    """Compile a ``.qvr`` body against a dataset schema.

    Prepends the schema's ``object`` declarations to ``qvr_body``,
    then calls `quivers.dsl.loads`.  The user writes only the
    program body (latents, kernels, observations, return); object
    cardinalities inferred from the dataframe are slotted in
    automatically.  If the body re-declares an object that appears
    in the schema, the body's declaration wins.

    Parameters
    ----------
    qvr_body : str
        QVR source without the ``object`` declarations covered by
        ``schema.objects``.
    schema : DatasetSchema
        Dataframe schema providing cardinalities.
    **kwargs
        Forwarded to `quivers.dsl.loads` (e.g. ``data=...`` for
        ``from_data`` lookups).
    """
    body_declares: set[str] = set()
    for line in qvr_body.splitlines():
        stripped = line.strip()
        if stripped.startswith("object "):
            after = stripped[len("object ") :].split(":")[0].split("=")[0]
            body_declares.add(after.strip())

    prelude_lines = []
    for _, obj_name in sorted(schema.objects.items(), key=lambda kv: kv[1]):
        if obj_name in body_declares:
            continue
        prelude_lines.append(
            f"object {obj_name} : FinSet {schema.cardinalities[obj_name]}"
        )
    prelude = "\n".join(prelude_lines)
    if prelude:
        prelude += "\n\n"
    return loads(prelude + qvr_body, **kwargs)