bead.data

Core data models and utilities used throughout the bead pipeline.

Base Models

base

Root didactic model for all bead objects.

Provides BeadBaseModel, the root didactic.api.Model that every bead data model inherits from. Supplies UUIDv7 identity, UTC creation and modification timestamps, schema versioning, and a metadata dictionary.

Models are frozen; updates produce new instances through with_ or the convenience method touched (which refreshes modified_at).

BeadBaseModel

Bases: Model

Root didactic model for all bead objects.

Attributes:

Name Type Description
id UUID

UUIDv7 generated at construction time.

created_at datetime

UTC timestamp of construction.

modified_at datetime

UTC timestamp of the last touched call (defaults to created_at).

version str

Schema version string.

metadata dict[str, JsonValue]

Free-form key-value annotations.

touched() -> Self

Return a copy with modified_at set to the current UTC time.
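The frozen-model pattern described above can be sketched with a standard-library frozen dataclass. This is only an illustration of the idea, not the didactic API: FrozenRecord, its placeholder version value, and the use of uuid4 in place of UUIDv7 are all assumptions made so the example runs on its own.

```python
from __future__ import annotations

import dataclasses
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import UUID, uuid4


def _utc_now() -> datetime:
    # Timezone-aware "now" in UTC.
    return datetime.now(timezone.utc)


@dataclass(frozen=True)
class FrozenRecord:
    """Hypothetical stand-in for a frozen base model (not BeadBaseModel)."""

    # uuid4 is used only because it ships with every Python version;
    # the documented model uses UUIDv7.
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=_utc_now)
    modified_at: datetime | None = None
    version: str = "1.0.0"  # placeholder value
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # modified_at defaults to created_at. A frozen dataclass needs
        # object.__setattr__ for this one-time initialisation.
        if self.modified_at is None:
            object.__setattr__(self, "modified_at", self.created_at)

    def touched(self) -> FrozenRecord:
        # Return a copy with a fresh modified_at; self is left unchanged.
        return dataclasses.replace(self, modified_at=_utc_now())


record = FrozenRecord()
updated = record.touched()
```

Here updated keeps the id and created_at of record, while record itself still reports its original modified_at.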

Identifiers and Timestamps

identifiers

UUIDv7 generation and utilities for the bead package.

This module provides functions for generating time-ordered UUIDv7 identifiers, extracting timestamps from them, and validating UUID versions.

generate_uuid() -> UUID

Generate a time-ordered UUIDv7.

UUIDv7 is a time-ordered UUID format that embeds a millisecond Unix timestamp in the first 48 bits, making UUIDs sortable by creation time at millisecond resolution. This is useful for keeping database records in chronological order.

Returns:

Type Description
UUID

A newly generated UUIDv7 with embedded timestamp

Examples:

>>> uuid1 = generate_uuid()
>>> uuid2 = generate_uuid()
>>> uuid1 < uuid2  # uuids are time-ordered
True
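To make the bit layout concrete, here is a minimal sketch of how a UUIDv7 value can be assembled by hand. It is not the implementation behind generate_uuid; uuid7_sketch and its now_ms parameter are hypothetical names, and the sketch does nothing to keep values generated within the same millisecond in order.

```python
import os
import time
import uuid
from typing import Optional


def uuid7_sketch(now_ms: Optional[int] = None) -> uuid.UUID:
    """Assemble a UUIDv7-shaped value (illustrative only)."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = (rand >> 68) & 0xFFF                 # 12 bits after the version
    rand_b = rand & ((1 << 62) - 1)               # 62 bits after the variant

    value = (now_ms & ((1 << 48) - 1)) << 80      # bits 0-47: timestamp
    value |= 0x7 << 76                            # bits 48-51: version 7
    value |= rand_a << 64                         # bits 52-63: random
    value |= 0b10 << 62                           # bits 64-65: variant
    value |= rand_b                               # bits 66-127: random
    return uuid.UUID(int=value)


# Fixed timestamps one millisecond apart make the ordering deterministic.
first = uuid7_sketch(now_ms=1_700_000_000_000)
second = uuid7_sketch(now_ms=1_700_000_000_001)
```

Because the timestamp occupies the most significant bits, first compares lower than second regardless of the random bits.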

extract_timestamp(uuid: UUID) -> int

Extract timestamp in milliseconds from a UUIDv7.

The timestamp is stored in the first 48 bits of the UUID and represents milliseconds since the Unix epoch (January 1, 1970 00:00:00 UTC).

Parameters:

Name Type Description Default
uuid UUID

The UUIDv7 to extract timestamp from.

required

Returns:

Type Description
int

Timestamp in milliseconds since Unix epoch

Examples:

>>> import time
>>> uuid = generate_uuid()
>>> timestamp = extract_timestamp(uuid)
>>> current_time = int(time.time() * 1000)
>>> abs(timestamp - current_time) < 1000  # within 1 second
True
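Reading the timestamp back amounts to a single bit shift. The sketch below assumes nothing beyond the layout described above; extract_timestamp_sketch is a hypothetical name and the sample UUID is just a fixed version 7 value.

```python
from datetime import datetime, timezone
from uuid import UUID


def extract_timestamp_sketch(value: UUID) -> int:
    # A UUID is a 128-bit integer; dropping the low 80 bits leaves the
    # leading 48-bit millisecond timestamp.
    return value.int >> 80


# Sample UUIDv7 whose first 48 bits are 0x017F22E279B0.
example = UUID("017f22e2-79b0-7cc3-98c4-dc0c0c07398f")
millis = extract_timestamp_sketch(example)

# Convert milliseconds to an aware datetime for display.
created = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
```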

is_valid_uuid7(uuid: UUID) -> bool

Check if a UUID is a valid UUIDv7.

Validates that the UUID has version 7 by checking the version bits (bits 48-51, counting from the most significant bit), which should be 0111 (7).

Parameters:

Name Type Description Default
uuid UUID

The UUID to validate.

required

Returns:

Type Description
bool

True if the UUID is version 7, False otherwise

Examples:

>>> uuid7 = generate_uuid()
>>> is_valid_uuid7(uuid7)
True
>>> from uuid import uuid4
>>> uuid4_val = uuid4()
>>> is_valid_uuid7(uuid4_val)
False
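The version check described above can be sketched as a shift and a mask. is_uuid7_sketch is a hypothetical name; the standard library's UUID.version attribute reports the same four bits for RFC-variant UUIDs.

```python
from uuid import UUID, uuid4


def is_uuid7_sketch(value: UUID) -> bool:
    # Move the four version bits (48-51 from the top) to the bottom of
    # the integer, mask them off, and compare with 7 (binary 0111).
    return ((value.int >> 76) & 0xF) == 7


v7_sample = UUID("017f22e2-79b0-7cc3-98c4-dc0c0c07398f")
v4_sample = uuid4()
```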

timestamps

ISO 8601 timestamp utilities for the bead package.

This module provides functions for creating, parsing, and formatting ISO 8601 timestamps with timezone information. All timestamps use the UTC timezone.

now_iso8601() -> datetime

Get current UTC datetime with timezone information.

Returns the current time in UTC with timezone info attached. This is preferred over datetime.utcnow(), which is deprecated since Python 3.12 and returns a naive datetime without timezone information.

Returns:

Type Description
datetime

Current UTC datetime with timezone information

Examples:

>>> from datetime import UTC
>>> dt = now_iso8601()
>>> dt.tzinfo is not None
True
>>> dt.tzinfo == UTC
True

parse_iso8601(timestamp: str) -> datetime

Parse ISO 8601 timestamp string to datetime.

Parses an ISO 8601 formatted string into a datetime object. The string should include timezone information.

Parameters:

Name Type Description Default
timestamp str

ISO 8601 formatted timestamp string (e.g., "2025-10-17T14:23:45.123456+00:00").

required

Returns:

Type Description
datetime

Parsed datetime with timezone information

Examples:

>>> dt_str = "2025-10-17T14:23:45.123456+00:00"
>>> dt = parse_iso8601(dt_str)
>>> dt.year
2025
>>> dt.month
10
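A parser of this kind can be sketched on top of the standard library's datetime.fromisoformat. The name parse_iso8601_sketch is hypothetical, and rejecting input without an offset is an assumption drawn from the statement that the string should include timezone information; the real function may behave differently.

```python
from datetime import datetime, timedelta


def parse_iso8601_sketch(timestamp: str) -> datetime:
    dt = datetime.fromisoformat(timestamp)
    if dt.tzinfo is None:
        # Assumption: naive input is rejected rather than silently
        # treated as UTC.
        raise ValueError(f"timestamp has no timezone offset: {timestamp!r}")
    return dt


parsed = parse_iso8601_sketch("2025-10-17T14:23:45.123456+00:00")
```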

format_iso8601(dt: datetime) -> str

Format datetime as ISO 8601 string.

Converts a datetime object to an ISO 8601 formatted string. If the datetime doesn't have timezone information, it will be assumed to be UTC.

Parameters:

Name Type Description Default
dt datetime

Datetime to format.

required

Returns:

Type Description
str

ISO 8601 formatted string

Examples:

>>> dt = now_iso8601()
>>> formatted = format_iso8601(dt)
>>> "+00:00" in formatted or "Z" in formatted
True
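The naive-means-UTC rule can be sketched as follows. now_utc_sketch and format_iso8601_sketch are hypothetical names, and the exact output format of the real function (for example "+00:00" versus "Z") is not specified here, so the sketch simply uses datetime.isoformat.

```python
from datetime import datetime, timezone


def now_utc_sketch() -> datetime:
    # Timezone-aware replacement for the deprecated datetime.utcnow().
    return datetime.now(timezone.utc)


def format_iso8601_sketch(dt: datetime) -> str:
    if dt.tzinfo is None:
        # A naive datetime is assumed to already be in UTC, so the
        # offset is attached without shifting the clock time.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.isoformat()


naive = datetime(2025, 10, 17, 14, 23, 45)
formatted = format_iso8601_sketch(naive)
```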

Metadata and Validation

metadata

Metadata tracking models for provenance and processing history.

Tracks provenance chains and processing history for full data lineage. Models are frozen; updates return new instances through pure with_* methods.

ProvenanceRecord

Bases: BeadBaseModel

A single parent-child relationship in a provenance chain.

Attributes:

Name Type Description
parent_id UUID

UUID of the parent object.

parent_type str

Type name of the parent object (e.g. "LexicalItem").

relationship str

Nature of the relationship (e.g. "derived_from").

timestamp datetime

When this relationship was established.

ProcessingRecord

Bases: BeadBaseModel

A single processing operation in an object's history.

Attributes:

Name Type Description
operation str

Name of the operation.

parameters dict[str, JsonValue]

Parameters passed to the operation.

timestamp datetime

When the operation was performed.

operator str | None

Identity of the agent that performed the operation.

MetadataTracker

Bases: BeadBaseModel

Frozen tracker for provenance and processing history.

Attributes:

Name Type Description
provenance tuple[ProvenanceRecord, ...]

Provenance relationships in insertion order.

processing_history tuple[ProcessingRecord, ...]

Processing operations in chronological order.

custom_metadata dict[str, JsonValue]

Custom annotations.

Examples:

>>> from uuid import uuid4
>>> tracker = MetadataTracker()
>>> parent_id = uuid4()
>>> tracker = tracker.with_provenance(parent_id, "Template", "filled_from")
>>> tracker = tracker.with_processing("fill_template", {"strategy": "exhaustive"})
>>> len(tracker.provenance)
1
>>> len(tracker.processing_history)
1

with_provenance(parent_id: UUID, parent_type: str, relationship: str) -> Self

Return a new tracker with one additional provenance record.

with_processing(operation: str, parameters: dict[str, JsonValue] | None = None, operator: str | None = None) -> Self

Return a new tracker with one additional processing record.

get_provenance_chain() -> tuple[UUID, ...]

Return the parent UUIDs of every provenance record in order.

get_recent_processing(n: int = 5) -> tuple[ProcessingRecord, ...]

Return the n most recent processing records, newest first.
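The append-by-copy behaviour of the tracker can be sketched with a frozen dataclass holding tuples. TrackerSketch is a hypothetical stand-in: records are reduced to plain tuples and strings, and returning an empty tuple for n <= 0 is an assumption.

```python
from __future__ import annotations

import dataclasses
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class TrackerSketch:
    """Hypothetical stand-in; records are reduced to plain tuples."""

    provenance: tuple = ()          # (parent_id, parent_type, relationship)
    processing_history: tuple = ()  # operation names, oldest first

    def with_provenance(
        self, parent_id: UUID, parent_type: str, relationship: str
    ) -> TrackerSketch:
        record = (parent_id, parent_type, relationship)
        # Build a new tuple instead of mutating the existing one.
        return dataclasses.replace(self, provenance=self.provenance + (record,))

    def with_processing(self, operation: str) -> TrackerSketch:
        history = self.processing_history + (operation,)
        return dataclasses.replace(self, processing_history=history)

    def get_provenance_chain(self) -> tuple:
        return tuple(record[0] for record in self.provenance)

    def get_recent_processing(self, n: int = 5) -> tuple:
        if n <= 0:
            return ()  # assumption: nothing is returned for n <= 0
        # Take the last n entries, then reverse so the newest comes first.
        return tuple(reversed(self.processing_history[-n:]))


empty = TrackerSketch()
parent = uuid4()
tracker = empty.with_provenance(parent, "Template", "filled_from")
tracker = tracker.with_processing("fill_template").with_processing("score")
```

After these calls empty is still empty, and tracker.get_recent_processing() lists "score" before "fill_template".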

validation

Validation utilities for data integrity checks.

Provides functions beyond didactic's built-in validation, including JSONLines-file validation, UUID-reference validation, and provenance-chain validation.

ValidationReport

Bases: Model

A frozen report of validation results.

Attributes:

Name Type Description
valid bool

Overall validation status. Set to False once any error is added.

errors tuple[str, ...]

Error messages.

warnings tuple[str, ...]

Warning messages.

object_count int

Number of objects validated.

Examples:

>>> report = ValidationReport(valid=True)
>>> report = report.add_error("Invalid field")
>>> report.valid
False
>>> report.has_errors()
True
>>> len(report.errors)
1

add_error(message: str) -> Self

Return a new report with message appended and valid=False.

add_warning(message: str) -> Self

Return a new report with message appended to warnings.

has_errors() -> bool

Return whether the report contains any errors.

has_warnings() -> bool

Return whether the report contains any warnings.
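The report follows the same copy-on-update pattern. A minimal sketch, assuming a frozen dataclass in place of the didactic Model; ReportSketch is a hypothetical name.

```python
from __future__ import annotations

import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportSketch:
    """Hypothetical stand-in for a frozen validation report."""

    valid: bool = True
    errors: tuple = ()
    warnings: tuple = ()
    object_count: int = 0

    def add_error(self, message: str) -> ReportSketch:
        # Any error flips valid to False on the returned copy.
        return dataclasses.replace(
            self, valid=False, errors=self.errors + (message,)
        )

    def add_warning(self, message: str) -> ReportSketch:
        # Warnings are recorded but do not affect valid.
        return dataclasses.replace(self, warnings=self.warnings + (message,))

    def has_errors(self) -> bool:
        return bool(self.errors)

    def has_warnings(self) -> bool:
        return bool(self.warnings)


clean = ReportSketch()
warned = clean.add_warning("Deprecated field")
failed = warned.add_error("Invalid field")
```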

validate_jsonlines_file(path: Path, model_class: type[dx.Model], strict: bool = True) -> ValidationReport

Validate every line of path against model_class.

Parameters:

Name Type Description Default
path Path

Path to the JSONLines file.

required
model_class type[Model]

didactic Model class to validate against.

required
strict bool

If True, return on the first error.

True

Returns:

Type Description
ValidationReport

Report containing the collected errors and the count of validated records.
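The effect of strict can be sketched with the standard json module. Everything here is an assumption made for illustration: validate_jsonlines_sketch, the check callable that stands in for model validation, the skipping of blank lines, and the (errors, count) return value in place of a ValidationReport.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable


def validate_jsonlines_sketch(
    path: Path, check: Callable[[dict], None], strict: bool = True
) -> tuple[list[str], int]:
    """Return (error messages, number of records that passed)."""
    errors: list[str] = []
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # assumption: blank lines are skipped
            try:
                # json.JSONDecodeError is a subclass of ValueError.
                check(json.loads(line))
                count += 1
            except (ValueError, TypeError, KeyError) as exc:
                errors.append(f"line {line_number}: {exc}")
                if strict:
                    break  # strict mode stops at the first error
    return errors, count


def require_id(record: dict) -> None:
    # Hypothetical check: every record must carry an "id" key.
    record["id"]
```

With strict=False the loop keeps going after a bad line, so one pass reports every problem in the file.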

validate_uuid_references(objects: Sequence[dx.Model], reference_pool: Mapping[UUID, dx.Model]) -> ValidationReport

Verify every UUID-typed field in objects points into reference_pool.

Supports single UUID fields and tuple/list-of-UUID fields. The object's own id attribute is excluded from the check.
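A reference check of this kind can be sketched over dataclass fields. find_dangling_references and ItemSketch are hypothetical names, and the sketch returns a list of messages instead of a ValidationReport.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any, Iterable, Mapping
from uuid import UUID, uuid4


def find_dangling_references(
    objects: Iterable[Any], reference_pool: Mapping[UUID, Any]
) -> list[str]:
    problems: list[str] = []
    for obj in objects:
        for spec in fields(obj):
            if spec.name == "id":
                continue  # the object's own id is not a reference
            value = getattr(obj, spec.name)
            # Treat a single value as a one-element collection so both
            # cases share the same loop.
            candidates = value if isinstance(value, (tuple, list)) else (value,)
            for candidate in candidates:
                if isinstance(candidate, UUID) and candidate not in reference_pool:
                    problems.append(
                        f"{type(obj).__name__}.{spec.name}: {candidate} not found"
                    )
    return problems


@dataclass(frozen=True)
class ItemSketch:
    id: UUID
    template_id: UUID
    source_ids: tuple = ()


known, missing = uuid4(), uuid4()
pool = {known: "template"}
good = ItemSketch(id=uuid4(), template_id=known, source_ids=(known,))
bad = ItemSketch(id=uuid4(), template_id=known, source_ids=(known, missing))
```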

validate_provenance_chain(metadata: MetadataTracker, repository: Mapping[UUID, dx.Model]) -> ValidationReport

Validate every parent reference in metadata's provenance chain.

Serialization

serialization

JSONLines serialization utilities for didactic Models.

Functions for reading, writing, streaming, and appending didactic Models in JSONLines files.

SerializationError

Bases: Exception

Raised when serialization to JSONLines fails.

DeserializationError

Bases: Exception

Raised when deserialization from JSONLines fails.

write_jsonlines(objects: Sequence[T], path: Path | str, validate: bool = True, append: bool = False) -> None

Write objects to path as JSONLines.

Parameters:

Name Type Description Default
objects Sequence[T]

Models to serialize.

required
path Path | str

Output file path.

required
validate bool

Unused; retained for API compatibility.

True
append bool

Whether to append to an existing file.

False

Raises:

Type Description
SerializationError

If writing fails.

read_jsonlines(path: Path | str, model_class: type[T], validate: bool = True, skip_errors: bool = False) -> list[T]

Read JSONLines from path into a list of model_class instances.

stream_jsonlines(path: Path | str, model_class: type[T], validate: bool = True) -> Iterator[T]

Yield model_class instances from path one line at a time.

append_jsonlines(objects: Sequence[T], path: Path | str, validate: bool = True) -> None

Append objects to path as JSONLines.
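The JSONLines round trip itself is simple: one JSON document per line. A minimal sketch over plain dicts, assuming nothing about how didactic Models are encoded; write_jsonl_sketch and stream_jsonl_sketch are hypothetical names.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def write_jsonl_sketch(
    records: Iterable[dict], path: Path, append: bool = False
) -> None:
    # "a" keeps existing lines; "w" truncates the file first.
    mode = "a" if append else "w"
    with path.open(mode, encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def stream_jsonl_sketch(path: Path) -> Iterator[dict]:
    # A generator reads one line at a time, so memory use stays flat
    # even for large files.
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Streaming suits files too large to hold in memory; wrapping the generator in list() gives the read-everything behaviour.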

Utilities

language_codes

ISO 639 language code validation and utilities.

validate_iso639_code(code: str | None) -> str | None

Validate a language code against ISO 639-1 or ISO 639-3.

Parameters:

Name Type Description Default
code str | None

Language code to validate (e.g., "en", "eng", "ko", "kor").

required

Returns:

Type Description
str | None

Normalized ISO 639-3 language code, or None if code is None.

Raises:

Type Description
ValueError

If code is not a valid ISO 639 language code.

Examples:

>>> validate_iso639_code("en")
'eng'
>>> validate_iso639_code("eng")
'eng'
>>> validate_iso639_code("ko")
'kor'
>>> validate_iso639_code(None) is None
True
>>> validate_iso639_code("invalid")
Traceback (most recent call last):
    ...
ValueError: Invalid language code: 'invalid'
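The normalization rule can be sketched with a lookup table. The two-entry table below covers only the codes used in the examples above and is purely illustrative; a real implementation would need the full ISO 639 registry, and validate_iso639_sketch is a hypothetical name.

```python
from __future__ import annotations

# Illustrative subset only: two-letter ISO 639-1 to three-letter ISO 639-3.
_ISO639_1_TO_3 = {"en": "eng", "ko": "kor"}
_ISO639_3 = set(_ISO639_1_TO_3.values())


def validate_iso639_sketch(code: str | None) -> str | None:
    if code is None:
        return None  # None passes through unchanged
    if code in _ISO639_3:
        return code  # already a three-letter code
    if code in _ISO639_1_TO_3:
        return _ISO639_1_TO_3[code]  # upgrade two-letter to three-letter
    raise ValueError(f"Invalid language code: {code!r}")
```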

repository

Repository pattern for didactic Models with optional caching.

A generic repository over a JSONLines file plus an optional in-memory cache. Mutations to a stored Model produce a new instance (didactic Models are frozen); the repository writes the new instance back via update.

Repository

Generic CRUD repository for didactic Models persisted as JSONLines.

Parameters:

Name Type Description Default
model_class type[T]

The didactic Model class this repository manages.

required
storage_path Path

Path to the JSONLines file for persistent storage.

required
use_cache bool

Whether to use the in-memory cache.

True

get(object_id: UUID) -> T | None

Return the object with object_id if present, else None.

get_all() -> list[T]

Return every object in the repository.

add(obj: T) -> None

Append obj to storage and update the cache.

add_many(objects: list[T]) -> None

Append every object in objects to storage and update the cache.

update(obj: T) -> None

Replace the stored object that has the same id with obj.

delete(object_id: UUID) -> None

Remove the object with object_id from storage.

exists(object_id: UUID) -> bool

Return whether an object with object_id exists.

count() -> int

Return the number of objects in the repository.

clear() -> None

Drop every object and delete the storage file.

rebuild_cache() -> None

Reload the cache from storage.
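How a JSONLines file and an in-memory cache fit together can be sketched as below. This is not the Repository implementation: JsonlRepositorySketch stores plain dicts keyed by a string "id", covers only part of the documented interface, and treats an update for an unknown id as an insert, all of which are assumptions.

```python
from __future__ import annotations

import json
from pathlib import Path


class JsonlRepositorySketch:
    """Hypothetical repository over dict records that carry an "id" key."""

    def __init__(self, storage_path: Path, use_cache: bool = True) -> None:
        self.storage_path = storage_path
        self.use_cache = use_cache
        self._cache: dict = {}
        if use_cache:
            self.rebuild_cache()

    def _load(self) -> dict:
        # Read the whole file into an id -> record mapping.
        if not self.storage_path.exists():
            return {}
        with self.storage_path.open(encoding="utf-8") as handle:
            records = [json.loads(line) for line in handle if line.strip()]
        return {record["id"]: record for record in records}

    def _save(self, records: dict) -> None:
        # Rewrite the file from scratch, one record per line.
        with self.storage_path.open("w", encoding="utf-8") as handle:
            for record in records.values():
                handle.write(json.dumps(record) + "\n")

    def _records(self) -> dict:
        return self._cache if self.use_cache else self._load()

    def rebuild_cache(self) -> None:
        self._cache = self._load()

    def get(self, object_id: str) -> dict | None:
        return self._records().get(object_id)

    def count(self) -> int:
        return len(self._records())

    def add(self, record: dict) -> None:
        # Appending keeps add cheap: existing lines are not rewritten.
        with self.storage_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        if self.use_cache:
            self._cache[record["id"]] = record

    def update(self, record: dict) -> None:
        # Assumption: an unknown id is simply inserted.
        records = self._load()
        records[record["id"]] = record
        self._save(records)
        if self.use_cache:
            self._cache = records

    def delete(self, object_id: str) -> None:
        records = self._load()
        records.pop(object_id, None)
        self._save(records)
        if self.use_cache:
            self._cache = records
```

In this sketch add only appends, while update and delete rewrite the whole file, a trade-off that follows directly from the line-oriented format.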