bead.data
Core data models and utilities used throughout the bead pipeline.
Base Models
base
Root didactic model for all bead objects.
Provides BeadBaseModel, the root didactic.api.Model that every bead
data model inherits from. Supplies UUIDv7 identity, UTC creation and
modification timestamps, schema versioning, and a metadata dictionary.
Models are frozen; updates produce new instances through with_ or the
convenience method touched (which refreshes modified_at).
BeadBaseModel
Bases: Model
Root didactic model for all bead objects.
Attributes:
| Name | Type | Description |
|---|---|---|
| id | UUID | UUIDv7 generated at construction time. |
| created_at | datetime | UTC timestamp of construction. |
| modified_at | datetime | UTC timestamp of the last |
| version | str | Schema version string. |
| metadata | dict[str, JsonValue] | Free-form key-value annotations. |
touched() -> Self
Return a copy with modified_at set to the current UTC time.
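The frozen-update pattern described above can be pictured with a standard-library stand-in. This is a minimal sketch that uses a frozen `dataclass` and `dataclasses.replace` in place of didactic; the `Record` class and its fields are hypothetical and only mirror the documented behaviour of `touched`, not the library's implementation.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone


def _utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a frozen model with timestamps."""

    name: str
    created_at: datetime = field(default_factory=_utc_now)
    modified_at: datetime = field(default_factory=_utc_now)

    def touched(self) -> "Record":
        # Frozen instances cannot be mutated, so build a new one
        # that differs only in modified_at.
        return replace(self, modified_at=_utc_now())


original = Record(name="item")
updated = original.touched()
```

The original instance is left unchanged, so callers that still hold a reference to it never observe a partial update.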
Identifiers and Timestamps
identifiers
UUIDv7 generation and utilities for the bead package.
This module provides functions for generating time-ordered UUIDv7 identifiers, extracting timestamps from them, and validating UUID versions.
generate_uuid() -> UUID
Generate a time-ordered UUIDv7.
UUIDv7 is a time-ordered UUID format that embeds a timestamp in the first 48 bits, making UUIDs sortable by creation time. This is useful for maintaining chronological ordering of database records.
Returns:
| Type | Description |
|---|---|
| UUID | A newly generated UUIDv7 with embedded timestamp |
Examples:
>>> uuid1 = generate_uuid()
>>> uuid2 = generate_uuid()
>>> uuid1 < uuid2 # uuids are time-ordered
True
extract_timestamp(uuid: UUID) -> int
Extract timestamp in milliseconds from a UUIDv7.
The timestamp is stored in the first 48 bits of the UUID and represents milliseconds since Unix epoch (January 1, 1970 00:00:00 UTC).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uuid | UUID | The UUIDv7 to extract timestamp from. | required |
Returns:
| Type | Description |
|---|---|
| int | Timestamp in milliseconds since Unix epoch |
Examples:
>>> import time
>>> uuid = generate_uuid()
>>> timestamp = extract_timestamp(uuid)
>>> current_time = int(time.time() * 1000)
>>> abs(timestamp - current_time) < 1000 # within 1 second
True
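To make the bit layout concrete, here is a minimal sketch that builds a UUIDv7-shaped value by hand and reads the timestamp back. The helpers `make_uuid7` and `timestamp_ms` are hypothetical names used for illustration; they are not bead functions, and bead's own generator may differ (for example in how it fills the random bits).

```python
import os
import time
from uuid import UUID


def make_uuid7(unix_ms: int) -> UUID:
    """Build a UUIDv7-shaped value from a millisecond timestamp (illustrative)."""
    value = (unix_ms & ((1 << 48) - 1)) << 80       # timestamp in the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")  # 80 random low bits
    value &= ~(0xF << 76)                           # clear the version nibble
    value |= 0x7 << 76                              # set version 7
    value &= ~(0x3 << 62)                           # clear the variant bits
    value |= 0x2 << 62                              # set the RFC variant (binary 10)
    return UUID(int=value)


def timestamp_ms(u: UUID) -> int:
    """Read the 48-bit millisecond timestamp from the top of the UUID."""
    return u.int >> 80


now_ms = int(time.time() * 1000)
u = make_uuid7(now_ms)
```

Because the timestamp occupies the most significant bits, two values built from different milliseconds compare in timestamp order.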
is_valid_uuid7(uuid: UUID) -> bool
Check if a UUID is a valid UUIDv7.
Validates that the UUID has version 7 by checking the version bits (bits 48-51), which should be 0111 (7).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uuid | UUID | The UUID to validate. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the UUID is version 7, False otherwise |
Examples:
>>> uuid7 = generate_uuid()
>>> is_valid_uuid7(uuid7)
True
>>> from uuid import uuid4
>>> uuid4_val = uuid4()
>>> is_valid_uuid7(uuid4_val)
False
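The version check described above amounts to reading one 4-bit field. The sketch below shows that read with the standard library only; `has_version` is a hypothetical helper, not the bead implementation, and the version 7 UUID is a hand-written illustrative value.

```python
from uuid import UUID, uuid4


def has_version(u: UUID, expected: int) -> bool:
    """Compare the 4-bit version field with `expected`.

    Counting from the most significant bit, the field occupies bits 48-51,
    which is a right shift of 76 in the 128-bit integer form.
    """
    return ((u.int >> 76) & 0xF) == expected


v4 = uuid4()
# Hand-written value whose version nibble is 7 (illustrative only).
v7 = UUID("018f4e2a-7b3c-7d21-9a4e-1234567890ab")
```

The standard library exposes the same field as `UUID.version`, so `v7.version == 7` is an equivalent check for UUIDs that use the RFC variant.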
timestamps
ISO 8601 timestamp utilities for the bead package.
This module provides functions for creating, parsing, and formatting ISO 8601 timestamps with timezone information. All timestamps use UTC timezone.
now_iso8601() -> datetime
Get current UTC datetime with timezone information.
Returns the current time in UTC with timezone info attached. This is preferred over datetime.utcnow(), which is deprecated and returns a naive datetime without timezone information.
Returns:
| Type | Description |
|---|---|
| datetime | Current UTC datetime with timezone information |
Examples:
>>> from datetime import UTC
>>> dt = now_iso8601()
>>> dt.tzinfo is not None
True
>>> dt.tzinfo == UTC
True
parse_iso8601(timestamp: str) -> datetime
Parse ISO 8601 timestamp string to datetime.
Parses an ISO 8601 formatted string into a datetime object. The string should include timezone information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| timestamp | str | ISO 8601 formatted timestamp string (e.g., "2025-10-17T14:23:45.123456+00:00"). | required |
Returns:
| Type | Description |
|---|---|
| datetime | Parsed datetime with timezone information |
Examples:
>>> dt_str = "2025-10-17T14:23:45.123456+00:00"
>>> dt = parse_iso8601(dt_str)
>>> dt.year
2025
>>> dt.month
10
format_iso8601(dt: datetime) -> str
Format datetime as ISO 8601 string.
Converts a datetime object to an ISO 8601 formatted string. If the datetime doesn't have timezone information, it will be assumed to be UTC.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dt | datetime | Datetime to format. | required |
Returns:
| Type | Description |
|---|---|
| str | ISO 8601 formatted string |
Examples:
>>> dt = now_iso8601()
>>> formatted = format_iso8601(dt)
>>> "+00:00" in formatted or "Z" in formatted
True
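The "assume UTC when naive" rule for formatting can be sketched with the standard library alone. The function `to_iso8601` below is a hypothetical stand-in written for illustration; it is not the bead implementation, which may differ in details such as the offset notation it emits.

```python
from datetime import datetime, timezone


def to_iso8601(dt: datetime) -> str:
    """Format `dt` as ISO 8601, treating naive datetimes as UTC."""
    if dt.tzinfo is None:
        # Attach UTC without shifting the wall-clock fields.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.isoformat()


naive = datetime(2025, 10, 17, 14, 23, 45, 123456)
text = to_iso8601(naive)
# Round trip: the standard parser restores the timezone-aware value.
parsed = datetime.fromisoformat(text)
```

Here `text` is `"2025-10-17T14:23:45.123456+00:00"`, and parsing it yields a timezone-aware datetime equal to the input interpreted as UTC.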
Metadata and Validation
metadata
Metadata tracking models for provenance and processing history.
Tracks provenance chains and processing history for full data lineage.
Models are frozen; updates return new instances through pure with_*
methods.
ProvenanceRecord
Bases: BeadBaseModel
A single parent-child relationship in a provenance chain.
Attributes:
| Name | Type | Description |
|---|---|---|
| parent_id | UUID | UUID of the parent object. |
| parent_type | str | Type name of the parent object (e.g. "LexicalItem"). |
| relationship | str | Nature of the relationship (e.g. "derived_from"). |
| timestamp | datetime | When this relationship was established. |
ProcessingRecord
Bases: BeadBaseModel
A single processing operation in an object's history.
Attributes:
| Name | Type | Description |
|---|---|---|
| operation | str | Name of the operation. |
| parameters | dict[str, JsonValue] | Parameters passed to the operation. |
| timestamp | datetime | When the operation was performed. |
| operator | str \| None | Identity of the agent that performed the operation. |
MetadataTracker
Bases: BeadBaseModel
Frozen tracker for provenance and processing history.
Attributes:
| Name | Type | Description |
|---|---|---|
| provenance | tuple[ProvenanceRecord, ...] | Provenance relationships in insertion order. |
| processing_history | tuple[ProcessingRecord, ...] | Processing operations in chronological order. |
| custom_metadata | dict[str, JsonValue] | Custom annotations. |
Examples:
>>> from uuid import uuid4
>>> tracker = MetadataTracker()
>>> parent_id = uuid4()
>>> tracker = tracker.with_provenance(parent_id, "Template", "filled_from")
>>> tracker = tracker.with_processing("fill_template", {"strategy": "exhaustive"})
>>> len(tracker.provenance)
1
>>> len(tracker.processing_history)
1
with_provenance(parent_id: UUID, parent_type: str, relationship: str) -> Self
Return a new tracker with one additional provenance record.
with_processing(operation: str, parameters: dict[str, JsonValue] | None = None, operator: str | None = None) -> Self
Return a new tracker with one additional processing record.
get_provenance_chain() -> tuple[UUID, ...]
Return the parent UUIDs of every provenance record in order.
get_recent_processing(n: int = 5) -> tuple[ProcessingRecord, ...]
Return the n most recent processing records, newest first.
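The append-and-read behaviour of the tracker methods can be sketched with a frozen dataclass holding a tuple. The `History` class, `with_operation`, and `recent` are hypothetical names chosen for illustration; the real `MetadataTracker` stores `ProcessingRecord` objects rather than strings, and its internals are not shown in this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class History:
    """Hypothetical stand-in for an append-only, frozen history."""

    operations: tuple[str, ...] = ()

    def with_operation(self, name: str) -> "History":
        # Tuples are immutable, so concatenation yields a fresh tuple
        # and the original History is left untouched.
        return replace(self, operations=self.operations + (name,))

    def recent(self, n: int = 5) -> tuple[str, ...]:
        # Take the last n entries, then reverse so the newest comes first.
        if n <= 0:
            return ()
        return self.operations[-n:][::-1]


empty = History()
history = empty.with_operation("load").with_operation("filter").with_operation("fill")
```

With three operations recorded, `history.recent(2)` returns `("fill", "filter")`, while `empty` still holds no operations.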
validation
Validation utilities for data integrity checks.
Provides functions beyond didactic's built-in validation, including JSONLines-file validation, UUID-reference validation, and provenance-chain validation.
ValidationReport
Bases: Model
A frozen report of validation results.
Attributes:
| Name | Type | Description |
|---|---|---|
| valid | bool | Overall validation status. Set to |
| errors | tuple[str, ...] | Error messages. |
| warnings | tuple[str, ...] | Warning messages. |
| object_count | int | Number of objects validated. |
Examples:
>>> report = ValidationReport(valid=True)
>>> report = report.add_error("Invalid field")
>>> report.valid
False
>>> report.has_errors()
True
>>> len(report.errors)
1
add_error(message: str) -> Self
Return a new report with message appended and valid=False.
add_warning(message: str) -> Self
Return a new report with message appended to warnings.
has_errors() -> bool
Return whether the report contains any errors.
has_warnings() -> bool
Return whether the report contains any warnings.
validate_jsonlines_file(path: Path, model_class: type[dx.Model], strict: bool = True) -> ValidationReport
Validate every line of path against model_class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Path to the JSONLines file. | required |
| model_class | type[Model] | didactic Model class to validate against. | required |
| strict | bool | If | True |
Returns:
| Type | Description |
|---|---|
| ValidationReport | Report containing the collected errors and the count of validated records. |
validate_uuid_references(objects: Sequence[dx.Model], reference_pool: Mapping[UUID, dx.Model]) -> ValidationReport
Verify every UUID-typed field in objects points into reference_pool.
Supports single UUID fields and tuple/list-of-UUID fields. The
object's own id attribute is excluded from the check.
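One way to picture the reference check described above is the following sketch. It assumes plain dataclasses instead of didactic models and returns a list of problems instead of a `ValidationReport`; `Node` and `dangling_references` are hypothetical names, not part of bead.

```python
from dataclasses import dataclass, fields
from typing import Optional
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Node:
    """Hypothetical object with a single and a multi-valued UUID reference."""

    id: UUID
    parent: Optional[UUID] = None
    children: tuple[UUID, ...] = ()


def dangling_references(objects, pool):
    """Return (object id, field name, missing UUID) for each unresolved reference."""
    missing = []
    for obj in objects:
        for f in fields(obj):
            if f.name == "id":  # the object's own identifier is not a reference
                continue
            value = getattr(obj, f.name)
            # Treat single values and tuple/list values uniformly.
            candidates = value if isinstance(value, (tuple, list)) else (value,)
            for ref in candidates:
                if isinstance(ref, UUID) and ref not in pool:
                    missing.append((obj.id, f.name, ref))
    return missing


root = Node(id=uuid4())
ghost = uuid4()  # never added to the pool
leaf = Node(id=uuid4(), parent=root.id, children=(ghost,))
pool = {root.id: root, leaf.id: leaf}
problems = dangling_references([root, leaf], pool)
```

Only the `children` entry pointing at `ghost` is reported; the `parent` reference resolves, and each object's own `id` is skipped.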
validate_provenance_chain(metadata: MetadataTracker, repository: Mapping[UUID, dx.Model]) -> ValidationReport
Validate every parent reference in metadata's provenance chain.
Serialization
serialization
JSONLines serialization utilities for didactic Models.
Functions for reading, writing, streaming, and appending didactic Models to and from JSONLines format files.
SerializationError
Bases: Exception
Raised when serialization to JSONLines fails.
DeserializationError
Bases: Exception
Raised when deserialization from JSONLines fails.
write_jsonlines(objects: Sequence[T], path: Path | str, validate: bool = True, append: bool = False) -> None
Write objects to path as JSONLines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| objects | Sequence[T] | Models to serialize. | required |
| path | Path \| str | Output file path. | required |
| validate | bool | Unused; retained for API compatibility. | True |
| append | bool | Whether to append to an existing file. | False |
Raises:
| Type | Description |
|---|---|
| SerializationError | If writing fails. |
read_jsonlines(path: Path | str, model_class: type[T], validate: bool = True, skip_errors: bool = False) -> list[T]
Read JSONLines from path into a list of model_class instances.
stream_jsonlines(path: Path | str, model_class: type[T], validate: bool = True) -> Iterator[T]
Yield model_class instances from path one line at a time.
append_jsonlines(objects: Sequence[T], path: Path | str, validate: bool = True) -> None
Append objects to path as JSONLines.
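The JSONLines format behind these functions is one JSON document per line. The sketch below shows writing, appending, and streaming with the standard library on plain dictionaries; `write_lines` and `stream_lines` are hypothetical helpers, and the bead functions additionally convert to and from didactic Models and raise their own error types.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def write_lines(records: list[dict], path: Path, append: bool = False) -> None:
    """Write one JSON document per line; append instead of truncating if asked."""
    mode = "a" if append else "w"
    with path.open(mode, encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def stream_lines(path: Path) -> Iterator[dict]:
    """Yield one decoded record at a time, skipping blank lines."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "items.jsonl"
    write_lines([{"id": 1}, {"id": 2}], target)
    write_lines([{"id": 3}], target, append=True)
    loaded = list(stream_lines(target))
```

Streaming keeps only one line in memory at a time, which is the usual reason to prefer a generator over reading the whole file into a list.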
Utilities
language_codes
ISO 639 language code validation and utilities.
validate_iso639_code(code: str | None) -> str | None
Validate language code against ISO 639-1 or ISO 639-3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| code | str \| None | Language code to validate (e.g., "en", "eng", "ko", "kor"). | required |
Returns:
| Type | Description |
|---|---|
| str \| None | Normalized language code (converted to ISO 639-3 if valid). |
Raises:
| Type | Description |
|---|---|
| ValueError | If code is not a valid ISO 639 language code. |
Examples:
>>> validate_iso639_code("en")
'eng'
>>> validate_iso639_code("eng")
'eng'
>>> validate_iso639_code("ko")
'kor'
>>> validate_iso639_code(None) is None
True
>>> validate_iso639_code("invalid")
Traceback (most recent call last):
...
ValueError: Invalid language code: 'invalid'
repository
Repository pattern for didactic Models with optional caching.
A generic repository over a JSONLines file plus an optional in-memory cache.
Mutations to a stored Model produce a new instance (didactic Models are
frozen); the repository writes the new instance back via update.
Repository
Generic CRUD repository for didactic Models persisted as JSONLines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_class | type[T] | The didactic Model class this repository manages. | required |
| storage_path | Path | Path to the JSONLines file for persistent storage. | required |
| use_cache | bool | Whether to use the in-memory cache. | True |
get(object_id: UUID) -> T | None
Return the object with object_id if present, else None.
get_all() -> list[T]
Return every object in the repository.
add(obj: T) -> None
Append obj to storage and update the cache.
add_many(objects: list[T]) -> None
Append every object in objects to storage and update the cache.
update(obj: T) -> None
Replace the stored object with the same id by obj.
delete(object_id: UUID) -> None
Remove the object with object_id from storage.
exists(object_id: UUID) -> bool
Return whether an object with object_id exists.
count() -> int
Return the number of objects in the repository.
clear() -> None
Drop every object and delete the storage file.
rebuild_cache() -> None
Reload the cache from storage.
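The general shape of a JSONLines-backed repository with an in-memory cache can be sketched as follows. This is an illustration under stated assumptions, not the bead `Repository`: it stores plain dictionaries keyed by a string `"id"`, the class name `DictRepository` is hypothetical, and rewriting the whole file on update is an assumed strategy that the real class may not share.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class DictRepository:
    """Hypothetical JSONLines-backed store keyed by each record's "id"."""

    def __init__(self, storage_path: Path) -> None:
        self.storage_path = storage_path
        self._cache: dict[str, dict] = {}
        self.rebuild_cache()

    def rebuild_cache(self) -> None:
        # Reload every record from disk; a missing file means an empty store.
        self._cache = {}
        if self.storage_path.exists():
            with self.storage_path.open(encoding="utf-8") as handle:
                for line in handle:
                    if line.strip():
                        record = json.loads(line)
                        self._cache[record["id"]] = record

    def add(self, record: dict) -> None:
        # Adding is a cheap append to the end of the file.
        with self.storage_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        self._cache[record["id"]] = record

    def update(self, record: dict) -> None:
        # Assumed strategy: replace in the cache, then rewrite the file.
        self._cache[record["id"]] = record
        with self.storage_path.open("w", encoding="utf-8") as handle:
            for stored in self._cache.values():
                handle.write(json.dumps(stored) + "\n")

    def get(self, record_id: str) -> Optional[dict]:
        return self._cache.get(record_id)

    def count(self) -> int:
        return len(self._cache)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "store.jsonl"
    repo = DictRepository(path)
    repo.add({"id": "a", "value": 1})
    repo.add({"id": "b", "value": 2})
    repo.update({"id": "a", "value": 10})
    # A second instance sees the persisted state through rebuild_cache.
    reloaded = DictRepository(path)
    value_a = reloaded.get("a")
    total = reloaded.count()
```

Opening a second instance on the same file shows that the update was persisted rather than held only in the first instance's cache.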