bead.data
Core data models and utilities used throughout the bead pipeline.
Base Models
base
Base Pydantic model for all bead objects.
This module provides BeadBaseModel, the foundational Pydantic v2 model that all bead data models should inherit from. It provides automatic ID generation, timestamp tracking, and versioning.
BeadBaseModel
Bases: BaseModel
Base Pydantic model for all bead objects.
This model provides foundational fields and configuration that all bead data models inherit. It includes automatic ID generation using UUIDv7, timestamp tracking for creation and modification, versioning, and metadata.
Attributes:
| Name | Type | Description |
|---|---|---|
| id | UUID | Unique identifier (UUIDv7) automatically generated on creation |
| created_at | datetime | UTC timestamp when object was created |
| modified_at | datetime | UTC timestamp when object was last modified |
| version | str | Version string for schema versioning (default: "1.0.0") |
| metadata | dict[str, JsonValue] | Optional metadata dictionary for arbitrary key-value pairs |
Examples:
>>> class MyModel(BeadBaseModel):
...     name: str
...     value: int
>>> obj = MyModel(name="test", value=42)
>>> obj.id
UUID('...')
>>> obj.version
'1.0.0'
>>> obj.update_modified_time()
>>> obj.modified_at > obj.created_at
True
update_modified_time() -> None
Update the modified_at timestamp to current UTC time.
This method should be called whenever the object is modified to maintain accurate modification tracking.
Examples:
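The pattern described above can be sketched without Pydantic. This is a minimal illustration, not bead's implementation: `TimestampedRecord` and `_utc_now` are hypothetical names, and a plain dataclass stands in for BeadBaseModel.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


@dataclass
class TimestampedRecord:
    """Hypothetical stand-in for a model with creation/modification stamps."""

    name: str
    created_at: datetime = field(default_factory=_utc_now)
    modified_at: datetime = field(default_factory=_utc_now)

    def update_modified_time(self) -> None:
        # Refresh modified_at; created_at is left untouched.
        self.modified_at = _utc_now()


record = TimestampedRecord(name="test")
record.name = "renamed"        # mutate the object...
record.update_modified_time()  # ...then record that it changed
```

Because the timestamp is only refreshed on an explicit call, a caller that mutates a field and forgets to call the method leaves modified_at stale.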
Identifiers and Timestamps
identifiers
UUIDv7 generation and utilities for bead package.
This module provides functions for generating time-ordered UUIDv7 identifiers, extracting timestamps from them, and validating UUID versions.
generate_uuid() -> UUID
Generate a time-ordered UUIDv7.
UUIDv7 is a time-ordered UUID format that embeds a timestamp in the first 48 bits, making UUIDs sortable by creation time. This is useful for maintaining chronological ordering of database records.
Returns:
| Type | Description |
|---|---|
| UUID | A newly generated UUIDv7 with embedded timestamp |
Examples:
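To make the bit layout concrete, here is a standard-library sketch of how a UUIDv7 can be assembled. The helper `make_uuid7` is hypothetical and is not bead's `generate_uuid`; it only illustrates the 48-bit timestamp prefix that makes the identifiers sortable.

```python
import os
import time
from uuid import UUID


def make_uuid7() -> UUID:
    """Build a UUIDv7: 48-bit ms timestamp, version 7, variant bits, random rest."""
    timestamp_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF            # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)   # 62 random bits
    value = (timestamp_ms & ((1 << 48) - 1)) << 80  # bits 0-47: timestamp
    value |= 0x7 << 76                               # bits 48-51: version 7
    value |= rand_a << 64                            # bits 52-63: random
    value |= 0b10 << 62                              # bits 64-65: variant
    value |= rand_b                                  # bits 66-127: random
    return UUID(int=value)


first = make_uuid7()
time.sleep(0.02)  # pause so the second ID gets a later timestamp
second = make_uuid7()
```

Since the timestamp occupies the most significant bits, comparing two such UUIDs created in different milliseconds orders them by creation time.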
extract_timestamp(uuid: UUID) -> int
Extract timestamp in milliseconds from a UUIDv7.
The timestamp is stored in the first 48 bits of the UUID and represents milliseconds since Unix epoch (January 1, 1970 00:00:00 UTC).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uuid | UUID | The UUIDv7 to extract timestamp from. | required |
Returns:
| Type | Description |
|---|---|
| int | Timestamp in milliseconds since Unix epoch |
Examples:
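Reading the first 48 bits amounts to a right shift of the 128-bit integer by 80. The sketch below uses a hypothetical helper, `timestamp_ms_from_uuid7`, and a hand-built UUID so the result can be checked; it is an illustration of the bit arithmetic, not bead's code.

```python
from uuid import UUID


def timestamp_ms_from_uuid7(value: UUID) -> int:
    """Return the top 48 bits of the 128-bit UUID as an integer."""
    return value.int >> 80


# Hand-built UUIDv7 whose first 48 bits encode 1_700_000_000_000 ms.
known_ms = 1_700_000_000_000
sample = UUID(int=(known_ms << 80) | (0x7 << 76) | (0b10 << 62))
recovered_ms = timestamp_ms_from_uuid7(sample)
```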
is_valid_uuid7(uuid: UUID) -> bool
Check if a UUID is a valid UUIDv7.
Validates that the UUID has version 7 by checking the version bits (bits 48-51) which should be 0111 (7).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uuid | UUID | The UUID to validate. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the UUID is version 7, False otherwise |
Examples:
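The version check can be sketched by masking the four bits that follow the 48-bit timestamp. `looks_like_uuid7` is a hypothetical name used for illustration only.

```python
from uuid import UUID, uuid4


def looks_like_uuid7(value: UUID) -> bool:
    """Check the 4 version bits that follow the 48-bit timestamp."""
    return ((value.int >> 76) & 0xF) == 7


version7 = UUID(int=(0x7 << 76) | (0b10 << 62))  # minimal hand-built v7
version4 = uuid4()                                # random UUID, version 4

v7_ok = looks_like_uuid7(version7)
v4_ok = looks_like_uuid7(version4)
```

For UUIDs with the standard variant, the same answer is available from the standard library as `value.version == 7`.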
timestamps
ISO 8601 timestamp utilities for bead package.
This module provides functions for creating, parsing, and formatting ISO 8601 timestamps with timezone information. All timestamps use UTC timezone.
now_iso8601() -> datetime
Get current UTC datetime with timezone information.
Returns the current time in UTC with timezone info attached. This is preferred over datetime.utcnow(), which is deprecated since Python 3.12 and returns a naive datetime without timezone information.
Returns:
| Type | Description |
|---|---|
| datetime | Current UTC datetime with timezone information |
Examples:
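The difference between an aware and a naive datetime can be shown with the standard library alone. This is a sketch of the underlying call, not necessarily how `now_iso8601` is written.

```python
from datetime import datetime, timezone

# Timezone-aware "now": tzinfo is set to UTC.
aware_now = datetime.now(timezone.utc)

# A naive datetime, by contrast, carries no tzinfo at all.
naive = datetime(2025, 10, 17, 14, 23, 45)
```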
parse_iso8601(timestamp: str) -> datetime
Parse ISO 8601 timestamp string to datetime.
Parses an ISO 8601 formatted string into a datetime object. The string should include timezone information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| timestamp | str | ISO 8601 formatted timestamp string (e.g., "2025-10-17T14:23:45.123456+00:00"). | required |
Returns:
| Type | Description |
|---|---|
| datetime | Parsed datetime with timezone information |
Examples:
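For the timestamp format shown in the parameter description, the standard library's `datetime.fromisoformat` already produces an aware datetime. Whether `parse_iso8601` delegates to it is an assumption; the sketch only shows what a parsed value looks like.

```python
from datetime import datetime, timedelta

parsed = datetime.fromisoformat("2025-10-17T14:23:45.123456+00:00")

# The "+00:00" suffix is what makes the result timezone-aware.
offset = parsed.utcoffset()
```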
format_iso8601(dt: datetime) -> str
Format datetime as ISO 8601 string.
Converts a datetime object to an ISO 8601 formatted string. If the datetime doesn't have timezone information, it will be assumed to be UTC.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dt | datetime | Datetime to format. | required |
Returns:
| Type | Description |
|---|---|
| str | ISO 8601 formatted string |
Examples:
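The "naive means UTC" rule described above can be sketched as follows. `to_iso8601` is a hypothetical helper written for illustration, not bead's `format_iso8601`.

```python
from datetime import datetime, timezone


def to_iso8601(dt: datetime) -> str:
    """Format dt as ISO 8601, treating naive datetimes as UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.isoformat()


naive = datetime(2025, 10, 17, 14, 23, 45)
formatted = to_iso8601(naive)
round_tripped = datetime.fromisoformat(formatted)
```

Note that `replace(tzinfo=...)` attaches the timezone without shifting the clock time, which is the right behaviour only when the naive value really was recorded in UTC.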
Metadata and Validation
metadata
Metadata tracking models for provenance and processing history.
This module provides models for tracking provenance chains and processing history for all bead objects. This enables full traceability of data transformations.
ProvenanceRecord
Bases: BeadBaseModel
Record of a provenance relationship between objects.
Tracks a single parent-child relationship in the provenance chain, including what the parent was, its type, and the nature of the relationship.
Attributes:
| Name | Type | Description |
|---|---|---|
| parent_id | UUID | UUID of the parent object in the provenance chain |
| parent_type | str | Type name of the parent object (e.g., "LexicalItem", "Template") |
| relationship | str | Type of relationship (e.g., "derived_from", "filled_from", "generated_from") |
| timestamp | datetime | When this relationship was established (UTC with timezone) |
Examples:
>>> from uuid import uuid4
>>> parent_id = uuid4()
>>> record = ProvenanceRecord(
...     parent_id=parent_id,
...     parent_type="Template",
...     relationship="filled_from"
... )
>>> record.parent_type
'Template'
>>> record.timestamp is not None
True
ProcessingRecord
Bases: BeadBaseModel
Record of a processing operation applied to an object.
Tracks a single operation in the processing history, including the operation name, parameters used, when it was performed, and who/what performed it.
Attributes:
| Name | Type | Description |
|---|---|---|
| operation | str | Name of the operation (e.g., "fill_template", "apply_constraint", "filter") |
| parameters | dict[str, JsonValue] | Parameters passed to the operation (default: empty dict) |
| timestamp | datetime | When the operation was performed (UTC with timezone) |
| operator | str \| None | Who/what performed the operation (e.g., "TemplateFiller-v1.0", user ID) (default: None) |
Examples:
>>> record = ProcessingRecord(
...     operation="fill_template",
...     parameters={"strategy": "exhaustive", "max_items": 100},
...     operator="TemplateFiller-v1.0"
... )
>>> record.operation
'fill_template'
>>> record.parameters["strategy"]
'exhaustive'
>>> record.timestamp is not None
True
MetadataTracker
Bases: BeadBaseModel
Metadata tracking for provenance and processing history.
Tracks both provenance (where data came from) and processing history (what operations were applied) for complete data lineage.
Attributes:
| Name | Type | Description |
|---|---|---|
| provenance | list[ProvenanceRecord] | Chain of provenance relationships (default: empty list) |
| processing_history | list[ProcessingRecord] | History of processing operations (default: empty list) |
| custom_metadata | dict[str, JsonValue] | Custom metadata fields (default: empty dict) |
Examples:
>>> from uuid import uuid4
>>> tracker = MetadataTracker()
>>> parent_id = uuid4()
>>> tracker.add_provenance(parent_id, "Template", "filled_from")
>>> tracker.add_processing("fill_template", {"strategy": "exhaustive"})
>>> len(tracker.provenance)
1
>>> len(tracker.processing_history)
1
>>> chain = tracker.get_provenance_chain()
>>> len(chain)
1
add_provenance(parent_id: UUID, parent_type: str, relationship: str) -> None
Add a provenance record to the chain.
Creates a new provenance record and adds it to the provenance list. The timestamp is automatically set to the current time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| parent_id | UUID | UUID of the parent object | required |
| parent_type | str | Type name of the parent object (e.g., "Template", "LexicalItem") | required |
| relationship | str | Type of relationship (e.g., "derived_from", "filled_from") | required |
add_processing(operation: str, parameters: dict[str, JsonValue] | None = None, operator: str | None = None) -> None
Add a processing record to the history.
Creates a new processing record and adds it to the processing history. The timestamp is automatically set to the current time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| operation | str | Name of the operation performed | required |
| parameters | dict[str, JsonValue] \| None | Parameters passed to the operation (default: None, which creates empty dict) | None |
| operator | str \| None | Who/what performed the operation (default: None) | None |
Examples:
>>> tracker = MetadataTracker()
>>> tracker.add_processing("fill_template", {"strategy": "exhaustive"})
>>> len(tracker.processing_history)
1
>>> tracker.processing_history[0].operation
'fill_template'
>>> tracker.add_processing("filter", operator="FilterSystem-v2.0")
>>> tracker.processing_history[1].operator
'FilterSystem-v2.0'
get_provenance_chain() -> list[UUID]
Get the full provenance chain as a list of parent UUIDs.
Returns the parent UUIDs in the order they were added to the provenance list.
Returns:
| Type | Description |
|---|---|
| list[UUID] | List of parent UUIDs in chronological order |
Examples:
>>> from uuid import uuid4
>>> tracker = MetadataTracker()
>>> parent1 = uuid4()
>>> parent2 = uuid4()
>>> tracker.add_provenance(parent1, "Template", "filled_from")
>>> tracker.add_provenance(parent2, "LexicalItem", "derived_from")
>>> chain = tracker.get_provenance_chain()
>>> len(chain)
2
>>> chain[0] == parent1
True
get_recent_processing(n: int = 5) -> list[ProcessingRecord]
Get the N most recent processing records.
Returns the most recent processing records, up to N records. If there are fewer than N records, returns all available records.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of recent records to return (default: 5) | 5 |
Returns:
| Type | Description |
|---|---|
| list[ProcessingRecord] | List of up to N most recent processing records, newest first |
Examples:
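The "last N, newest first" selection can be sketched with list slicing. `most_recent` is a hypothetical helper operating on plain strings rather than ProcessingRecord objects, and the handling of `n <= 0` is an assumption, since the documentation does not say what bead does in that case.

```python
def most_recent(history: list[str], n: int = 5) -> list[str]:
    """Return up to n items from the end of history, newest first."""
    if n <= 0:
        return []  # history[-0:] would otherwise return the whole list
    return history[-n:][::-1]


operations = ["load", "fill_template", "apply_constraint", "filter"]
last_two = most_recent(operations, 2)
everything = most_recent(operations, 10)  # fewer than 10 exist, so all are returned
```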
validation
Validation utilities for data integrity checks.
This module provides validation functions beyond Pydantic's built-in validation, including file validation, reference validation, and provenance chain validation.
ValidationReport
Bases: BaseModel
Report of validation results.
A lightweight model for collecting and reporting validation results, including errors, warnings, and statistics about validated objects.
Attributes:
| Name | Type | Description |
|---|---|---|
| valid | bool | Overall validation status (False if any errors) |
| errors | list[str] | List of error messages (default: empty list) |
| warnings | list[str] | List of warning messages (default: empty list) |
| object_count | int | Number of objects validated (default: 0) |
Examples:
>>> report = ValidationReport(valid=True)
>>> report.add_error("Invalid field")
>>> report.valid
False
>>> report.has_errors()
True
>>> len(report.errors)
1
validate_jsonlines_file(path: Path, model_class: type[BaseModel], strict: bool = True) -> ValidationReport
Validate JSONLines file against Pydantic model schema.
Reads and validates each line in a JSONLines file against the provided model class. Empty lines are skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Path to JSONLines file to validate. | required |
| model_class | type[BaseModel] | Pydantic model class to validate against. | required |
| strict | bool | If True, stop at first error. If False, collect all errors (default: True). | True |
Returns:
| Type | Description |
|---|---|
| ValidationReport | Validation report with results |
Examples:
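The effect of the strict flag can be sketched with the standard library. Everything here is hypothetical: `check_jsonlines` returns a plain list of messages instead of a ValidationReport, and `require_name` stands in for validation against a Pydantic model.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def check_jsonlines(path: Path, check: Callable[[dict], None], strict: bool = True) -> list[str]:
    """Return error messages for lines that fail JSON parsing or `check`."""
    errors: list[str] = []
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # empty lines are skipped
            try:
                check(json.loads(line))
            except ValueError as exc:  # JSONDecodeError is a ValueError subclass
                errors.append(f"line {line_number}: {exc}")
                if strict:
                    break  # strict mode stops at the first error
    return errors


def require_name(record: dict) -> None:
    # Hypothetical schema check standing in for a Pydantic model.
    if not isinstance(record.get("name"), str):
        raise ValueError("missing string field 'name'")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "items.jsonl"
    sample.write_text('{"name": "a"}\n\n{"name": 1}\nnot json\n', encoding="utf-8")
    strict_errors = check_jsonlines(sample, require_name, strict=True)
    all_errors = check_jsonlines(sample, require_name, strict=False)
```

Strict mode is the cheaper choice for a pass/fail gate, while the non-strict mode is more useful when cleaning a file, because it reports every bad line in one run.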
validate_uuid_references(objects: list[BaseModel], reference_pool: dict[UUID, BaseModel]) -> ValidationReport
Validate that UUID references point to existing objects.
Checks all UUID fields in objects to ensure they reference valid objects in the reference pool. Supports both single UUID fields and list[UUID] fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| objects | list[BaseModel] | List of objects to validate. | required |
| reference_pool | dict[UUID, BaseModel] | Dictionary of valid UUIDs to objects. | required |
Returns:
| Type | Description |
|---|---|
| ValidationReport | Validation report with results |
Examples:
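A rough sketch of the reference check, covering both single UUID fields and list[UUID] fields. The `Item` dataclass and `missing_references` helper are hypothetical; the real function works on Pydantic models and returns a ValidationReport.

```python
from dataclasses import dataclass, field, fields
from uuid import UUID, uuid4


@dataclass
class Item:
    """Hypothetical object holding one UUID reference and a list of them."""

    template_id: UUID
    related_ids: list[UUID] = field(default_factory=list)


def missing_references(objects: list[Item], pool: dict[UUID, object]) -> list[str]:
    """Report every UUID-valued field entry that does not resolve in `pool`."""
    problems: list[str] = []
    for obj in objects:
        for spec in fields(obj):
            value = getattr(obj, spec.name)
            candidates = value if isinstance(value, list) else [value]
            for candidate in candidates:
                if isinstance(candidate, UUID) and candidate not in pool:
                    problems.append(f"{spec.name}: {candidate} not found")
    return problems


known, unknown = uuid4(), uuid4()
pool = {known: "template"}
item = Item(template_id=known, related_ids=[known, unknown])
problems = missing_references([item], pool)
```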
validate_provenance_chain(metadata: MetadataTracker, repository: dict[UUID, BaseModel]) -> ValidationReport
Validate provenance chain references are valid.
Checks that all parent_id references in the provenance chain exist in the repository and that parent_type matches the actual type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | MetadataTracker | Metadata tracker with provenance chain to validate. | required |
| repository | dict[UUID, BaseModel] | Dictionary of valid UUIDs to objects. | required |
Returns:
| Type | Description |
|---|---|
| ValidationReport | Validation report with results |
Examples:
>>> from uuid import uuid4
>>> from bead.data.base import BeadBaseModel
>>> from bead.data.metadata import MetadataTracker
>>> class Template(BeadBaseModel):
...     name: str
>>> template = Template(name="test")
>>> metadata = MetadataTracker()
>>> metadata.add_provenance(template.id, "Template", "filled_from")
>>> repo = {template.id: template}
>>> report = validate_provenance_chain(metadata, repo)
>>> report.valid
True
Serialization
serialization
JSONLines serialization utilities for bead package.
This module provides functions for reading, writing, streaming, and appending Pydantic models to/from JSONLines format files. JSONLines is a convenient format for storing multiple JSON objects, with one object per line.
SerializationError
Bases: Exception
Exception raised when serialization to JSONLines fails.
This exception is raised when writing Pydantic objects to JSONLines format encounters an error, such as file I/O issues or validation failures.
DeserializationError
Bases: Exception
Exception raised when deserialization from JSONLines fails.
This exception is raised when reading JSONLines format into Pydantic objects encounters an error, such as file not found, invalid JSON, or validation failures.
write_jsonlines(objects: Sequence[T], path: Path | str, validate: bool = True, append: bool = False) -> None
Write Pydantic objects to JSONLines file.
Serializes a sequence of Pydantic model instances to a JSONLines file, with one JSON object per line. Each object is validated before writing if validate=True.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| objects | Sequence[T] | Sequence of Pydantic model instances to serialize. | required |
| path | Path \| str | Path to the output file. | required |
| validate | bool | Whether to validate objects before writing (default: True). | True |
| append | bool | Whether to append to existing file or overwrite (default: False). | False |
Raises:
| Type | Description |
|---|---|
| SerializationError | If writing fails due to I/O error or validation failure |
Examples:
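The one-object-per-line layout and the append/overwrite switch can be sketched with the `json` module. `write_lines` is a hypothetical helper that writes plain dicts; it does not validate or wrap errors the way `write_jsonlines` is documented to.

```python
import json
import tempfile
from pathlib import Path


def write_lines(records: list[dict], path: Path, append: bool = False) -> None:
    """Write one JSON object per line; mode "a" appends, "w" overwrites."""
    with path.open("a" if append else "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "items.jsonl"
    write_lines([{"name": "a"}, {"name": "b"}], target)  # creates the file
    write_lines([{"name": "c"}], target, append=True)    # adds a third line
    lines = target.read_text(encoding="utf-8").splitlines()
```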
read_jsonlines(path: Path | str, model_class: type[T], validate: bool = True, skip_errors: bool = False) -> list[T]
Read JSONLines file into list of Pydantic objects.
Deserializes a JSONLines file into a list of Pydantic model instances. Each line should contain a valid JSON object. Empty lines are skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path \| str | Path to the input file. | required |
| model_class | type[T] | Pydantic model class to deserialize into. | required |
| validate | bool | Whether to validate objects during parsing (default: True). | True |
| skip_errors | bool | Whether to skip invalid lines or raise error (default: False). | False |
Returns:
| Type | Description |
|---|---|
| list[T] | List of deserialized Pydantic objects |
Raises:
| Type | Description |
|---|---|
| DeserializationError | If reading fails due to file not found, invalid JSON, or validation failure (unless skip_errors=True) |
Examples:
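The skip_errors behaviour can be sketched as follows. `read_lines` is a hypothetical helper that returns plain dicts and re-raises `json.JSONDecodeError` rather than bead's DeserializationError.

```python
import json
import tempfile
from pathlib import Path


def read_lines(path: Path, skip_errors: bool = False) -> list[dict]:
    """Parse each non-empty line as JSON; optionally skip lines that fail."""
    records: list[dict] = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # empty lines are skipped
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                if not skip_errors:
                    raise
    return records


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "items.jsonl"
    source.write_text('{"name": "a"}\n\nbroken\n{"name": "b"}\n', encoding="utf-8")
    tolerant = read_lines(source, skip_errors=True)
    try:
        read_lines(source)
        raised = False
    except json.JSONDecodeError:
        raised = True
```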
stream_jsonlines(path: Path | str, model_class: type[T], validate: bool = True) -> Iterator[T]
Stream JSONLines file as iterator of Pydantic objects.
Memory-efficient iterator that yields Pydantic model instances one at a time from a JSONLines file. Useful for processing large files without loading everything into memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path \| str | Path to the input file. | required |
| model_class | type[T] | Pydantic model class to deserialize into. | required |
| validate | bool | Whether to validate objects during parsing (default: True). | True |
Yields:
| Type | Description |
|---|---|
| T | Pydantic model instances one at a time. |
Raises:
| Type | Description |
|---|---|
| DeserializationError | If reading fails due to file not found, invalid JSON, or validation failure |
Examples:
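Streaming is naturally expressed as a generator that holds only one line in memory at a time. `stream_lines` below is a hypothetical, dict-based sketch of that idea rather than bead's `stream_jsonlines`.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def stream_lines(path: Path) -> Iterator[dict]:
    """Yield one parsed record at a time instead of building a list."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "items.jsonl"
    source.write_text(
        "".join(json.dumps({"value": i}) + "\n" for i in range(1000)),
        encoding="utf-8",
    )
    # Aggregate without ever materialising all 1000 records.
    total = sum(record["value"] for record in stream_lines(source))
```

The file stays open for as long as the generator is being consumed, so the iteration should finish (or the generator be closed) before the file is moved or deleted.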
append_jsonlines(objects: Sequence[T], path: Path | str, validate: bool = True) -> None
Append Pydantic objects to existing JSONLines file.
Convenience wrapper around write_jsonlines with append=True. Adds objects to the end of an existing JSONLines file, or creates a new file if it doesn't exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| objects | Sequence[T] | Sequence of Pydantic model instances to serialize. | required |
| path | Path \| str | Path to the output file. | required |
| validate | bool | Whether to validate objects before writing (default: True). | True |
Raises:
| Type | Description |
|---|---|
| SerializationError | If appending fails due to I/O error or validation failure |
Utilities
language_codes
ISO 639 language code validation and utilities.
validate_iso639_code(code: str | None) -> str | None
Validate language code against ISO 639-1 or ISO 639-3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| code | str \| None | Language code to validate (e.g., "en", "eng", "ko", "kor"). | required |
Returns:
| Type | Description |
|---|---|
| str \| None | Normalized language code (converted to ISO 639-3 if valid). |
Raises:
| Type | Description |
|---|---|
| ValueError | If code is not a valid ISO 639 language code. |
Examples:
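The normalization to ISO 639-3 can be sketched with a lookup table. This is an illustration under stated assumptions: `normalize_language_code` is a hypothetical name, the table is a three-entry subset of the real standard, and the None pass-through and case-insensitive matching are guesses based on the signature, not documented behaviour.

```python
from typing import Optional

# Tiny illustrative subset of the ISO 639-1 -> ISO 639-3 mapping.
ISO639_1_TO_3 = {"en": "eng", "ko": "kor", "de": "deu"}
ISO639_3_CODES = set(ISO639_1_TO_3.values())


def normalize_language_code(code: Optional[str]) -> Optional[str]:
    """Return the three-letter code, pass None through, or raise ValueError."""
    if code is None:
        return None
    lowered = code.strip().lower()
    if lowered in ISO639_1_TO_3:
        return ISO639_1_TO_3[lowered]  # two-letter code: convert
    if lowered in ISO639_3_CODES:
        return lowered                 # already three-letter: keep
    raise ValueError(f"Invalid ISO 639 language code: {code!r}")


from_two_letter = normalize_language_code("en")
from_three_letter = normalize_language_code("kor")
```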
repository
Repository pattern for data access with optional caching.
This module provides a generic Repository class that implements CRUD operations for Pydantic models, with optional in-memory caching for efficient access.
Repository
Generic repository for CRUD operations on Pydantic models.
Provides create, read, update, delete operations with JSONLines file storage and optional in-memory caching for efficient data access.
Class Type Parameters:
| Name | Bound or Constraints | Description | Default |
|---|---|---|---|
| T | BaseModel | Pydantic model type this repository manages | required |
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_class | type[T] | The Pydantic model class this repository manages | required |
| storage_path | Path | Path to the JSONLines file for persistent storage | required |
| use_cache | bool | Whether to use in-memory caching (default: True) | True |
Attributes:
| Name | Type | Description |
|---|---|---|
| model_class | type[T] | The Pydantic model class |
| storage_path | Path | Path to storage file |
| use_cache | bool | Whether caching is enabled |
| cache | dict[UUID, T] | In-memory cache of objects by ID |
Examples:
>>> from pathlib import Path
>>> from bead.data.base import BeadBaseModel
>>> class MyModel(BeadBaseModel):
...     name: str
>>> repo = Repository[MyModel](
...     model_class=MyModel,
...     storage_path=Path("data/models.jsonl"),
...     use_cache=True
... )
>>> obj = MyModel(name="test")
>>> repo.add(obj)
>>> loaded = repo.get(obj.id)
>>> loaded.name
'test'
>>> repo.count()
1
get(object_id: UUID) -> T | None
Get object by ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| object_id | UUID | ID of the object to retrieve. | required |
Returns:
| Type | Description |
|---|---|
| T \| None | The object if found, None otherwise. |
get_all() -> list[T]
add(obj: T) -> None
Add single object to repository.
Appends the object to the storage file and updates cache if enabled.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| obj | T | Object to add. | required |
add_many(objects: list[T]) -> None
Add multiple objects to repository.
Appends all objects to the storage file and updates cache if enabled.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| objects | list[T] | List of objects to add. | required |
update(obj: T) -> None
Update existing object.
Rewrites the entire storage file with the updated object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| obj | T | Object to update (must have existing ID). | required |
delete(object_id: UUID) -> None
Delete object by ID.
Rewrites the entire storage file without the deleted object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| object_id | UUID | ID of object to delete. | required |
Examples:
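The storage behaviour described for add, update, and delete (append on add, full rewrite on update and delete) can be sketched with a small dict-based class. `TinyRepository` is hypothetical: it stores plain dicts keyed by a string "id", always keeps a cache, and its handling of unknown IDs is an assumption rather than bead's documented behaviour.

```python
import json
import tempfile
from pathlib import Path


class TinyRepository:
    """Hypothetical dict-based stand-in for a JSONLines-backed repository."""

    def __init__(self, storage_path: Path) -> None:
        self.storage_path = storage_path
        self.cache: dict[str, dict] = {}

    def add(self, record: dict) -> None:
        # Adding only needs an append to the end of the file.
        with self.storage_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        self.cache[record["id"]] = record

    def _rewrite(self) -> None:
        # Updates and deletes rewrite the whole file from the cache.
        with self.storage_path.open("w", encoding="utf-8") as handle:
            for record in self.cache.values():
                handle.write(json.dumps(record) + "\n")

    def update(self, record: dict) -> None:
        if record["id"] not in self.cache:
            raise KeyError(record["id"])  # assumption: unknown IDs are an error
        self.cache[record["id"]] = record
        self._rewrite()

    def delete(self, record_id: str) -> None:
        self.cache.pop(record_id, None)  # assumption: missing IDs are ignored
        self._rewrite()


with tempfile.TemporaryDirectory() as tmp:
    repo = TinyRepository(Path(tmp) / "models.jsonl")
    repo.add({"id": "a", "name": "first"})
    repo.add({"id": "b", "name": "second"})
    repo.update({"id": "a", "name": "renamed"})
    repo.delete("b")
    stored = repo.storage_path.read_text(encoding="utf-8").splitlines()
```

The full rewrite keeps the file format trivial, at the price of update and delete costs that grow with the number of stored objects; add stays cheap because it only appends.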
exists(object_id: UUID) -> bool
Check if object exists.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| object_id | UUID | ID of object to check. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if object exists, False otherwise. |