bead.transforms
Value-level text transforms (str -> str, parameterised by a
TransformContext) used when rendering template slots and item prompts.
Transforms are registered by name in a TransformRegistry; any callable
conforming to the SpanTextTransform protocol can be registered.
Core Abstractions
base
Core abstractions for the span text transform system.
Defines the SpanTextTransform protocol, TransformContext
for passing metadata to transforms, TransformPipeline for
composing transforms, and TransformRegistry for name-based
lookup.
The transforms operate at the value level (str -> str parameterised
by a TransformContext). Use dx.Iso or dx.Lens directly when
the transformation crosses schema boundaries.
TransformContext
Bases: BeadBaseModel
Metadata available to transforms at resolution time.
Attributes:
| Name | Type | Description |
|---|---|---|
| language_code | str \| None | ISO 639 code (e.g. |
| lemma | str \| None | Lemma of the span head, if known. |
| pos | str \| None | Universal POS tag of the span head (e.g. |
| head_index | int \| None | Token index of the syntactic head within the span. |
| tokens | tuple[str, ...] | Individual tokens of the span text. Empty when unknown. |
| metadata | dict[str, JsonValue] | Arbitrary extra metadata. |
SpanTextTransform
Bases: Protocol
Protocol for a single text transform.
Any callable (str, TransformContext) -> str satisfies this
protocol. Implementations may ignore the context when the transform
is purely textual (e.g. lowercasing).
__call__(text: str, context: TransformContext) -> str
Apply the transform to text.
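The protocol is structural, so a plain function and a class with __call__ can both qualify. The sketch below illustrates that idea without importing bead: Ctx and TextTransform are hypothetical stand-ins for TransformContext and SpanTextTransform, and shout and LemmaSuffix are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, Tuple


@dataclass(frozen=True)
class Ctx:
    """Hypothetical stand-in for TransformContext (two fields only)."""

    lemma: Optional[str] = None
    tokens: Tuple[str, ...] = ()


class TextTransform(Protocol):
    """Stand-in for the (str, context) -> str call signature."""

    def __call__(self, text: str, context: Ctx) -> str: ...


def shout(text: str, context: Ctx) -> str:
    # Purely textual: the context is accepted but ignored.
    return text.upper() + "!"


class LemmaSuffix:
    """Context-aware: appends the head lemma when one is known."""

    def __call__(self, text: str, context: Ctx) -> str:
        return f"{text} ({context.lemma})" if context.lemma else text


# Both objects type-check against the protocol and are called the same way.
transforms: List[TextTransform] = [shout, LemmaSuffix()]
results = [t("run fast", Ctx(lemma="run")) for t in transforms]
print(results)
```

Neither object inherits from the protocol class; matching the call signature is enough.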
TransformPipeline
An ordered chain of transforms applied left-to-right.
Examples:
>>> from bead.transforms.text import LowerTransform, CapitalizeTransform
>>> ctx = TransformContext()
>>> pipe = TransformPipeline([LowerTransform(), CapitalizeTransform()])
>>> pipe("HELLO WORLD", ctx)
'Hello world'
__call__(text: str, context: TransformContext) -> str
Apply each transform in sequence.
__len__() -> int
Return the number of transforms in the pipeline.
__repr__() -> str
Return a debug-friendly representation of the pipeline.
append(transform: SpanTextTransform) -> None
Append a transform to the end of the pipeline.
prepend(transform: SpanTextTransform) -> None
Insert a transform at the beginning of the pipeline.
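Left-to-right application means each transform receives the output of the one before it, so order matters. The following is a minimal sketch of that behaviour, not the library's implementation: Pipeline is a hypothetical stand-in for TransformPipeline, and a plain dict stands in for the context.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in types; the real context is a TransformContext.
Transform = Callable[[str, dict], str]


class Pipeline:
    """Toy ordered chain: each step receives the previous step's output."""

    def __init__(self, transforms: Optional[List[Transform]] = None) -> None:
        self._transforms: List[Transform] = list(transforms or [])

    def __call__(self, text: str, context: dict) -> str:
        for transform in self._transforms:
            text = transform(text, context)
        return text

    def __len__(self) -> int:
        return len(self._transforms)

    def append(self, transform: Transform) -> None:
        self._transforms.append(transform)

    def prepend(self, transform: Transform) -> None:
        self._transforms.insert(0, transform)


def lower(text: str, context: dict) -> str:
    return text.lower()


def capitalize(text: str, context: dict) -> str:
    return text.capitalize()


pipe = Pipeline([lower, capitalize])
pipe.append(lambda text, context: text + "!")    # runs last
pipe.prepend(lambda text, context: text.strip())  # runs first
out = pipe("  HELLO WORLD ", {})

# Swapping the order changes the result: lowercasing now wins.
reordered = Pipeline([capitalize, lower])("HELLO WORLD", {})
print(out, "|", reordered)
```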
TransformRegistry
Name-to-transform mapping with pipeline construction.
Transforms are registered under short string names (e.g.
"gerund", "lower") and looked up when resolving
[[label|name1|name2]] prompt references.
Examples:
>>> from bead.transforms.text import LowerTransform
>>> reg = TransformRegistry()
>>> reg.register("lower", LowerTransform())
>>> t = reg.get("lower")
>>> t("HELLO", TransformContext())
'hello'
register(name: str, transform: SpanTextTransform | Callable[[str, TransformContext], str]) -> None
Register transform under name (case-insensitive).
get(name: str) -> SpanTextTransform
Return the transform registered under name.
Raises:
| Type | Description |
|---|---|
| KeyError | If no transform with that name exists. |
resolve_pipeline(names: list[str]) -> TransformPipeline
Return a pipeline applying the named transforms left-to-right.
available() -> list[str]
Return the registered transform names, sorted.
__contains__(name: str) -> bool
Return whether name is registered.
__len__() -> int
Return the number of registered transforms.
__repr__() -> str
Return a debug-friendly representation of the registry.
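To make the name-based lookup concrete, here is a rough sketch of how a [[label|name1|name2]] reference could be turned into a chain of registered transforms. Registry and parse_reference are hypothetical stand-ins written for this illustration. Two points are assumptions rather than documented behaviour: that lookups fold case the same way registration does, and that a reference is split on the | character.

```python
from typing import Callable, Dict, List, Tuple

Transform = Callable[[str, dict], str]  # dict stands in for the context


class Registry:
    """Toy name-to-transform mapping with case-folded keys."""

    def __init__(self) -> None:
        self._by_name: Dict[str, Transform] = {}

    def register(self, name: str, transform: Transform) -> None:
        self._by_name[name.lower()] = transform

    def get(self, name: str) -> Transform:
        # Raises KeyError for unknown names, as a plain dict lookup does.
        return self._by_name[name.lower()]

    def resolve_pipeline(self, names: List[str]) -> Transform:
        steps = [self.get(name) for name in names]

        def run(text: str, context: dict) -> str:
            for step in steps:  # left-to-right
                text = step(text, context)
            return text

        return run

    def available(self) -> List[str]:
        return sorted(self._by_name)


def parse_reference(ref: str) -> Tuple[str, List[str]]:
    """Split '[[label|name1|name2]]' into (label, [name1, name2])."""
    ref = ref.strip()
    if not (ref.startswith("[[") and ref.endswith("]]")):
        raise ValueError(f"not a prompt reference: {ref!r}")
    label, *names = ref[2:-2].split("|")
    return label, names


reg = Registry()
reg.register("Lower", lambda text, ctx: text.lower())
reg.register("capitalize", lambda text, ctx: text.capitalize())

label, names = parse_reference("[[verb|lower|capitalize]]")
rendered = reg.resolve_pipeline(names)("HELLO WORLD", {})
print(label, names, rendered)
```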
Text Transforms
Pure surface-string transforms. In addition to case transforms (lower,
upper, capitalize, title), this module provides MarkdownStripTransform
and RedditCleanupTransform for cleaning web/markdown text into plain prose,
and split_sentences for sentence segmentation (parser-backed when a
spaCy/Stanza config is given, with a regular-expression fallback otherwise).
text
Pure text transforms that require no external resources.
These transforms operate on the surface string and ignore the
TransformContext. They are always safe to register
regardless of language.
LowerTransform
Convert text to lowercase.
Examples:
>>> LowerTransform()("Hello World", TransformContext())
'hello world'
__call__(text: str, context: TransformContext) -> str
Apply str.lower to text.
UpperTransform
Convert text to uppercase.
Examples:
>>> UpperTransform()("Hello World", TransformContext())
'HELLO WORLD'
__call__(text: str, context: TransformContext) -> str
Apply str.upper to text.
CapitalizeTransform
Capitalize the first character, lowercase the rest.
Examples:
>>> CapitalizeTransform()("hELLO WORLD", TransformContext())
'Hello world'
__call__(text: str, context: TransformContext) -> str
Apply str.capitalize to text.
TitleTransform
Title-case each word.
Examples:
>>> TitleTransform()("hello world", TransformContext())
'Hello World'
__call__(text: str, context: TransformContext) -> str
Apply str.title to text.
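Because the case transforms delegate to Python's built-in string methods, they inherit those methods' quirks. The short check below uses only built-in str methods (no bead import) to contrast str.capitalize with str.title, including the well-known apostrophe behaviour of str.title.

```python
# Built-in str methods only; no bead import is needed for this check.
mixed = "hELLO wORLD"
capitalized = mixed.capitalize()  # first character upper, the rest lower
titled = mixed.title()            # each word start upper, the rest lower

# str.title treats any non-letter as a word boundary, apostrophes included,
# so contractions and possessives come out oddly cased.
contraction = "they're bill's friends".title()

print(capitalized)
print(titled)
print(contraction)
```

If apostrophes matter for a given prompt set, a custom transform registered under its own name may be a better fit than the title transform.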
MarkdownStripTransform
Strip common Markdown markup, keeping the human-readable text.
Removes link/image targets (keeping the visible text), emphasis markers, inline code backticks, heading markers, and blockquote markers.
Examples:
>>> MarkdownStripTransform()("**bold** and [a link](http://x)", TransformContext())
'bold and a link'
__call__(text: str, context: TransformContext) -> str
Strip Markdown markup from text.
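As an illustration of the kind of rewriting described above, the sketch below strips a handful of Markdown constructs with regular expressions. It is an approximation written for this page, not the transform's actual rule set: strip_markdown and its patterns are assumptions, it handles only asterisk emphasis, and it will also rewrite stray asterisk pairs in text such as "2 * 3 * 4".

```python
import re

# (pattern, replacement) pairs applied in order. Hypothetical rule set.
_RULES = [
    (re.compile(r"!?\[([^\]]*)\]\([^)]*\)"), r"\1"),      # links and images
    (re.compile(r"`([^`]*)`"), r"\1"),                     # inline code
    (re.compile(r"\*{1,3}([^*\n]+)\*{1,3}"), r"\1"),       # *em*, **strong**
    (re.compile(r"^[ ]{0,3}#{1,6}[ \t]+", re.MULTILINE), ""),  # headings
    (re.compile(r"^[ ]{0,3}>[ ]?", re.MULTILINE), ""),     # blockquotes
]


def strip_markdown(text: str) -> str:
    """Remove common Markdown markers while keeping the visible text."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


inline = strip_markdown("**bold** and [a link](http://x)")
block = strip_markdown("# Title\n> quoted `code`")
print(inline)
print(block)
```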
RedditCleanupTransform
Clean Reddit comment text into plain prose.
Unescapes HTML entities, strips Markdown (reusing
MarkdownStripTransform), removes URLs and [deleted]/
[removed] markers, and collapses runs of intra-line whitespace.
Examples:
>>> RedditCleanupTransform()("see [here](http://x) &amp; more", TransformContext())
'see here & more'
__call__(text: str, context: TransformContext) -> str
Clean Reddit markup from text.
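The cleanup steps listed above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the transform's implementation: clean_comment and its patterns are hypothetical, only Markdown links are stripped (rather than the full Markdown rule set), and trimming each line is an extra step added here for tidiness.

```python
import html
import re

_LINK = re.compile(r"!?\[([^\]]*)\]\([^)]*\)")    # [text](target)
_URL = re.compile(r"https?://\S+")                # bare URLs
_MARKER = re.compile(r"\[(?:deleted|removed)\]")  # moderation markers
_SPACES = re.compile(r"[ \t]+")                   # intra-line whitespace


def clean_comment(text: str) -> str:
    """Approximate the described cleanup on one comment body."""
    text = html.unescape(text)     # "&amp;" becomes "&"
    text = _LINK.sub(r"\1", text)  # keep the visible link text
    text = _URL.sub("", text)
    text = _MARKER.sub("", text)
    # Collapse whitespace within each line; line breaks are preserved.
    lines = [_SPACES.sub(" ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)


first = clean_comment("see [here](http://x) &amp; more")
second = clean_comment("[deleted]  visit   https://example.com now\nline  two")
print(first)
print(second)
```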
split_sentences(text: str, *, tokenizer_config: TokenizerConfig | None = None) -> tuple[str, ...]
Split text into sentences.
When tokenizer_config selects a spacy or stanza backend, sentence
boundaries come from that parser's segmenter. Otherwise a regular-expression
fallback splits on sentence-final punctuation followed by whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text to split. | required |
| tokenizer_config | TokenizerConfig \| None | Backend selector. | None |
Returns:
| Type | Description |
|---|---|
| tuple[str, ...] | The sentences, with surrounding whitespace stripped (empties dropped). |
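The regular-expression fallback can be pictured with a few lines of standard-library code. The pattern below is an assumption chosen to match the description (sentence-final punctuation followed by whitespace); the library's actual expression may differ, and split_sentences_regex is a hypothetical name.

```python
import re
from typing import Tuple

# Split at whitespace that directly follows ".", "!" or "?".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences_regex(text: str) -> Tuple[str, ...]:
    """Punctuation-based splitting; strips whitespace and drops empties."""
    parts = (part.strip() for part in _BOUNDARY.split(text))
    return tuple(part for part in parts if part)


sentences = split_sentences_regex("  It works. Does it?  Yes!  ")
# Known weakness of this approach: abbreviations look like sentence ends.
abbreviation = split_sentences_regex("Dr. Smith arrived.")
print(sentences)
print(abbreviation)
```

The abbreviation case is the usual reason to prefer a parser-backed segmenter when one is configured.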
Morphological Transforms
morphology
Morphological transforms backed by UniMorph paradigms.
Each MorphologicalTransform targets a specific inflectional
feature bundle (e.g. present participle) and applies the inflection to
the head token of the span text. Non-head tokens are preserved as-is,
producing natural multi-word results like "running to the store"
from a span "run to the store" with a gerund transform.
The system is language-agnostic at the protocol level: the same
MorphologicalTransform class works for any language supported
by UniMorph; the language is selected via language_code at
construction time.
InflectionSpec
dataclass
Specification for a target inflectional form.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Human-readable name (e.g. |
| predicate | FeaturePredicate | A callable that returns |
| description | str | Short description of the inflection. |
MorphologicalTransform
Apply a morphological inflection to the head token of span text.
Given a span like "run to the store" and an inflection spec
for the present participle, this transform produces "running to
the store" by inflecting only the head token (defaulting to
the first token when context.head_index is not set).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inflection_spec | InflectionSpec | Specifies which inflected form to target. | required |
| language_code | str | ISO 639 language code for UniMorph lookup. | required |
| lemmatize | bool | If | True |
Examples:
>>> spec = InflectionSpec(
... name="gerund",
... predicate=lambda f: (
... f.get("verb_form") == "V.PTCP" and f.get("tense") == "PRS"
... ),
... )
>>> t = MorphologicalTransform(spec, language_code="eng")
>>> ctx = TransformContext(
... lemma="run", head_index=0, tokens=["run", "to", "the", "store"]
... )
>>> t("run to the store", ctx)
'running to the store'
inflection_spec: InflectionSpec
property
The inflection specification for this transform.
__call__(text: str, context: TransformContext) -> str
Apply the inflection to the span text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The span text to transform. | required |
| context | TransformContext | Metadata about the span (lemma, head_index, etc.). | required |
Returns:
| Type | Description |
|---|---|
| str | Text with the head token inflected. Falls back to the original text if the inflection cannot be resolved. |
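The head-only inflection and the fallback can be sketched with a toy paradigm table in place of UniMorph. Everything here is illustrative: PARADIGMS, inflect_head, and is_gerund are hypothetical names, the table holds a single lemma, and the feature keys simply mirror the ones used in the class example above.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Features = Dict[str, str]

# Toy stand-in for UniMorph data: lemma -> [(inflected form, features)].
PARADIGMS: Dict[str, List[Tuple[str, Features]]] = {
    "run": [
        ("running", {"verb_form": "V.PTCP", "tense": "PRS"}),
        ("ran", {"tense": "PST"}),
    ],
}


def inflect_head(
    text: str,
    predicate: Callable[[Features], bool],
    lemma: Optional[str] = None,
    head_index: Optional[int] = None,
    tokens: Sequence[str] = (),
) -> str:
    """Inflect only the head token; return text unchanged on any miss."""
    words = list(tokens) or text.split()
    index = 0 if head_index is None else head_index  # default: first token
    if not 0 <= index < len(words):
        return text
    head_lemma = lemma or words[index]
    for form, features in PARADIGMS.get(head_lemma, []):
        if predicate(features):
            words[index] = form  # non-head tokens stay as they are
            return " ".join(words)
    return text  # no matching form: fall back to the original text


def is_gerund(features: Features) -> bool:
    return features.get("verb_form") == "V.PTCP" and features.get("tense") == "PRS"


first = inflect_head("run to the store", is_gerund)
second = inflect_head("quickly run home", is_gerund, head_index=1)
unknown = inflect_head("walk home", is_gerund)  # lemma missing from the table
print(first, "|", second, "|", unknown)
```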
__repr__() -> str
Return a debug representation.
register_morphological_transforms(registry: TransformRegistry, language_code: str) -> None
Register standard morphological transforms for a language.
Adds gerund, past_tense, past_participle,
present_3sg, and infinitive transforms backed by UniMorph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| registry | TransformRegistry | Registry to populate. | required |
| language_code | str | ISO 639 language code. | required |