bead.tokenization

Configurable multilingual tokenization for span annotation and UI display.

Configuration

config

Tokenizer configuration model.

Aligned with the existing ChunkingSpec pattern in bead.items.item_template, which already supports parser: Literal["stanza", "spacy"].

TokenizerConfig

Bases: BaseModel

Configuration for display-level tokenization.

Controls how text is split into word-level tokens for span annotation and UI display. Supports multiple NLP backends for multilingual coverage.

Attributes:

Name Type Description
backend TokenizerBackend

Tokenization backend to use. "spacy" (default) supports 49+ languages and is fast and production-grade. "stanza" supports 80+ languages with better coverage for low-resource and morphologically rich languages. "whitespace" is a simple fallback for pre-tokenized text.

language str

ISO 639 language code (e.g. "en", "zh", "de", "ar").

model_name str | None

Explicit model name (e.g. "en_core_web_sm", "zh_core_web_sm"). When None, auto-resolved from language and backend.

Tokenizers

tokenizers

Concrete tokenizer implementations.

Provides display-level tokenizers for span annotation. Each tokenizer converts raw text into a sequence of DisplayToken objects that carry rendering metadata (space_after) for artifact-free reconstruction.

DisplayToken

Bases: BaseModel

A word-level token with rendering metadata.

Attributes:

Name Type Description
text str

The token text.

space_after bool

Whether whitespace follows this token in the original text.

start_char int

Character offset of the token start in the original text.

end_char int

Character offset of the token end in the original text.

TokenizedText

Bases: BaseModel

Result of display-level tokenization.

Attributes:

Name Type Description
tokens list[DisplayToken]

The sequence of display tokens.

original_text str

The original input text.

token_texts: list[str] property

Plain token strings (for Item.tokenized_elements).

Returns:

Type Description
list[str]

List of token text strings.

space_after_flags: list[bool] property

Per-token space_after flags (for Item.token_space_after).

Returns:

Type Description
list[bool]

List of boolean flags.

render() -> str

Reconstruct display text from tokens with correct spacing.

Guarantees output identical to the original text when tokenization is round-tripped.

Returns:

Type Description
str

Reconstructed text.
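
The round-trip guarantee rests on the space_after flag alone. A minimal sketch of the reconstruction logic, using a stand-in `Tok` dataclass in place of DisplayToken:

```python
from dataclasses import dataclass

@dataclass
class Tok:
    """Stand-in for DisplayToken (text + space_after only)."""
    text: str
    space_after: bool

def render(tokens: list[Tok]) -> str:
    # Emit each token followed by a single space iff space_after is set;
    # the final token normally carries False, so no trailing space appears.
    return "".join(t.text + (" " if t.space_after else "") for t in tokens)
```

With the tokens for "Hello, world!" this reproduces the original string exactly, including the absence of a space before the comma.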

WhitespaceTokenizer

Simple whitespace-split tokenizer.

Fallback for pre-tokenized text or languages not supported by spaCy or Stanza. Splits on whitespace boundaries and infers space_after from the original character offsets.

__call__(text: str) -> TokenizedText

Tokenize text by splitting on whitespace.

Parameters:

Name Type Description Default
text str

Input text.

required

Returns:

Type Description
TokenizedText

Tokenized result.
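
The whitespace strategy amounts to matching non-whitespace runs and recording their offsets. A self-contained sketch (plain dicts stand in for DisplayToken fields):

```python
import re

def whitespace_tokenize(text: str) -> list[dict]:
    """Split on whitespace, recording offsets and a space_after flag per token."""
    tokens = []
    for m in re.finditer(r"\S+", text):
        tokens.append({
            "text": m.group(),
            "start_char": m.start(),
            "end_char": m.end(),
            # space_after is True when whitespace follows this token.
            "space_after": m.end() < len(text) and text[m.end()].isspace(),
        })
    return tokens
```

Using character offsets from the match, rather than `str.split()`, is what lets runs of multiple spaces survive the round trip via start_char/end_char.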

SpacyTokenizer

spaCy-based tokenizer.

Supports 49+ languages. Auto-resolves model from language code if model_name is not specified. Handles punctuation attachment and multi-word token (MWT) expansion correctly.

Parameters:

Name Type Description Default
language str

ISO 639 language code.

'en'
model_name str | None

Explicit spaCy model name. When None, uses {language}_core_web_sm for common languages, falling back to a blank model.

None

__call__(text: str) -> TokenizedText

Tokenize text using spaCy.

Parameters:

Name Type Description Default
text str

Input text.

required

Returns:

Type Description
TokenizedText

Tokenized result with correct space_after metadata.
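
spaCy exposes exactly the metadata DisplayToken needs: `token.idx` for the character offset and `token.whitespace_` (an empty string or a single space) for space_after. A sketch of the mapping, using `spacy.blank()` so no model download is needed; the real implementation presumably loads a full pipeline such as `en_core_web_sm`:

```python
import spacy

def spacy_display_tokens(text: str, language: str = "en") -> list[dict]:
    """Map spaCy token attributes onto DisplayToken-style fields."""
    # blank() builds a tokenizer-only pipeline from the language defaults.
    nlp = spacy.blank(language)
    doc = nlp(text)
    return [
        {
            "text": tok.text,
            "space_after": bool(tok.whitespace_),  # "" or " " in spaCy
            "start_char": tok.idx,
            "end_char": tok.idx + len(tok.text),
        }
        for tok in doc
    ]
```

On "Hello, world!" this yields four tokens, with space_after False on "Hello" (the comma attaches directly) and True on the comma itself.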

StanzaTokenizer

Stanza-based tokenizer.

Supports 80+ languages. Handles multi-word token (MWT) expansion for languages like German, French, and Arabic. Better coverage for low-resource and morphologically rich languages.

Parameters:

Name Type Description Default
language str

ISO 639 language code.

'en'
model_name str | None

Explicit Stanza model/package name. When None, uses the default package for the language.

None

__call__(text: str) -> TokenizedText

Tokenize text using Stanza.

Parameters:

Name Type Description Default
text str

Input text.

required

Returns:

Type Description
TokenizedText

Tokenized result with correct space_after metadata.

create_tokenizer(config: TokenizerConfig) -> Callable[[str], TokenizedText]

Return a tokenization function for the given config.

Lazy-loads the NLP backend (spaCy/Stanza) on first call.

Parameters:

Name Type Description Default
config TokenizerConfig

Tokenizer configuration.

required

Returns:

Type Description
Callable[[str], TokenizedText]

A callable that tokenizes text.

Raises:

Type Description
ValueError

If the backend is not recognized.
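
The factory's dispatch-and-raise shape can be sketched as follows. This is an illustration, not bead's implementation: it takes a bare backend string instead of a TokenizerConfig and elides the lazy NLP-backend loading.

```python
from typing import Callable

def create_tokenizer_sketch(backend: str) -> Callable[[str], list[str]]:
    """Dispatch on the backend name; unknown values raise ValueError."""
    if backend == "whitespace":
        return str.split
    if backend in ("spacy", "stanza"):
        # The real factory would lazy-import the library here and build
        # a pipeline from the config's language / model_name.
        raise NotImplementedError("NLP backends elided in this sketch")
    raise ValueError(f"Unrecognized tokenizer backend: {backend!r}")
```

Raising ValueError for unknown backends (rather than silently falling back to whitespace) surfaces config typos at tokenizer-creation time instead of producing subtly wrong tokenizations.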

Display-to-Subword Alignment

alignment

Alignment between display tokens and subword model tokens.

Maps display-token-level span indices to subword-token indices so that active learning models can consume span annotations created in display-token space.

align_display_to_subword(display_tokens: list[str], subword_tokenizer: _PreTrainedTokenizerProtocol) -> list[list[int]]

Map each display token index to its corresponding subword token indices.

Parameters:

Name Type Description Default
display_tokens list[str]

Display-level token strings (word-level).

required
subword_tokenizer _PreTrainedTokenizerProtocol

A HuggingFace-compatible tokenizer with __call__ and convert_ids_to_tokens methods.

required

Returns:

Type Description
list[list[int]]

A list where entry[i] is the list of subword token indices for display token i. Special tokens (CLS, SEP, etc.) are excluded.
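
The bookkeeping behind the alignment can be shown without a real HuggingFace tokenizer. Here `toy_subwords` is a deliberately fake subword splitter (fixed-width chunks) standing in for the model tokenizer; only the index accounting mirrors the real function:

```python
def toy_subwords(word: str, max_len: int = 3) -> list[str]:
    """Stand-in for a real subword tokenizer: fixed-width chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]

def align_display_to_subword_sketch(display_tokens: list[str]) -> list[list[int]]:
    """entry[i] lists the subword indices produced by display token i."""
    alignment, cursor = [], 0
    for word in display_tokens:
        pieces = toy_subwords(word)
        alignment.append(list(range(cursor, cursor + len(pieces))))
        cursor += len(pieces)
    return alignment
```

For `["tokenization", "is", "fun"]` the fake splitter yields four pieces for the first word and one each for the rest, so the alignment is `[[0, 1, 2, 3], [4], [5]]`, with no special-token indices since the toy splitter emits none.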

convert_span_indices(span_indices: list[int], alignment: list[list[int]]) -> list[int]

Convert display-token span indices to subword-token indices.

Parameters:

Name Type Description Default
span_indices list[int]

Display-token indices forming the span.

required
alignment list[list[int]]

Alignment from align_display_to_subword.

required

Returns:

Type Description
list[int]

Corresponding subword-token indices.

Raises:

Type Description
IndexError

If any span index is out of range of the alignment.
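
The conversion is a flattening of the alignment over the span's indices. A minimal sketch, assuming the alignment produced by align_display_to_subword:

```python
def convert_span_indices_sketch(
    span_indices: list[int], alignment: list[list[int]]
) -> list[int]:
    """Flatten display-token indices into their subword indices, in order."""
    subword: list[int] = []
    for i in span_indices:
        # List indexing raises IndexError for indices past the end of
        # the alignment, matching the documented behavior.
        subword.extend(alignment[i])
    return subword
```

With `alignment = [[0, 1], [2], [3, 4]]`, a span over display tokens 1 and 2 maps to subword indices `[2, 3, 4]`.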