bead.tokenization

Configurable multilingual tokenization for span annotation and UI display.

Configuration

config

Tokenizer configuration model.

Aligned with the existing ChunkingSpec pattern in bead.items.item_template, which already supports parser: Literal["stanza", "spacy"].

TokenizerConfig

Bases: BaseModel

Configuration for display-level tokenization.

Controls how text is split into word-level tokens for span annotation and UI display. Supports multiple NLP backends for multilingual coverage.

Attributes:

Name Type Description
backend TokenizerBackend

Tokenization backend to use. "spacy" (default) supports 49+ languages and is fast and production-grade. "stanza" supports 80+ languages with better coverage for low-resource and morphologically rich languages. "whitespace" is a simple fallback for pre-tokenized text.

language str

ISO 639 language code (e.g. "en", "zh", "de", "ar").

model_name str | None

Explicit model name (e.g. "en_core_web_sm", "zh_core_web_sm"). When None, auto-resolved from language and backend.

Tokenizers

tokenizers

Concrete tokenizer implementations.

Provides display-level tokenizers for span annotation. Each tokenizer converts raw text into a sequence of DisplayToken objects that carry rendering metadata (space_after) for artifact-free reconstruction.

DisplayToken

Bases: BaseModel

A word-level token with rendering metadata.

Attributes:

Name Type Description
text str

The token text.

space_after bool

Whether whitespace follows this token in the original text.

start_char int

Character offset of the token start in the original text.

end_char int

Character offset of the token end in the original text.

TokenizedText

Bases: BaseModel

Result of display-level tokenization.

Attributes:

Name Type Description
tokens list[DisplayToken]

The sequence of display tokens.

original_text str

The original input text.

token_texts: list[str] property

Plain token strings (for Item.tokenized_elements).

Returns:

Type Description
list[str]

List of token text strings.

space_after_flags: list[bool] property

Per-token space_after flags (for Item.token_space_after).

Returns:

Type Description
list[bool]

List of boolean flags.

render() -> str

Reconstruct display text from tokens with correct spacing.

Guarantees output identical to the original text when tokenization is round-tripped.

Returns:

Type Description
str

Reconstructed text.
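
The round-trip guarantee rests on the space_after flag alone. A minimal sketch of the reconstruction logic, using a stand-in `Tok` dataclass in place of DisplayToken:

```python
from dataclasses import dataclass

@dataclass
class Tok:
    """Stand-in for DisplayToken (text + space_after only)."""
    text: str
    space_after: bool

def render(tokens: list[Tok]) -> str:
    # Emit each token followed by a single space iff space_after is set;
    # the final token normally carries False, so no trailing space appears.
    return "".join(t.text + (" " if t.space_after else "") for t in tokens)
```

With the tokens for "Hello, world!" this reproduces the original string exactly, including the absence of a space before the comma.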

WhitespaceTokenizer

Simple whitespace-split tokenizer.

Fallback for pre-tokenized text or languages not supported by spaCy or Stanza. Splits on whitespace boundaries and infers space_after from the original character offsets.

__call__(text: str) -> TokenizedText

Tokenize text by splitting on whitespace.

Parameters:

Name Type Description Default
text str

Input text.

required

Returns:

Type Description
TokenizedText

Tokenized result.
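
The whitespace strategy amounts to matching non-whitespace runs and recording their offsets. A self-contained sketch (plain dicts stand in for DisplayToken fields):

```python
import re

def whitespace_tokenize(text: str) -> list[dict]:
    """Split on whitespace, recording offsets and a space_after flag per token."""
    tokens = []
    for m in re.finditer(r"\S+", text):
        tokens.append({
            "text": m.group(),
            "start_char": m.start(),
            "end_char": m.end(),
            # space_after is True when whitespace follows this token.
            "space_after": m.end() < len(text) and text[m.end()].isspace(),
        })
    return tokens
```

Using character offsets from the match, rather than `str.split()`, is what lets runs of multiple spaces survive the round trip via start_char/end_char.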

SpacyTokenizer

spaCy-based tokenizer.

Supports 49+ languages. Auto-resolves model from language code if model_name is not specified. Handles punctuation attachment and multi-word token (MWT) expansion correctly.

Parameters:

Name Type Description Default
language str

ISO 639 language code.

'en'
model_name str | None

Explicit spaCy model name. When None, uses {language}_core_web_sm for common languages, falling back to a blank model.

None

__call__(text: str) -> TokenizedText

Tokenize text using spaCy.

Parameters:

Name Type Description Default
text str

Input text.

required

Returns:

Type Description
TokenizedText

Tokenized result with correct space_after metadata.
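
spaCy exposes exactly the metadata DisplayToken needs: `token.idx` for the character offset and `token.whitespace_` (an empty string or a single space) for space_after. A sketch of the mapping, using `spacy.blank()` so no model download is needed; the real implementation presumably loads a full pipeline such as `en_core_web_sm`:

```python
import spacy

def spacy_display_tokens(text: str, language: str = "en") -> list[dict]:
    """Map spaCy token attributes onto DisplayToken-style fields."""
    # blank() builds a tokenizer-only pipeline from the language defaults.
    nlp = spacy.blank(language)
    doc = nlp(text)
    return [
        {
            "text": tok.text,
            "space_after": bool(tok.whitespace_),  # "" or " " in spaCy
            "start_char": tok.idx,
            "end_char": tok.idx + len(tok.text),
        }
        for tok in doc
    ]
```

On "Hello, world!" this yields four tokens, with space_after False on "Hello" (the comma attaches directly) and True on the comma itself.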

StanzaTokenizer

Stanza-based tokenizer.

Supports 80+ languages. Handles multi-word token (MWT) expansion for languages like German, French, and Arabic. Better coverage for low-resource and morphologically rich languages.

Parameters:

Name Type Description Default
language str

ISO 639 language code.

'en'
model_name str | None

Explicit Stanza model/package name. When None, uses the default package for the language.

None

__call__(text: str) -> TokenizedText

Tokenize text using Stanza.

Parameters:

Name Type Description Default
text str

Input text.

required

Returns:

Type Description
TokenizedText

Tokenized result with correct space_after metadata.

create_tokenizer(config: TokenizerConfig) -> Callable[[str], TokenizedText]

Return a tokenization function for the given config.

Lazy-loads the NLP backend (spaCy/Stanza) on first call.

Parameters:

Name Type Description Default
config TokenizerConfig

Tokenizer configuration.

required

Returns:

Type Description
Callable[[str], TokenizedText]

A callable that tokenizes text.

Raises:

Type Description
ValueError

If the backend is not recognized.
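
The factory's dispatch-and-raise shape can be sketched as follows. This is an illustration, not bead's implementation: it takes a bare backend string instead of a TokenizerConfig and elides the lazy NLP-backend loading.

```python
from typing import Callable

def create_tokenizer_sketch(backend: str) -> Callable[[str], list[str]]:
    """Dispatch on the backend name; unknown values raise ValueError."""
    if backend == "whitespace":
        return str.split
    if backend in ("spacy", "stanza"):
        # The real factory would lazy-import the library here and build
        # a pipeline from the config's language / model_name.
        raise NotImplementedError("NLP backends elided in this sketch")
    raise ValueError(f"Unrecognized tokenizer backend: {backend!r}")
```

Raising ValueError for unknown backends (rather than silently falling back to whitespace) surfaces config typos at tokenizer-creation time instead of producing subtly wrong tokenizations.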

Display-to-Subword Alignment

alignment

Alignment between display tokens and subword model tokens.

Maps display-token-level span indices to subword-token indices so that active learning models can consume span annotations created in display-token space.

align_display_to_subword(display_tokens: list[str], subword_tokenizer: _PreTrainedTokenizerProtocol) -> list[list[int]]

Map each display token index to its corresponding subword token indices.

Parameters:

Name Type Description Default
display_tokens list[str]

Display-level token strings (word-level).

required
subword_tokenizer _PreTrainedTokenizerProtocol

A HuggingFace-compatible tokenizer with __call__ and convert_ids_to_tokens methods.

required

Returns:

Type Description
list[list[int]]

A list where entry[i] is the list of subword token indices for display token i. Special tokens (CLS, SEP, etc.) are excluded.
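
The bookkeeping behind the alignment can be shown without a real HuggingFace tokenizer. Here `toy_subwords` is a deliberately fake subword splitter (fixed-width chunks) standing in for the model tokenizer; only the index accounting mirrors the real function:

```python
def toy_subwords(word: str, max_len: int = 3) -> list[str]:
    """Stand-in for a real subword tokenizer: fixed-width chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]

def align_display_to_subword_sketch(display_tokens: list[str]) -> list[list[int]]:
    """entry[i] lists the subword indices produced by display token i."""
    alignment, cursor = [], 0
    for word in display_tokens:
        pieces = toy_subwords(word)
        alignment.append(list(range(cursor, cursor + len(pieces))))
        cursor += len(pieces)
    return alignment
```

For `["tokenization", "is", "fun"]` the fake splitter yields four pieces for the first word and one each for the rest, so the alignment is `[[0, 1, 2, 3], [4], [5]]`, with no special-token indices since the toy splitter emits none.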

convert_span_indices(span_indices: list[int], alignment: list[list[int]]) -> list[int]

Convert display-token span indices to subword-token indices.

Parameters:

Name Type Description Default
span_indices list[int]

Display-token indices forming the span.

required
alignment list[list[int]]

Alignment from align_display_to_subword.

required

Returns:

Type Description
list[int]

Corresponding subword-token indices.

Raises:

Type Description
IndexError

If any span index is out of range of the alignment.
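
The conversion is a flattening of the alignment over the span's indices. A minimal sketch, assuming the alignment produced by align_display_to_subword:

```python
def convert_span_indices_sketch(
    span_indices: list[int], alignment: list[list[int]]
) -> list[int]:
    """Flatten display-token indices into their subword indices, in order."""
    subword: list[int] = []
    for i in span_indices:
        # List indexing raises IndexError for indices past the end of
        # the alignment, matching the documented behavior.
        subword.extend(alignment[i])
    return subword
```

With `alignment = [[0, 1], [2], [3, 4]]`, a span over display tokens 1 and 2 maps to subword indices `[2, 3, 4]`.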