bead.tokenization

Configurable multilingual tokenization for span annotation and UI display.

Configuration

config

Tokenizer configuration model.

TokenizerConfig

Bases: Model

Configuration for display-level tokenization.

Attributes:

- backend (TokenizerBackend): Tokenization backend to use. spacy (default) supports 49+ languages; stanza covers 80+ languages, including morphologically rich ones; whitespace is a simple fallback for pre-tokenized text.
- language (str): ISO 639 language code (e.g. "en", "zh").
- model_name (str | None): Explicit model name; auto-resolved when None.
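
The shape of the model can be sketched with a dataclass. This is a stand-in for illustration only: the real TokenizerConfig derives from bead's Model base class, and TokenizerConfigSketch and its defaults are assumptions mirroring the attribute list above.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Allowed backend names, per the backend attribute above.
TokenizerBackend = Literal["spacy", "stanza", "whitespace"]

@dataclass(frozen=True)
class TokenizerConfigSketch:
    backend: TokenizerBackend = "spacy"  # spacy | stanza | whitespace
    language: str = "en"                 # ISO 639 code, e.g. "en", "zh"
    model_name: Optional[str] = None     # auto-resolved when None

# e.g. Stanza for German, with the model auto-resolved:
cfg = TokenizerConfigSketch(backend="stanza", language="de")
```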

Tokenizers

tokenizers

Concrete tokenizer implementations.

Provides display-level tokenizers for span annotation. Each tokenizer converts raw text into a sequence of DisplayToken objects that carry rendering metadata (space_after) for artifact-free reconstruction.

DisplayToken

Bases: Model

A word-level token with rendering metadata.

Attributes:

- text (str): The token text.
- space_after (bool): Whether whitespace follows this token in the original text.
- start_char (int): Character offset of the token start in the original text.
- end_char (int): Character offset of the token end in the original text.

TokenizedText

Bases: Model

Result of display-level tokenization.

Attributes:

- tokens (tuple[DisplayToken, ...]): The sequence of display tokens.
- original_text (str): The original input text.

token_texts: tuple[str, ...] property

Plain token strings (for Item.tokenized_elements).

space_after_flags: tuple[bool, ...] property

Per-token space_after flags (for Item.token_space_after).

render() -> str

Reconstruct display text from tokens with correct spacing.

Round-tripping is lossless: the rendered string is identical to the original input text.

Returns:

- str: Reconstructed text.
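
The round-trip guarantee follows from carrying space_after on every token. A minimal sketch of the reconstruction logic, using a hypothetical Tok stand-in for DisplayToken:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tok:  # minimal stand-in for DisplayToken
    text: str
    space_after: bool

def render(tokens: list[Tok]) -> str:
    # Join token texts, emitting a space only where the original had one.
    return "".join(t.text + (" " if t.space_after else "") for t in tokens)

tokens = [Tok("Hello", False), Tok(",", True), Tok("world", False), Tok("!", False)]
render(tokens)  # "Hello, world!"
```

Note that punctuation attaches cleanly: the comma carries space_after=True while "Hello" does not, so no stray spaces appear.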

WhitespaceTokenizer

Simple whitespace-split tokenizer.

Fallback for pre-tokenized text or languages not supported by spaCy or Stanza. Splits on whitespace boundaries and infers space_after from the original character offsets.
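
Inferring space_after from character offsets can be sketched with re.finditer; the names below are illustrative, not the actual implementation:

```python
import re

def whitespace_tokenize(text: str) -> list[dict]:
    # Each regex match gives the token text plus its character span.
    tokens = []
    for m in re.finditer(r"\S+", text):
        tokens.append({
            "text": m.group(),
            "start_char": m.start(),
            "end_char": m.end(),
            # space_after: does whitespace follow this token?
            "space_after": m.end() < len(text) and text[m.end()].isspace(),
        })
    return tokens

whitespace_tokenize("foo  bar baz")
```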

__call__(text: str) -> TokenizedText

Tokenize text by splitting on whitespace.

Parameters:

- text (str, required): Input text.

Returns:

- TokenizedText: Tokenized result.

SpacyTokenizer

spaCy-based tokenizer.

Supports 49+ languages. Auto-resolves model from language code if model_name is not specified. Handles punctuation attachment and multi-word token (MWT) expansion correctly.

Parameters:

- language (str, default 'en'): ISO 639 language code.
- model_name (str | None, default None): Explicit spaCy model name. When None, uses {language}_core_web_sm for common languages, falling back to a blank model.

__call__(text: str) -> TokenizedText

Tokenize text using spaCy.

Parameters:

- text (str, required): Input text.

Returns:

- TokenizedText: Tokenized result with correct space_after metadata.

StanzaTokenizer

Stanza-based tokenizer.

Supports 80+ languages. Handles multi-word token (MWT) expansion for languages like German, French, and Arabic. Better coverage for low-resource and morphologically rich languages.

Parameters:

- language (str, default 'en'): ISO 639 language code.
- model_name (str | None, default None): Explicit Stanza model/package name. When None, uses the default package for the language.

__call__(text: str) -> TokenizedText

Tokenize text using Stanza.

Parameters:

- text (str, required): Input text.

Returns:

- TokenizedText: Tokenized result with correct space_after metadata.

create_tokenizer(config: TokenizerConfig) -> Callable[[str], TokenizedText]

Return a tokenization function for the given config.

Lazy-loads the NLP backend (spaCy/Stanza) on first call.

Parameters:

- config (TokenizerConfig, required): Tokenizer configuration.

Returns:

- Callable[[str], TokenizedText]: A callable that tokenizes text.

Raises:

- ValueError: If the backend is not recognized.
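
The dispatch-plus-lazy-import pattern described above can be sketched as follows. This is a simplified stand-in returning plain token strings rather than TokenizedText, and the model-loading details are assumptions; only the whitespace branch is exercised here:

```python
import re
from typing import Callable

def create_tokenizer_sketch(backend: str, language: str = "en") -> Callable[[str], list[str]]:
    if backend == "whitespace":
        return lambda text: re.findall(r"\S+", text)
    if backend == "spacy":
        def tokenize(text: str) -> list[str]:
            import spacy  # lazy import: the backend loads on first call
            nlp = spacy.blank(language)
            return [t.text for t in nlp(text)]
        return tokenize
    if backend == "stanza":
        def tokenize(text: str) -> list[str]:
            import stanza  # lazy import
            nlp = stanza.Pipeline(language, processors="tokenize")
            return [w.text for s in nlp(text).sentences for w in s.words]
        return tokenize
    raise ValueError(f"Unknown tokenizer backend: {backend!r}")

tokenizer = create_tokenizer_sketch("whitespace")
tokenizer("a b c")  # ['a', 'b', 'c']
```

Deferring the spacy/stanza imports to the returned closure keeps construction cheap when the heavy backends are never invoked.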

Display-to-Subword Alignment

alignment

Alignment between display tokens and subword model tokens.

Maps display-token-level span indices to subword-token indices so that active learning models can consume span annotations created in display-token space.

align_display_to_subword(display_tokens: list[str], subword_tokenizer: _PreTrainedTokenizerProtocol) -> list[list[int]]

Map each display token index to its corresponding subword token indices.

Parameters:

- display_tokens (list[str], required): Display-level token strings (word-level).
- subword_tokenizer (_PreTrainedTokenizerProtocol, required): A HuggingFace-compatible tokenizer with __call__ and convert_ids_to_tokens methods.

Returns:

- list[list[int]]: A list where entry[i] is the list of subword token indices for display token i. Special tokens (CLS, SEP, etc.) are excluded.
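
The shape of the returned mapping can be illustrated with a toy subword splitter; the real function works against a HuggingFace-compatible tokenizer, so everything below (the splitter and the sketch function) is a hypothetical stand-in:

```python
def toy_subword_split(word: str) -> list[str]:
    # Toy WordPiece-style splitter: 4-char chunks, continuations marked "##".
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def align_display_to_subword_sketch(display_tokens: list[str]) -> list[list[int]]:
    # entry[i] holds the subword indices produced by display token i,
    # in order; special tokens (CLS/SEP) are excluded by construction.
    alignment: list[list[int]] = []
    next_idx = 0
    for word in display_tokens:
        n = len(toy_subword_split(word))
        alignment.append(list(range(next_idx, next_idx + n)))
        next_idx += n
    return alignment

align_display_to_subword_sketch(["tokenization", "is", "fun"])
# [[0, 1, 2], [3], [4]]
```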

convert_span_indices(span_indices: list[int], alignment: list[list[int]]) -> list[int]

Convert display-token span indices to subword-token indices.

Parameters:

- span_indices (list[int], required): Display-token indices forming the span.
- alignment (list[list[int]], required): Alignment from align_display_to_subword.

Returns:

- list[int]: Corresponding subword-token indices.

Raises:

- IndexError: If any span index is out of range of the alignment.
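
Conceptually, the conversion flattens the alignment entries for the selected display tokens. A sketch consistent with the signature above, not the actual implementation:

```python
def convert_span_indices(span_indices: list[int], alignment: list[list[int]]) -> list[int]:
    # Out-of-range span indices raise IndexError via the list lookup.
    return [sub for i in span_indices for sub in alignment[i]]

alignment = [[0], [1, 2], [3], [4, 5]]
convert_span_indices([1, 2], alignment)  # [1, 2, 3]
```

So a span covering display tokens 1 and 2 expands to every subword those tokens produced.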