bead.tokenization¶
Configurable multilingual tokenization for span annotation and UI display.
Configuration¶
config
¶
Tokenizer configuration model.
Aligned with the existing `ChunkingSpec` pattern in `bead.items.item_template`,
which already supports `parser: Literal["stanza", "spacy"]`.
TokenizerConfig
¶
Bases: BaseModel
Configuration for display-level tokenization.
Controls how text is split into word-level tokens for span annotation and UI display. Supports multiple NLP backends for multilingual coverage.
Attributes:

| Name | Type | Description |
|---|---|---|
| `backend` | `TokenizerBackend` | Tokenization backend to use. `"spacy"` (default) supports 49+ languages and is fast and production-grade. `"stanza"` supports 80+ languages with better coverage for low-resource and morphologically rich languages. `"whitespace"` is a simple fallback for pre-tokenized text. |
| `language` | `str` | ISO 639 language code (e.g. `"en"`, `"zh"`, `"de"`, `"ar"`). |
| `model_name` | `str \| None` | Explicit model name (e.g. `"en_core_web_sm"`, `"zh_core_web_sm"`). When `None`, auto-resolved from `language` and `backend`. |
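A minimal construction sketch (the import path `bead.tokenization.config` is an assumption based on the module layout above):

```python
# Assumed import path; the section above only names the `config` module.
from bead.tokenization.config import TokenizerConfig

# Default backend: spaCy, with the model auto-resolved from the language code.
en_config = TokenizerConfig(backend="spacy", language="en")

# Stanza for broader low-resource coverage; model_name left to auto-resolution.
ar_config = TokenizerConfig(backend="stanza", language="ar")

# Whitespace fallback for pre-tokenized text.
ws_config = TokenizerConfig(backend="whitespace", language="de", model_name=None)
```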
Tokenizers¶
tokenizers
¶
Concrete tokenizer implementations.
Provides display-level tokenizers for span annotation. Each tokenizer
converts raw text into a sequence of `DisplayToken` objects that carry
rendering metadata (`space_after`) for artifact-free reconstruction.
DisplayToken
¶
Bases: BaseModel
A word-level token with rendering metadata.
Attributes:

| Name | Type | Description |
|---|---|---|
| `text` | `str` | The token text. |
| `space_after` | `bool` | Whether whitespace follows this token in the original text. |
| `start_char` | `int` | Character offset of the token start in the original text. |
| `end_char` | `int` | Character offset of the token end in the original text. |
TokenizedText
¶
Bases: BaseModel
Result of display-level tokenization.
Attributes:

| Name | Type | Description |
|---|---|---|
| `tokens` | `list[DisplayToken]` | The sequence of display tokens. |
| `original_text` | `str` | The original input text. |
token_texts: list[str]
property
¶
Plain token strings (for `Item.tokenized_elements`).

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of token text strings. |
space_after_flags: list[bool]
property
¶
Per-token `space_after` flags (for `Item.token_space_after`).

Returns:

| Type | Description |
|---|---|
| `list[bool]` | List of boolean flags. |
render() -> str
¶
Reconstruct display text from tokens with correct spacing.
Guarantees identical rendering to original when round-tripped.
Returns:

| Type | Description |
|---|---|
| `str` | Reconstructed text. |
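To make the round-trip guarantee concrete, a hand-built sketch (token values written by hand, not produced by any backend; import path assumed):

```python
from bead.tokenization.tokenizers import DisplayToken, TokenizedText  # assumed path

text = "Hello, world"
tokens = [
    DisplayToken(text="Hello", space_after=False, start_char=0, end_char=5),
    DisplayToken(text=",", space_after=True, start_char=5, end_char=6),
    DisplayToken(text="world", space_after=False, start_char=7, end_char=12),
]
tt = TokenizedText(tokens=tokens, original_text=text)

assert tt.render() == text                            # round-trip guarantee
assert tt.token_texts == ["Hello", ",", "world"]      # for Item.tokenized_elements
assert tt.space_after_flags == [False, True, False]   # for Item.token_space_after
```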
WhitespaceTokenizer
¶
Simple whitespace-split tokenizer.
Fallback for pre-tokenized text or languages not supported by spaCy
or Stanza. Splits on whitespace boundaries and infers space_after
from the original character offsets.
__call__(text: str) -> TokenizedText
¶
Tokenize text by splitting on whitespace.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
|---|---|
| `TokenizedText` | Tokenized result. |
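A short usage sketch (import path assumed):

```python
from bead.tokenization.tokenizers import WhitespaceTokenizer  # assumed path

tok = WhitespaceTokenizer()
result = tok("El gato duerme")

assert result.token_texts == ["El", "gato", "duerme"]
assert result.render() == "El gato duerme"  # spacing inferred from character offsets
```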
SpacyTokenizer
¶
spaCy-based tokenizer.
Supports 49+ languages. Auto-resolves model from language code if
model_name is not specified. Handles punctuation attachment and
multi-word token (MWT) expansion correctly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `language` | `str` | ISO 639 language code. | `'en'` |
| `model_name` | `str \| None` | Explicit spaCy model name. When `None`, auto-resolved from the language code. | `None` |
__call__(text: str) -> TokenizedText
¶
Tokenize text using spaCy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
|---|---|
| `TokenizedText` | Tokenized result with correct `space_after` metadata. |
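A usage sketch, assuming the matching spaCy model is installed (e.g. via `python -m spacy download en_core_web_sm`):

```python
from bead.tokenization.tokenizers import SpacyTokenizer  # assumed path

tok = SpacyTokenizer(language="en")   # model auto-resolved from the language code
result = tok("Don't panic, Arthur.")

# spaCy splits the contraction: ['Do', "n't", 'panic', ',', 'Arthur', '.']
print(result.token_texts)
assert result.render() == "Don't panic, Arthur."  # space_after preserves spacing
```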
StanzaTokenizer
¶
Stanza-based tokenizer.
Supports 80+ languages. Handles multi-word token (MWT) expansion for languages like German, French, and Arabic. Better coverage for low-resource and morphologically rich languages.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `language` | `str` | ISO 639 language code. | `'en'` |
| `model_name` | `str \| None` | Explicit Stanza model/package name. When `None`, uses the default package for the language. | `None` |
__call__(text: str) -> TokenizedText
¶
Tokenize text using Stanza.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text. | *required* |

Returns:

| Type | Description |
|---|---|
| `TokenizedText` | Tokenized result with correct `space_after` metadata. |
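A usage sketch, assuming the Stanza resources for the language have been downloaded:

```python
import stanza
from bead.tokenization.tokenizers import StanzaTokenizer  # assumed path

stanza.download("de")                 # one-time resource download
tok = StanzaTokenizer(language="de")
result = tok("Wir gehen zum Markt.")

# Exact output depends on the model and on how MWTs such as "zum" are expanded.
print(result.token_texts)
```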
create_tokenizer(config: TokenizerConfig) -> Callable[[str], TokenizedText]
¶
Return a tokenization function for the given config.
Lazy-loads the NLP backend (spaCy/Stanza) on first call.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `TokenizerConfig` | Tokenizer configuration. | *required* |

Returns:

| Type | Description |
|---|---|
| `Callable[[str], TokenizedText]` | A callable that tokenizes text. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the backend is not recognized. |
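Putting config and factory together (import paths assumed):

```python
from bead.tokenization.config import TokenizerConfig        # assumed path
from bead.tokenization.tokenizers import create_tokenizer   # assumed path

config = TokenizerConfig(backend="whitespace", language="en")
tokenize = create_tokenizer(config)   # NLP backend is lazy-loaded on first call

result = tokenize("annotate this span")
assert result.token_texts == ["annotate", "this", "span"]
```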
Display-to-Subword Alignment¶
alignment
¶
Alignment between display tokens and subword model tokens.
Maps display-token-level span indices to subword-token indices so that active learning models can consume span annotations created in display-token space.
align_display_to_subword(display_tokens: list[str], subword_tokenizer: _PreTrainedTokenizerProtocol) -> list[list[int]]
¶
Map each display token index to its corresponding subword token indices.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `display_tokens` | `list[str]` | Display-level token strings (word-level). | *required* |
| `subword_tokenizer` | `_PreTrainedTokenizerProtocol` | A HuggingFace-compatible tokenizer implementing the `_PreTrainedTokenizerProtocol` interface. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[list[int]]` | A list where element `i` contains the subword-token indices for display token `i`. |
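An illustrative sketch with a HuggingFace tokenizer (the model choice and the exact index values are illustrative only):

```python
from transformers import AutoTokenizer
from bead.tokenization.alignment import align_display_to_subword  # assumed path

hf_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
display_tokens = ["unbelievable", "results"]

alignment = align_display_to_subword(display_tokens, hf_tok)

# alignment[i] lists the subword indices for display token i,
# e.g. [[0, 1, 2], [3]] if "unbelievable" splits into three WordPiece units.
print(alignment)
```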
convert_span_indices(span_indices: list[int], alignment: list[list[int]]) -> list[int]
¶
Convert display-token span indices to subword-token indices.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `span_indices` | `list[int]` | Display-token indices forming the span. | *required* |
| `alignment` | `list[list[int]]` | Alignment from `align_display_to_subword`. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[int]` | Corresponding subword-token indices. |

Raises:

| Type | Description |
|---|---|
| `IndexError` | If any span index is out of range of the alignment. |
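A pure-Python sketch of the index conversion (alignment values invented for illustration):

```python
from bead.tokenization.alignment import convert_span_indices  # assumed path

# As produced by align_display_to_subword: display token i -> subword indices.
alignment = [[0], [1, 2], [3], [4, 5]]

# A span over display tokens 1-2 covers subword tokens 1, 2, and 3.
assert convert_span_indices([1, 2], alignment) == [1, 2, 3]

# convert_span_indices([9], alignment)  # would raise IndexError (out of range)
```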