bead.tokenization

Configurable multilingual tokenization for span annotation and UI display.

Configuration

config

Tokenizer configuration model.

TokenizerConfig

Bases: Model

Configuration for display-level tokenization.

Attributes:

Name Type Description
backend TokenizerBackend

Tokenization backend to use. spacy (default) supports 49+ languages. stanza covers 80+ languages including morphologically rich ones. whitespace is a simple fallback for pre-tokenized text.

language str

ISO 639 language code (e.g. "en", "zh").

model_name str | None

Explicit model name; auto-resolved when None.

Tokenizers

tokenizers

Concrete tokenizer implementations.

Provides display-level tokenizers for span annotation. Each tokenizer converts raw text into a sequence of DisplayToken objects that carry rendering metadata (space_after) for artifact-free reconstruction.

DisplayToken

Bases: Model

A word-level token with rendering metadata.

Attributes:

Name Type Description
text str

The token text.

space_after bool

Whether whitespace follows this token in the original text.

start_char int

Character offset of the token start in the original text.

end_char int

Character offset of the token end in the original text.

TokenizedText

Bases: Model

Result of display-level tokenization.

Attributes:

Name Type Description
tokens tuple[DisplayToken, ...]

The sequence of display tokens.

original_text str

The original input text.

token_texts: tuple[str, ...] property

Plain token strings (for Item.tokenized_elements).

space_after_flags: tuple[bool, ...] property

Per-token space_after flags (for Item.token_space_after).

render() -> str

Reconstruct display text from tokens with correct spacing.

Round-tripping is guaranteed to render text identical to the original.

Returns:

Type Description
str

Reconstructed text.
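As a rough illustration of how space_after flags drive reconstruction, the sketch below joins token texts and emits a single space after each flagged token. The Token dataclass and the render function here are hypothetical stand-ins written for this example, not bead's DisplayToken or TokenizedText.render, whose internals are not shown on this page.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for DisplayToken (not the bead class)."""

    text: str
    space_after: bool


def render(tokens: tuple[Token, ...]) -> str:
    """Join token texts, adding one space after each token flagged space_after."""
    return "".join(t.text + (" " if t.space_after else "") for t in tokens)


tokens = (
    Token("Hello", False),
    Token(",", True),
    Token("world", False),
    Token("!", False),
)
rendered = render(tokens)
print(rendered)  # Hello, world!
```

Because punctuation carries space_after=False on the preceding token, no stray space appears before the comma or the exclamation mark.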

WhitespaceTokenizer

Simple whitespace-split tokenizer.

Fallback for pre-tokenized text or languages not supported by spaCy or Stanza. Splits on whitespace boundaries and infers space_after from the original character offsets.

__call__(text: str) -> TokenizedText

Tokenize text by splitting on whitespace.

Parameters:

Name Type Description Default
text str

Input text.

required

Returns:

Type Description
TokenizedText

Tokenized result.
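The idea of splitting on whitespace while inferring space_after from character offsets might look like the following minimal sketch. The Token dataclass and whitespace_tokenize are assumptions made for this example; they only mirror the documented fields and are not the WhitespaceTokenizer implementation.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in mirroring the documented DisplayToken fields."""

    text: str
    space_after: bool
    start_char: int
    end_char: int


def whitespace_tokenize(text: str) -> list[Token]:
    """Split on runs of non-whitespace and record offsets for each token."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        end = match.end()
        # A token is followed by whitespace if the next character exists
        # and is a whitespace character.
        space_after = end < len(text) and text[end].isspace()
        tokens.append(Token(match.group(), space_after, match.start(), end))
    return tokens


ws_tokens = whitespace_tokenize("the cat  sat")
for tok in ws_tokens:
    print(tok)
```

Note that a boolean flag cannot distinguish one space from two, so this sketch records that whitespace follows "cat" without recording how much.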

SpacyTokenizer

spaCy-based tokenizer.

Supports 49+ languages. Auto-resolves model from language code if model_name is not specified. Handles punctuation attachment and multi-word token (MWT) expansion correctly.

Parameters:

Name Type Description Default
language str

ISO 639 language code.

'en'
model_name str | None

Explicit spaCy model name. When None, uses {language}_core_web_sm for common languages, falling back to a blank model.

None

__call__(text: str) -> TokenizedText

Tokenize text using spaCy.

Parameters:

Name Type Description Default
text str

Input text.

required

Returns:

Type Description
TokenizedText

Tokenized result with correct space_after metadata.

StanzaTokenizer

Stanza-based tokenizer.

Supports 80+ languages. Handles multi-word token (MWT) expansion for languages like German, French, and Arabic. Better coverage for low-resource and morphologically rich languages.

Parameters:

Name Type Description Default
language str

ISO 639 language code.

'en'
model_name str | None

Explicit Stanza model/package name. When None, uses the default package for the language.

None

__call__(text: str) -> TokenizedText

Tokenize text using Stanza.

Parameters:

Name Type Description Default
text str

Input text.

required

Returns:

Type Description
TokenizedText

Tokenized result with correct space_after metadata.

spacy_space_after(token: _SpacyTokenProtocol) -> bool

Whether whitespace follows a spaCy token in the source text.

Shared by SpacyTokenizer and SpacyParser, so the space_after logic is defined in exactly one place.
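One plausible way to derive this flag is from the whitespace_ attribute that spaCy tokens expose (the trailing whitespace string, empty when none follows). The sketch below assumes that approach; the HasWhitespace protocol, the space_after function, and FakeToken are hypothetical names for this example, and bead's own helper may be written differently.

```python
from typing import Protocol


class HasWhitespace(Protocol):
    """Hypothetical minimal protocol: spaCy's Token exposes `whitespace_`."""

    whitespace_: str


def space_after(token: HasWhitespace) -> bool:
    """True when the token is followed by whitespace in the source text."""
    return bool(token.whitespace_)


class FakeToken:
    """Test double so the sketch runs without spaCy installed."""

    def __init__(self, whitespace_: str) -> None:
        self.whitespace_ = whitespace_


print(space_after(FakeToken(" ")))
print(space_after(FakeToken("")))
```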

create_tokenizer(config: TokenizerConfig) -> Callable[[str], TokenizedText]

Return a tokenization function for the given config.

Lazy-loads the NLP backend (spaCy/Stanza) on first call.

Parameters:

Name Type Description Default
config TokenizerConfig

Tokenizer configuration.

required

Returns:

Type Description
Callable[[str], TokenizedText]

A callable that tokenizes text.

Raises:

Type Description
ValueError

If the backend is not recognized.
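The lazy-loading factory pattern described here can be sketched as follows. This assumes that "first call" means the first call of the returned tokenizer rather than of the factory; load_backend, make_tokenizer, and load_count are hypothetical names, and str.split stands in for a real spaCy or Stanza pipeline.

```python
from __future__ import annotations

from typing import Callable

load_count = 0  # counts how often the (pretend) expensive backend is loaded


def load_backend(name: str) -> Callable[[str], list[str]]:
    """Hypothetical loader standing in for importing spaCy or Stanza."""
    global load_count
    load_count += 1
    return str.split


def make_tokenizer(backend: str) -> Callable[[str], list[str]]:
    """Return a tokenizer; heavy backends are loaded on first use only."""
    if backend not in {"spacy", "stanza", "whitespace"}:
        raise ValueError(f"Unknown tokenizer backend: {backend!r}")
    if backend == "whitespace":
        return str.split  # nothing expensive to load

    cached: list[Callable[[str], list[str]]] = []

    def tokenize(text: str) -> list[str]:
        if not cached:  # load once, then reuse on every later call
            cached.append(load_backend(backend))
        return cached[0](text)

    return tokenize


tok = make_tokenizer("spacy")
count_before = load_count  # still 0: creating the tokenizer loads nothing
first = tok("a b")
second = tok("c d")
print(count_before, load_count, first, second)
```

Deferring the load keeps construction cheap, which matters when a configuration is built but the tokenizer is never invoked.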

Dependency Parsing

Dependency parsers (spaCy, Stanza) produce a per-sentence ParsedSentence of ParsedToken records. parse_to_spans then projects a parse onto the standoff Span + SpanRelation models used by bead.items.Item. Each token becomes one single-token Span, which carries its governor as head_index and its upos/xpos/lemma/deprel/morphology plus character offsets in span_metadata. Each syntactic arc becomes one directed head-to-dependent SpanRelation labeled with the dependency relation.

parsers

Dependency parsing into standoff spans.

Provides dependency parsers (spaCy, Stanza) that produce a per-sentence ParsedSentence of ParsedToken records (token, lemma, upos, xpos, morphological features, head, deprel), and parse_to_spans which projects a parse onto bead's standoff Span + SpanRelation models.

The projection is deliberately aligned with the layers linguistic annotation model, so a parse stored on an Item carries every field a layers dependency AnnotationLayer/Annotation needs. Each token becomes a single-token Span whose head_index is its governor and whose span_metadata carries upos/xpos/lemma/deprel/formalism/tool plus morphological features and character offsets. Each syntactic arc becomes a directed SpanRelation from head to dependent, labeled with the dependency relation. The conventions below (Universal Dependencies labels, head -> dependent arc direction, retained character offsets) keep that mapping lossless without coupling bead to layers' wire format.

DependencyParser

Bases: Protocol

A callable that dependency-parses text into sentences.

Carries a tool identifier recorded in the layers-aligned provenance of any spans projected from its output.

__call__(text: str) -> tuple[ParsedSentence, ...]

Dependency-parse text into sentences.

ParsedToken

Bases: Model

A dependency-parsed token.

A superset of DisplayToken: it adds the syntactic and morphological fields produced by a dependency parser. Indices are sentence-local and 0-based; head is the 0-based index of the governor token, or None for the sentence root.

Attributes:

Name Type Description
index int

Sentence-local 0-based token index.

text str

Surface form of the token.

lemma str | None

Lemma of the token.

upos str | None

Universal part-of-speech tag.

xpos str | None

Language-specific (treebank) part-of-speech tag.

deprel str | None

Dependency relation of the token to its head.

head int | None

Sentence-local 0-based index of the governor token; None for the root.

morph dict[str, str]

Morphological features (e.g. {"Number": "Sing"}).

space_after bool

Whether whitespace follows this token in the source text.

start_char int

Character offset of the token start in the sentence text.

end_char int

Character offset of the token end in the sentence text.

ParsedSentence

Bases: Model

A single dependency-parsed sentence.

Attributes:

Name Type Description
original_text str

The sentence text.

tokens tuple[ParsedToken, ...]

The parsed tokens, in order.

SpacyParser

spaCy-based dependency parser.

Loads a spaCy pipeline with tagger, parser, lemmatizer, and morphologizer components and yields one ParsedSentence per sentence.

Parameters:

Name Type Description Default
language str

ISO 639 language code.

'en'
model_name str | None

Explicit spaCy model name. When None, uses {language}_core_web_sm.

None

__call__(text: str) -> tuple[ParsedSentence, ...]

Parse text into dependency-parsed sentences.

Parameters:

Name Type Description Default
text str

Input text (may contain multiple sentences).

required

Returns:

Type Description
tuple[ParsedSentence, ...]

One ParsedSentence per detected sentence.

StanzaParser

Stanza-based dependency parser.

Loads a Stanza pipeline with tokenize,pos,lemma,depparse processors and yields one ParsedSentence per sentence.

Parameters:

Name Type Description Default
language str

ISO 639 language code.

'en'
model_name str | None

Explicit Stanza package name. When None, uses the default package.

None

__call__(text: str) -> tuple[ParsedSentence, ...]

Parse text into dependency-parsed sentences.

Parameters:

Name Type Description Default
text str

Input text (may contain multiple sentences).

required

Returns:

Type Description
tuple[ParsedSentence, ...]

One ParsedSentence per detected sentence.

create_parser(config: TokenizerConfig) -> DependencyParser

Return a dependency-parsing function for the given config.

Parameters:

Name Type Description Default
config TokenizerConfig

Tokenizer configuration. The backend selects the parser; the whitespace backend cannot parse and raises.

required

Returns:

Type Description
DependencyParser

A callable that dependency-parses text into sentences.

Raises:

Type Description
ValueError

If the backend cannot produce a dependency parse.

parse_to_spans(sentence: ParsedSentence, *, element_name: str = 'text', tokenization_id: str, formalism: str = UNIVERSAL_DEPENDENCIES, tool: str) -> tuple[tuple[Span, ...], tuple[SpanRelation, ...]]

Project a parsed sentence onto standoff spans and relations.

Each token becomes a single-token Span (span_type == "token") whose head_index is the governor index and whose span_metadata carries the layers-aligned fields. Each non-root token contributes one directed SpanRelation from its head (source) to itself (target), labeled with the dependency relation. This function is the single canonical owner of the span_id scheme and the head -> dependent arc direction.

Parameters:

Name Type Description Default
sentence ParsedSentence

The parsed sentence to project.

required
element_name str

Rendered-element name the token indices refer to.

'text'
tokenization_id str

Stable identifier of the tokenization these tokens belong to (mirrors layers' TokenRef.tokenization_id). Recorded in each span's metadata.

required
formalism str

Dependency formalism slug (default "universal-dependencies").

UNIVERSAL_DEPENDENCIES
tool str

Identifier of the parser that produced the analysis.

required

Returns:

Type Description
tuple[tuple[Span, ...], tuple[SpanRelation, ...]]

The token spans and the dependency-arc relations.
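The projection described above (one span per token, one head-to-dependent relation per non-root token) can be illustrated with plain dictionaries. This is a sketch under stated assumptions: Tok and project are hypothetical, the dictionaries are not bead's Span or SpanRelation models, and relations refer to token indices here because the real span_id scheme is not shown on this page.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in for ParsedToken (subset of its fields)."""

    index: int
    text: str
    head: int | None  # None marks the sentence root
    deprel: str


def project(tokens: list[Tok]) -> tuple[list[dict], list[dict]]:
    """Build one token span per token and one arc per non-root token."""
    spans = [
        {
            "span_type": "token",
            "index": t.index,
            "head_index": t.head,
            "span_metadata": {"deprel": t.deprel},
        }
        for t in tokens
    ]
    # Arcs run from the head (source) to the dependent (target).
    relations = [
        {"source": t.head, "target": t.index, "label": t.deprel}
        for t in tokens
        if t.head is not None
    ]
    return spans, relations


sent = [
    Tok(0, "Dogs", 1, "nsubj"),
    Tok(1, "bark", None, "root"),
    Tok(2, "loudly", 1, "advmod"),
]
spans, relations = project(sent)
print(relations)
```

A sentence with n tokens and a single root therefore yields n spans and n - 1 relations.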

Display-to-Subword Alignment

alignment

Alignment between display tokens and subword model tokens.

Maps display-token-level span indices to subword-token indices so that active learning models can consume span annotations created in display-token space.

align_display_to_subword(display_tokens: list[str], subword_tokenizer: _PreTrainedTokenizerProtocol) -> list[list[int]]

Map each display token index to its corresponding subword token indices.

Parameters:

Name Type Description Default
display_tokens list[str]

Display-level token strings (word-level).

required
subword_tokenizer _PreTrainedTokenizerProtocol

A HuggingFace-compatible tokenizer with __call__ and convert_ids_to_tokens methods.

required

Returns:

Type Description
list[list[int]]

A list where entry[i] is the list of subword token indices for display token i. Special tokens (CLS, SEP, etc.) are excluded.
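To make the shape of the returned alignment concrete, here is a toy sketch that does not use a real HuggingFace tokenizer. It assumes each display token is split into subword pieces independently and that the encoded sequence starts with one special token; toy_pieces, align, and n_leading_special are hypothetical, and the actual algorithm used by align_display_to_subword is not documented here.

```python
from __future__ import annotations


def toy_pieces(word: str) -> list[str]:
    """Hypothetical subword splitter: fixed 3-character chunks."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def align(display_tokens: list[str], n_leading_special: int = 1) -> list[list[int]]:
    """Assign consecutive subword positions to each display token."""
    alignment = []
    cursor = n_leading_special  # skip e.g. a leading [CLS] position
    for word in display_tokens:
        n_pieces = len(toy_pieces(word))
        alignment.append(list(range(cursor, cursor + n_pieces)))
        cursor += n_pieces
    return alignment


toy_alignment = align(["tokenization", "is", "fun"])
print(toy_alignment)  # [[1, 2, 3, 4], [5], [6]]
```

Position 0 never appears in the output because it belongs to the assumed leading special token, matching the documented exclusion of special tokens.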

convert_span_indices(span_indices: list[int], alignment: list[list[int]]) -> list[int]

Convert display-token span indices to subword-token indices.

Parameters:

Name Type Description Default
span_indices list[int]

Display-token indices forming the span.

required
alignment list[list[int]]

Alignment from align_display_to_subword.

required

Returns:

Type Description
list[int]

Corresponding subword-token indices.

Raises:

Type Description
IndexError

If any span index is out of range of the alignment.
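Given an alignment of that shape, converting a span amounts to concatenating the subword index lists of its display tokens. The convert function below is a minimal sketch, not the library function; in particular, rejecting negative indices is an assumption of this example.

```python
from __future__ import annotations


def convert(span_indices: list[int], alignment: list[list[int]]) -> list[int]:
    """Flatten the subword indices of every display token in the span."""
    subword_indices: list[int] = []
    for i in span_indices:
        if not 0 <= i < len(alignment):
            raise IndexError(
                f"span index {i} out of range for alignment of length {len(alignment)}"
            )
        subword_indices.extend(alignment[i])
    return subword_indices


example_alignment = [[1, 2, 3, 4], [5], [6]]
converted = convert([0, 2], example_alignment)
print(converted)  # [1, 2, 3, 4, 6]
```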